
# INFP: An Audio-Driven, Dual-Sided Interactive Video Generation Framework by ByteDance

## Definition of INFP

INFP is an audio-driven, dual-sided interactive video generation framework developed by ByteDance. It generates real-time, natural-looking videos in which characters respond dynamically to audio input without manual role switching. The framework supports multi-language audio, a singing mode, and non-human avatars, and is optimized for applications such as video conferencing.

## INFP's Operational Stages

1. **Motion-Based Head Imitation**: Projects facial behaviors from real conversations into a low-dimensional motion latent space, which is used to animate static images.
2. **Audio-Guided Motion Generation**: Maps dual-person audio inputs to motion latent codes via denoising, enabling audio-driven head movements in interactive scenarios.

A hedged sketch of how these two stages could compose appears at the end of this page.

## INFP's Performance Metrics

INFP runs at over 40 FPS on Nvidia Tesla A10 GPUs, a per-frame budget of under 25 ms, enabling real-time video generation for applications like instant messaging and live video conferencing.

## INFP's Dataset

INFP introduces **DyConv**, a large-scale dataset of dyadic conversations collected from the internet, which includes separated audio tracks and annotations for research.

## Accessibility of INFP

As of this writing, INFP appears to be research-oriented, with no public code or detailed usage guidelines. Its [official website](https://grisoon.github.io/INFP/) provides demonstrations but lacks installation instructions, suggesting limited accessibility.

## Comparison with DIM

Unlike DIM, which requires manual role assignment, INFP dynamically adapts to conversational states (speaker or listener) based on audio input, eliminating explicit role switching and producing more natural interactions. A toy illustration of this behavior appears at the end of this page.

## Features of INFP

- **Motion Diversity**: Produces different outputs for the same image depending on the audio input.
- **Out-of-Distribution Support**: Works with non-human and side-face images.
- **Real-Time Interaction**: Supports agent-agent and human-agent communication at over 40 FPS.
- **Multimodal Output**: Generates lip-synced talking heads, expressive listening behaviors, and singing animations.
- **Multi-Language Support**: Processes audio inputs in various languages.

## Applications of INFP

INFP is designed for real-time communication scenarios, such as virtual assistants, AI avatars, video conferencing, and interactive media, where natural, audio-driven character interactions are required.

### Citation sources

- [INFP](https://grisoon.github.io/INFP) - Official URL

Updated: 2025-04-01
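## Hedged Sketch: INFP's Two-Stage Pipeline

Since INFP has no public code, the sketch below only illustrates how the two described stages could fit together. The module names (`MotionEncoder`, `AudioToMotionDenoiser`), the feature dimensions, and the simple denoising loop are all assumptions for illustration, not INFP's actual architecture.

```python
# Hypothetical sketch of INFP's two-stage design. All names, shapes, and
# the simplified denoising loop are assumptions; no official code exists.
import torch
import torch.nn as nn


class MotionEncoder(nn.Module):
    """Stage 1 stand-in: project per-frame facial behavior into a
    low-dimensional motion latent space (here, 64 dimensions)."""

    def __init__(self, face_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(face_dim, latent_dim)

    def forward(self, face_features: torch.Tensor) -> torch.Tensor:
        return self.proj(face_features)


class AudioToMotionDenoiser(nn.Module):
    """Stage 2 stand-in: map dual-person audio features and a noisy
    motion latent to a cleaner motion latent (one denoising step)."""

    def __init__(self, audio_dim: int = 128, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * audio_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, own_audio, partner_audio, noisy_latent):
        x = torch.cat([own_audio, partner_audio, noisy_latent], dim=-1)
        return self.net(x)


# Stage 1: compress observed facial behavior into a motion latent.
encoder = MotionEncoder()
ref_latent = encoder(torch.randn(1, 512))  # features of one observed frame

# Stage 2: drive a motion latent from both speakers' audio, starting from noise.
denoiser = AudioToMotionDenoiser()
own_audio = torch.randn(1, 128)      # agent's own audio features
partner_audio = torch.randn(1, 128)  # conversational partner's audio features
latent = torch.randn(1, 64)
for _ in range(10):                  # a few illustrative denoising steps
    latent = denoiser(own_audio, partner_audio, latent)

# In the real system, the motion latent would drive a renderer that
# animates a static portrait image; that component is omitted here.
print(latent.shape)  # torch.Size([1, 64])
```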
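## Hedged Sketch: The 40 FPS Frame Budget

The reported throughput translates directly into a latency budget: at 40 FPS, each frame (inference plus rendering) must finish within 1/40 s = 25 ms. The snippet below only demonstrates the arithmetic with a placeholder workload; it does not benchmark INFP.

```python
# Real-time budget check: at 40 FPS, each frame must complete in under
# 1/40 s = 25 ms. The workload here is a stand-in, not INFP's model.
import time

FRAME_BUDGET_S = 1.0 / 40.0  # 25 ms per frame at 40 FPS


def generate_frame() -> None:
    time.sleep(0.005)  # placeholder for model inference + rendering


start = time.perf_counter()
generate_frame()
elapsed = time.perf_counter() - start
print(f"frame took {elapsed * 1000:.1f} ms "
      f"({'within' if elapsed < FRAME_BUDGET_S else 'over'} the 25 ms budget)")
```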
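## Hedged Sketch: Audio-Driven Role Adaptation

The contrast with DIM can be made concrete with a toy rule that labels the agent speaker or listener per audio window from the relative energy of the two tracks, with no manual switch. This is purely illustrative: INFP learns such behavior from dual-track audio rather than using an energy threshold, and `conversational_state` is an invented helper.

```python
# Toy illustration (not INFP's actual mechanism): infer speaker/listener
# state per window from the relative energy of two audio tracks.
import numpy as np


def conversational_state(own: np.ndarray, partner: np.ndarray,
                         win: int = 1600) -> list[str]:
    """Label each `win`-sample window 'speaker' or 'listener' for the
    agent, based on which track carries more energy."""
    states = []
    for start in range(0, min(len(own), len(partner)) - win + 1, win):
        e_own = float(np.mean(own[start:start + win] ** 2))
        e_partner = float(np.mean(partner[start:start + win] ** 2))
        states.append("speaker" if e_own > e_partner else "listener")
    return states


# Two seconds of 16 kHz audio: the agent talks first, then listens.
t = np.arange(16000) / 16000.0
own = np.concatenate([np.sin(2 * np.pi * 220 * t),
                      0.01 * np.random.randn(16000)])
partner = np.concatenate([0.01 * np.random.randn(16000),
                          np.sin(2 * np.pi * 220 * t)])
print(conversational_state(own, partner))  # 10x 'speaker', then 10x 'listener'
```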