# INFP

An audio-driven dual-sided interactive video generation framework developed by ByteDance.
## Definition of INFP
INFP is an audio-driven, dual-sided interactive video generation framework developed by ByteDance. It generates real-time, natural-looking video in which characters respond dynamically to audio input without manual role switching. The framework supports multi-language audio, a singing mode, and non-human avatars, making it well suited to applications such as video conferencing.
## INFP's Operational Stages
1. **Motion-Based Head Imitation**: Projects facial behaviors from real conversations into a low-dimensional motion latent space to animate static images.
2. **Audio-Guided Motion Generation**: Maps dual-person audio inputs to motion latent codes via denoising, enabling audio-driven head movements in interactive scenarios.
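The two stages above can be pictured as a small pipeline: encode facial behavior into a compact motion latent, predict that latent from dual-track audio via iterative denoising, then drive a static portrait with it. INFP has no public code, so every name, shape, and computation below is an illustrative assumption, not the actual implementation.

```python
import numpy as np

# Hypothetical sketch of INFP's two-stage design. All functions are
# stand-ins: the real system uses learned networks, not these placeholders.

MOTION_DIM = 64  # assumed size of the low-dimensional motion latent space


def encode_motion(face_frames: np.ndarray) -> np.ndarray:
    """Stage 1 (imitation): project facial behavior into motion latents.

    Placeholder: flatten each frame and truncate to MOTION_DIM features.
    """
    n_frames = face_frames.shape[0]
    flat = face_frames.reshape(n_frames, -1)
    return flat[:, :MOTION_DIM]


def denoise_audio_to_motion(audio_self: np.ndarray,
                            audio_other: np.ndarray,
                            steps: int = 10) -> np.ndarray:
    """Stage 2 (generation): map dual-person audio to a motion latent
    by iteratively denoising from random noise, conditioned on both tracks."""
    rng = np.random.default_rng(0)
    z = rng.standard_normal(MOTION_DIM)            # start from pure noise
    cond = np.concatenate([audio_self, audio_other])
    for _ in range(steps):
        # Placeholder denoiser: pull the latent toward an audio-derived target.
        target = np.tanh(cond[:MOTION_DIM])
        z = z + 0.3 * (target - z)
    return z


def animate(static_image: np.ndarray, motion_latent: np.ndarray) -> np.ndarray:
    """Render step: drive the static portrait with the predicted motion latent.
    Placeholder for the real image generator."""
    return static_image + motion_latent.mean()


motion = denoise_audio_to_motion(np.ones(MOTION_DIM), np.zeros(MOTION_DIM))
frame = animate(np.zeros((256, 256)), motion)
print(frame.shape, motion.shape)
```

The key structural point the sketch preserves is that stage 2 never touches pixels: it only predicts motion latents from audio, and a separate renderer (trained in stage 1) turns latents into frames.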
## INFP's Performance Metrics
INFP runs at over 40 FPS on an Nvidia Tesla A10 GPU, fast enough for real-time video generation in applications such as instant messaging and live video conferencing.
## INFP's Dataset
INFP introduces **DyConv**, a large-scale dataset of dyadic conversations collected from the internet, with separated audio tracks and annotations for research use.
## Accessibility of INFP
As of this writing, INFP remains a research project with no public code release or usage documentation. Its [official website](https://grisoon.github.io/INFP/) provides demonstrations but no installation instructions, so hands-on accessibility is limited.
## Comparison with DIM
Unlike DIM (which requires manual role assignment), INFP dynamically adapts to conversational states (speaker/listener) based on audio input, eliminating the need for explicit role switching and producing more natural interactions.
## Features of INFP
- **Motion Diversity**: Adapts outputs for the same image based on different audio inputs.
- **Out-of-Distribution Support**: Works with non-human and side-face images.
- **Real-Time Interaction**: Supports agent-agent and human-agent communication at >40 FPS.
- **Multimodal Output**: Generates lip-synced talking heads, expressive listening behaviors, and singing animations.
- **Multi-Language Support**: Processes audio inputs in various languages.
## Applications of INFP
INFP is designed for real-time communication scenarios, such as virtual assistants, AI avatars, video conferencing, and interactive media, where natural, audio-driven character interactions are required.
### Citation sources:
- [INFP](https://grisoon.github.io/INFP) - Official URL
Updated: 2025-04-01