LCT: Long Context Tuning for Video Generation - A scene-level video generation framework enhancing multi-shot coherence through extended context windows.
## Objective of LCT Framework
The primary goal of LCT is to bridge the gap between today's single-shot video generation capabilities and real-world narrative video production (e.g., films, TV shows) by extending the model's context window from individual shots to entire scenes while maintaining visual and dynamic consistency across shots.
## Architectural Innovations in LCT
Key architectural modifications include:
- Long-context MMDiT blocks with full attention mechanisms covering all text and video tokens
- Interleaved 3D Rotary Position Embedding (RoPE) that offsets per-shot positions so attention can distinguish different shots (see the sketch after this list)
- Asynchronous timestep strategy supporting both joint denoising and conditional generation
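A concrete way to picture the interleaved 3D RoPE: every video token gets a (time, height, width) coordinate, and each shot's temporal coordinates are offset by a per-shot stride, so tokens from different shots occupy disjoint temporal ranges while the spatial axes stay aligned. The sketch below is a minimal illustration under these assumptions; the `shot_stride` offset scheme and the `rope_rotate` helper are hypothetical, not the paper's implementation:

```python
import torch

def shot_position_ids(num_shots: int, frames: int, height: int, width: int,
                      shot_stride: int) -> torch.Tensor:
    """Assign each video token a (t, h, w) coordinate for 3D RoPE.

    Tokens of shot s get temporal coordinates offset by s * shot_stride,
    so attention can tell shots apart while shared h/w axes keep spatial
    positions aligned across shots. Illustrative scheme only.
    """
    coords = []
    for s in range(num_shots):
        t = torch.arange(frames) + s * shot_stride  # shot-offset time axis
        grid = torch.stack(torch.meshgrid(
            t, torch.arange(height), torch.arange(width), indexing="ij"), dim=-1)
        coords.append(grid.reshape(-1, 3))          # (frames*height*width, 3)
    return torch.cat(coords, dim=0)                 # all shots, flattened

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embedding, splitting the head dim into 3 groups (t, h, w).

    Assumes the head dimension is divisible by 6 (3 axes, even split each).
    """
    d = x.shape[-1] // 3
    out = []
    for axis in range(3):
        xa = x[..., axis * d:(axis + 1) * d]
        freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
        ang = pos[:, axis:axis + 1].float() * freqs  # (tokens, d/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xa[..., 0::2], xa[..., 1::2]
        out.append(torch.stack([x1 * cos - x2 * sin,
                                x1 * sin + x2 * cos], dim=-1).flatten(-2))
    return torch.cat(out, dim=-1)
```

Because RoPE is parameter-free, a scheme like this adds no weights, which is consistent with the no-additional-parameters note in the hardware section below.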
## Training Methodology of LCT
LCT's training involves:
- Joint training on a mixture of single-shot and scene-level data (a training-step sketch follows this list)
- Support for up to 9-shot context windows
- A 3-billion-parameter pre-trained base model
- 135,000 iterations on 128 NVIDIA H800 GPUs
- Subsequent causal fine-tuning for 9,000 iterations
- Output resolution of 480×480 pixels
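One reading of how the asynchronous timestep strategy and the mixed data interact during training: each shot in a scene draws its own diffusion timestep, so a single objective covers joint denoising (all shots noisy) and conditional generation (near-clean shots serving as context), and single-shot data is simply the one-shot case. The sketch below assumes a rectified-flow-style objective and a hypothetical `model` interface; it is not the released training code:

```python
import torch
import torch.nn.functional as F

def scene_training_step(model, latents, text_emb, num_train_steps=1000):
    """One scene-level step with per-shot (asynchronous) timesteps.

    latents: (batch, shots, tokens, dim) clean video latents, one row per shot.
    Each shot draws an independent timestep; shots that draw t near 0 stay
    almost clean and act as conditioning context for the others. Schedule,
    loss, and model API are all assumptions for illustration.
    """
    b, s, n, d = latents.shape
    t = torch.randint(0, num_train_steps, (b, s))    # independent t per shot
    noise = torch.randn_like(latents)
    alpha = (t.float() / num_train_steps).view(b, s, 1, 1)
    noisy = (1 - alpha) * latents + alpha * noise    # linear-interpolation corruption
    pred = model(noisy, t, text_emb)                 # predicts velocity, assumed
    target = noise - latents                         # rectified-flow velocity target
    return F.mse_loss(pred, target)
```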
## Generation Modes in LCT
LCT offers two generation modes:
1. **Joint Denoising**: Simultaneous generation of all shots using bidirectional attention
2. **Autoregressive Generation**: Sequential shot generation using context-causal attention, with the history KV-cache computed at intermediate timesteps (typically t=100 to t=500) for efficiency (see the loop sketched below)
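A minimal sketch of the autoregressive mode follows. The `model` interface here (`denoise`, `update_cache`, `tokens_per_shot`, `dim`) is hypothetical, and the naive Euler update stands in for whatever sampler is actually used:

```python
import torch

@torch.no_grad()
def generate_scene(model, prompts, num_shots, steps=50):
    """Sequential shot generation with a growing KV-cache (illustrative).

    kv_cache holds keys/values of finished shots so each new shot attends
    to its history without recomputing it. Per the note above, the history
    KV is typically computed at intermediate timesteps (t=100 to t=500)
    rather than at t=0.
    """
    kv_cache = None                                  # history of previous shots
    shots = []
    for i in range(num_shots):
        x = torch.randn(1, model.tokens_per_shot, model.dim)  # fresh noise
        for step in reversed(range(steps)):
            t = torch.full((1,), step * (1000 // steps))
            # the new shot attends to itself and to cached history (context-causal)
            v = model.denoise(x, t, prompts[i], kv_cache=kv_cache)
            x = x - v / steps                        # naive Euler update, assumed
        shots.append(x)
        # append this shot's keys/values to the cache for later shots
        kv_cache = model.update_cache(kv_cache, x, prompts[i])
    return shots
```

Joint denoising is the simpler case: all shots are placed in one sequence and denoised together under bidirectional attention, with no cache needed.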
## Applications of LCT Technology
Practical applications include:
- Extending a single shot to minute-long video by autoregressively generating 10-second segments (a minimal driver loop follows this list)
- Interactive multi-shot development allowing directors to refine content progressively
- Conditional generation based on identity/environment images
- Narrative video production with maintained visual consistency
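For the minute-long extension use case, the same autoregressive machinery can drive a simple segment loop: each iteration generates the next roughly 10-second segment conditioned on the cached history of everything generated so far. `sample_segment` and `update_cache` below are hypothetical stand-ins for the interfaces sketched above:

```python
def extend_shot(model, prompt, total_seconds=60, segment_seconds=10):
    """Grow one long shot segment-by-segment (illustrative driver loop).

    Each iteration generates the next ~10 s segment conditioned on the
    history cache of everything generated so far; the interface is assumed.
    """
    kv_cache, segments = None, []
    for _ in range(total_seconds // segment_seconds):
        seg = model.sample_segment(prompt, kv_cache=kv_cache)  # hypothetical call
        kv_cache = model.update_cache(kv_cache, seg, prompt)
        segments.append(seg)
    return segments
```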
## Hardware Specifications for LCT
LCT's hardware and efficiency profile:
- 128 NVIDIA H800 GPUs for training
- No additional parameters beyond the base 3B-parameter model
- Efficient inference in autoregressive mode via KV-cache optimization
## LCT Reference Materials
Primary technical resources include:
- Research paper: [Long Context Tuning for Video Generation](https://arxiv.org/pdf/2503.10589)
- Author homepage: [Yuwei Guo](https://guoyww.github.io/)
- (Note: Project page was inaccessible at time of documentation)
Updated: 2025-04-01