
# LCT: Long Context Tuning for Video Generation

A scene-level video generation framework enhancing multi-shot coherence through extended context windows.
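
The core idea is that attention extends across shot boundaries: all shots' tokens are concatenated into one scene-level sequence and attended over jointly. A minimal single-head sketch of that idea (random projections stand in for learned weights; this is an illustration, not LCT's actual MMDiT block):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_scene_attention(shot_tokens, d_k=64):
    """Single-head self-attention over the concatenation of all shots'
    tokens, so every token can attend across shot boundaries.
    shot_tokens: list of (n_i, d) arrays, one per shot."""
    x = np.concatenate(shot_tokens, axis=0)           # (N_total, d)
    d = x.shape[1]
    rng = np.random.default_rng(0)
    # Hypothetical random projections; a real block uses learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))            # (N_total, N_total)
    return attn @ v                                    # context-mixed tokens

# Three "shots" of 4 tokens each -> one 12-token scene-level sequence.
out = full_scene_attention([np.ones((4, 8)), np.zeros((4, 8)), np.ones((4, 8))])
print(out.shape)  # (12, 64)
```

Because the attention matrix spans the whole scene, tokens in one shot can condition directly on every other shot, which is what enables cross-shot consistency at the cost of a longer context window.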

## Objective of LCT Framework

The primary goal of LCT is to bridge the gap between current single-shot video generation capabilities and real-world narrative video production (e.g., films, TV shows) by extending the context window to entire scenes while maintaining visual and dynamic consistency across multiple shots.

## Architectural Innovations in LCT

Key architectural modifications include:

- Long-context MMDiT blocks with full attention mechanisms covering all text and video tokens
- Interleaved 3D Rotary Position Embedding (RoPE) to distinguish between different shots
- Asynchronous timestep strategy supporting both joint denoising and conditional generation

## Training Methodology of LCT

LCT's training involves:

- Simultaneous training on single-shot and scene-level data
- Support for up to 9-shot context windows
- A 3-billion-parameter pre-trained base model
- 135,000 iterations on 128 NVIDIA H800 GPUs
- Subsequent causal fine-tuning for 9,000 iterations
- Output resolution of 480×480 pixels

## Generation Modes in LCT

LCT offers two generation modes:

1. **Joint Denoising**: Simultaneous generation of all shots using bidirectional attention
2. **Autoregressive Generation**: Sequential shot generation using context-causal attention with KV-cache optimization (typically at t=100 to t=500 timesteps)

## Applications of LCT Technology

Practical applications include:

- Minute-long single-shot video extension through autoregressive 10-second segment generation
- Interactive multi-shot development, allowing directors to refine content progressively
- Conditional generation based on identity/environment images
- Narrative video production with maintained visual consistency

## Hardware Specifications for LCT

LCT requires:

- 128 NVIDIA H800 GPUs for training
- No additional parameters beyond the base 3B-parameter model
- Efficient inference through KV-cache optimization in autoregressive mode

## LCT Reference Materials

Primary technical resources include:

- Research paper: [Long Context Tuning for Video Generation](https://arxiv.org/pdf/2503.10589)
- Author homepage: [Yuwei Guo](https://guoyww.github.io/)
- (Note: Project page was inaccessible at time of documentation)

### Citation sources

- [LCT: Long Context Tuning for Video Generation](https://arxiv.org/pdf/2503.10589) - Official URL

Updated: 2025-04-01
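
The interleaved RoPE mentioned above distinguishes shots through their position embeddings. As a simplified 1-D analogue (LCT's actual embedding is a 3D RoPE over video dimensions; the `shot_stride` offset scheme here is a hypothetical stand-in for the interleaving):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply 1-D rotary position embedding to x of shape (n, d), d even."""
    d = x.shape[1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) frequencies
    ang = positions[:, None] * inv_freq[None, :]      # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # rotate each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def interleaved_shot_positions(n_shots, tokens_per_shot, shot_stride=1000):
    """Hypothetical scheme: give each shot a large position offset so the
    embedding distinguishes shots while keeping within-shot order."""
    return np.concatenate([
        s * shot_stride + np.arange(tokens_per_shot)
        for s in range(n_shots)
    ])

pos = interleaved_shot_positions(n_shots=3, tokens_per_shot=4)
rotated = rope_rotate(np.ones((12, 8)), pos)
print(rotated.shape)  # (12, 8)
```

Because RoPE is a pure rotation, it preserves token norms while encoding both the shot index and the within-shot position relationally in the attention dot products.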
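
The autoregressive mode's control flow can be sketched as a loop that caches keys/values from completed shots so earlier shots are never re-encoded. This is only a structural illustration: `denoise_shot` is a hypothetical placeholder for the diffusion sampler, and the identity K/V "projections" stand in for learned ones:

```python
import numpy as np

def generate_scene_autoregressive(n_shots, d, denoise_shot):
    """Sequential shot generation: each new shot attends to the cached
    keys/values of all previously generated shots (context-causal
    attention), then its own K/V are appended to the cache."""
    kv_cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
    shots = []
    for s in range(n_shots):
        shot = denoise_shot(s, kv_cache)      # (tokens, d) denoised tokens
        # Hypothetical identity K/V projections; real blocks use learned ones.
        kv_cache["k"] = np.concatenate([kv_cache["k"], shot])
        kv_cache["v"] = np.concatenate([kv_cache["v"], shot])
        shots.append(shot)
    return shots

# Toy "denoiser": returns constant tokens, ignoring the cache contents.
shots = generate_scene_autoregressive(
    n_shots=3, d=8,
    denoise_shot=lambda s, cache: np.full((4, 8), float(s)))
print(len(shots), shots[-1].shape)  # 3 (4, 8)
```

The same loop structure underlies the minute-long video extension use case: each 10-second segment plays the role of a "shot" conditioned on the cache of everything generated so far.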