
# LCT: Long Context Tuning for Video Generation

A scene-level video generation framework enhancing multi-shot coherence through extended context windows.
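
The core idea is that attention extends across shot boundaries: all shots' tokens are concatenated into one scene-level sequence and attended over jointly. A minimal single-head sketch of that idea (random projections stand in for learned weights; this is an illustration, not LCT's actual MMDiT block):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_scene_attention(shot_tokens, d_k=64):
    """Single-head self-attention over the concatenation of all shots'
    tokens, so every token can attend across shot boundaries.
    shot_tokens: list of (n_i, d) arrays, one per shot."""
    x = np.concatenate(shot_tokens, axis=0)           # (N_total, d)
    d = x.shape[1]
    rng = np.random.default_rng(0)
    # Hypothetical random projections; a real block uses learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))            # (N_total, N_total)
    return attn @ v                                    # context-mixed tokens

# Three "shots" of 4 tokens each -> one 12-token scene-level sequence.
out = full_scene_attention([np.ones((4, 8)), np.zeros((4, 8)), np.ones((4, 8))])
print(out.shape)  # (12, 64)
```

Because the attention matrix spans the whole scene, tokens in one shot can condition directly on every other shot, which is what enables cross-shot consistency at the cost of a longer context window.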

## Objective of LCT Framework

The primary goal of LCT is to bridge the gap between current single-shot video generation capabilities and real-world narrative video production (e.g., films, TV shows) by extending the context window to entire scenes while maintaining visual and dynamic consistency across multiple shots.

## Architectural Innovations in LCT

Key architectural modifications include:

- Long-context MMDiT blocks with full attention mechanisms covering all text and video tokens
- Interleaved 3D Rotary Position Embedding (RoPE) to distinguish between different shots
- Asynchronous timestep strategy supporting both joint denoising and conditional generation

## Training Methodology of LCT

LCT's training involves:

- Simultaneous training on single-shot and scene-level data
- Support for up to 9-shot context windows
- A 3-billion-parameter pre-trained base model
- 135,000 iterations on 128 NVIDIA H800 GPUs
- Subsequent causal fine-tuning for 9,000 iterations
- Output resolution of 480×480 pixels

## Generation Modes in LCT

LCT offers two generation modes:

1. **Joint Denoising**: Simultaneous generation of all shots using bidirectional attention
2. **Autoregressive Generation**: Sequential shot generation using context-causal attention with KV-cache optimization (typically at t=100 to t=500 timesteps)

## Applications of LCT Technology

Practical applications include:

- Minute-long single-shot video extension through autoregressive 10-second segment generation
- Interactive multi-shot development, allowing directors to refine content progressively
- Conditional generation based on identity/environment images
- Narrative video production with maintained visual consistency

## Hardware Specifications for LCT

LCT requires:

- 128 NVIDIA H800 GPUs for training
- No additional parameters beyond the base 3B-parameter model
- Efficient inference through KV-cache optimization in autoregressive mode

## LCT Reference Materials

Primary technical resources include:

- Research paper: [Long Context Tuning for Video Generation](https://arxiv.org/pdf/2503.10589)
- Author homepage: [Yuwei Guo](https://guoyww.github.io/)
- (Note: Project page was inaccessible at time of documentation)

### Citation sources

- [LCT: Long Context Tuning for Video Generation](https://arxiv.org/pdf/2503.10589) - Official URL

Updated: 2025-04-01
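
The interleaved RoPE mentioned above distinguishes shots through their position embeddings. As a simplified 1-D analogue (LCT's actual embedding is a 3D RoPE over video dimensions; the `shot_stride` offset scheme here is a hypothetical stand-in for the interleaving):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply 1-D rotary position embedding to x of shape (n, d), d even."""
    d = x.shape[1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) frequencies
    ang = positions[:, None] * inv_freq[None, :]      # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # rotate each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def interleaved_shot_positions(n_shots, tokens_per_shot, shot_stride=1000):
    """Hypothetical scheme: give each shot a large position offset so the
    embedding distinguishes shots while keeping within-shot order."""
    return np.concatenate([
        s * shot_stride + np.arange(tokens_per_shot)
        for s in range(n_shots)
    ])

pos = interleaved_shot_positions(n_shots=3, tokens_per_shot=4)
rotated = rope_rotate(np.ones((12, 8)), pos)
print(rotated.shape)  # (12, 8)
```

Because RoPE is a pure rotation, it preserves token norms while encoding both the shot index and the within-shot position relationally in the attention dot products.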
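
The autoregressive mode's control flow can be sketched as a loop that caches keys/values from completed shots so earlier shots are never re-encoded. This is only a structural illustration: `denoise_shot` is a hypothetical placeholder for the diffusion sampler, and the identity K/V "projections" stand in for learned ones:

```python
import numpy as np

def generate_scene_autoregressive(n_shots, d, denoise_shot):
    """Sequential shot generation: each new shot attends to the cached
    keys/values of all previously generated shots (context-causal
    attention), then its own K/V are appended to the cache."""
    kv_cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
    shots = []
    for s in range(n_shots):
        shot = denoise_shot(s, kv_cache)      # (tokens, d) denoised tokens
        # Hypothetical identity K/V projections; real blocks use learned ones.
        kv_cache["k"] = np.concatenate([kv_cache["k"], shot])
        kv_cache["v"] = np.concatenate([kv_cache["v"], shot])
        shots.append(shot)
    return shots

# Toy "denoiser": returns constant tokens, ignoring the cache contents.
shots = generate_scene_autoregressive(
    n_shots=3, d=8,
    denoise_shot=lambda s, cache: np.full((4, 8), float(s)))
print(len(shots), shots[-1].shape)  # 3 (4, 8)
```

The same loop structure underlies the minute-long video extension use case: each 10-second segment plays the role of a "shot" conditioned on the cache of everything generated so far.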