Preprint  ·  NVIDIA  ·  USC  ·  MIT

LongLive-RAG

A General Retrieval-Augmented Framework for Long Video Generation

1 NVIDIA 2 USC 3 MIT
* Work done during an internship at NVIDIA.
Video comparison (two examples): Baseline vs. LongLive-RAG

Same prompt, 30-second generation. Across both examples the baseline drifts and duplicates the subject, while LongLive-RAG keeps the subject and scene consistent across the full horizon.

The one-liner

We reformulate open-ended long video generation as a retrieval-augmented generation (RAG) problem: treat previously generated latents as a dynamic, searchable memory, and let each new block retrieve the history it actually needs, instead of being trapped inside a drifting sliding window.

3 AR backbones improved: Causal-Forcing · Self-Forcing · LongLive
120s generation horizon: evaluated up to 120s
#1 average VBench-Long rank: best across all settings
0× changes to the base generator: fully training-free backbone

01 Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away.

We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce a Window Temporal Delta Loss, which suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes.

Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory.

02 How it works

1. Self-generated latents become memory

As the AR transformer denoises each block, its completed latents are stored in a growing history pool, a searchable archive of everything the model has produced so far, far beyond the recent sliding window.

2. Adaptive selection via latent similarity

A lightweight latent encoder maps the latest latent to a compact 1024-dim query embedding, then retrieves the top-K most relevant historical latents by cosine similarity, adding only minor overhead relative to attention.
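
The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `top_k_by_cosine` is hypothetical, the latent encoder is not modeled, and random vectors stand in for the 1024-dim embeddings.

```python
import numpy as np

def top_k_by_cosine(query, pool, k):
    """Return (indices of the k most similar pool rows, all similarities).

    query: (d,) query embedding; pool: (n, d) stored embeddings.
    Hypothetical helper for illustration only.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    p = pool / (np.linalg.norm(pool, axis=1, keepdims=True) + 1e-8)
    sims = p @ q                       # cosine similarity to every stored entry
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims

# Stand-in data: 200 stored embeddings of dimension 1024 (assumed pool size).
rng = np.random.default_rng(0)
pool = rng.standard_normal((200, 1024)).astype(np.float32)
# A query that is a slightly perturbed copy of entry 17.
query = pool[17] + 0.01 * rng.standard_normal(1024).astype(np.float32)

idx, sims = top_k_by_cosine(query, pool, k=4)
```

For a pool of a few hundred entries, the search is one matrix-vector product and a sort, which is in line with the claim that retrieval is cheap relative to attention.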

3. Implicit error correction

Retrieved latents are concatenated into the attention context as [sink ‖ retrieved ‖ local]. When the recent window drifts off the video manifold, the retrieved history pulls the trajectory back, providing implicit correction at each step while leaving the base generator untouched.
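
A minimal sketch of how a [sink ‖ retrieved ‖ local] context might be assembled is shown here. Several details are assumptions rather than facts from the paper: the helper name `build_context`, the choice to drop retrieved blocks that already lie in the local window, the temporal ordering of retrieved blocks, and the toy token shapes.

```python
import numpy as np

def build_context(sink, history, retrieved_idx, local_window):
    """Concatenate [sink | retrieved | local] along the token axis.

    sink: (s, d) sink tokens; history: list of (tokens, d) arrays, one per
    generated block; retrieved_idx: block indices chosen by retrieval.
    Hypothetical helper: deduplication and ordering are assumed choices.
    """
    local_start = len(history) - local_window
    # Skip blocks already covered by the local window; keep temporal order.
    picked = sorted(int(i) for i in set(retrieved_idx) if i < local_start)
    retrieved = [history[i] for i in picked]
    local = history[max(local_start, 0):]
    context = np.concatenate([sink, *retrieved, *local], axis=0)
    return context, picked

# Toy history: 10 blocks, 2 tokens each, 4 channels; block t is filled with t.
history = [np.full((2, 4), float(t)) for t in range(10)]
sink = np.full((1, 4), -1.0)           # placeholder sink token

context, picked = build_context(sink, history, [7, 2, 5, 9], local_window=3)
```

In this toy run, blocks 7 and 9 are already in the local window (blocks 7 to 9), so only blocks 2 and 5 are added as non-local context. The generator itself is not modified; only the tokens it attends to change.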

Overview of LongLive-RAG. (1) The AR video diffusion transformer produces a generated latent and assembles context tokens. (2) A latent encoder maps the latent to a compact query embedding and searches the retrieval pool. (3) Retrieved latents provide context; when drift enters the trajectory, retrieval provides implicit correction at each step.

Window Temporal Delta Loss

Adjacent video latents are nearly identical, so a reconstruction-only encoder collapses neighbors to the same embedding and top-K retrieval returns near-duplicates. Our delta loss suppresses redundant local similarity while a second-order smoothness term keeps the embedding trajectory stable, yielding a search geometry tuned for non-local context selection, not generic compression.
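
The page does not give the loss formula, so the following is only an illustrative sketch of the two ingredients named above: a penalty on excess similarity between embeddings inside a short temporal window, and a second-order smoothness term on the embedding trajectory. The functional form, the function name, and the `window`, `margin`, and `smooth_weight` parameters are all assumptions.

```python
import numpy as np

def window_delta_loss_sketch(emb, window=2, margin=0.5, smooth_weight=0.1):
    """emb: (T, d) embeddings of consecutive latents, with T > window and T >= 3.

    Illustrative only: this form is assumed, not taken from the paper.
    Returns (total, local_similarity_term, smoothness_term).
    """
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    penalties = []
    for offset in range(1, window + 1):
        # Cosine similarity between e_t and e_{t+offset}.
        cos = np.sum(unit[:-offset] * unit[offset:], axis=1)
        # Penalise only similarity above the margin.
        penalties.append(np.maximum(cos - margin, 0.0))
    local_sim = float(np.concatenate(penalties).mean())
    # Discrete second difference: e_{t+1} - 2 e_t + e_{t-1}.
    second_diff = emb[2:] - 2.0 * emb[1:-1] + emb[:-2]
    smooth = float(np.mean(np.sum(second_diff ** 2, axis=1)))
    return local_sim + smooth_weight * smooth, local_sim, smooth

# Collapsed case: every latent maps to the same embedding.
collapsed = np.tile(np.array([1.0, 0.0]), (6, 1))
# Spread case: embeddings rotate by 90 degrees per step.
angles = np.arange(6) * np.pi / 2
spread = np.stack([np.cos(angles), np.sin(angles)], axis=1)

_, sim_collapsed, smooth_collapsed = window_delta_loss_sketch(collapsed)
_, sim_spread, smooth_spread = window_delta_loss_sketch(spread)
```

In this sketch the similarity term is large for collapsed neighbors and zero once neighbors are spread apart, while the smoothness term grows when the trajectory turns sharply, so the two terms pull in opposite directions and the weight balances them.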

03 Qualitative comparisons

Same prompt, matched horizon. In each column a competing long-video method (the backbone itself, ∞-RoPE, or Deep Forcing) sits above LongLive-RAG. Across all three AR backbones, retrieval keeps subjects and scenes consistent while baselines drift, shift color, or duplicate details.

Causal-Forcing backbone

Video comparisons (competing method above, LongLive-RAG below): Causal-Forcing · ∞-RoPE · Deep Forcing

Self-Forcing backbone

Video comparisons (competing method above, LongLive-RAG below): Self-Forcing · ∞-RoPE · Deep Forcing

LongLive backbone

Video comparisons (competing method above, LongLive-RAG below): LongLive · ∞-RoPE · Deep Forcing

04 Results

VBench-Long evaluation. Each block fixes a base model and compares it with ∞-RoPE (positional extrapolation), Deep Forcing (compressed-history tokens), and LongLive-RAG. Bold marks the best value; Avg. Rank is averaged over six metrics (lower is better).

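Self-Forcing backbone

| Method | Subject Consist. ↑ | Background Consist. ↑ | Motion Smooth. ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Avg. Rank ↓ |
|---|---|---|---|---|---|---|---|
| Self-Forcing | 96.21 | 95.39 | 98.39 | **52.03** | 56.69 | 63.31 | 3.17 |
| + ∞-RoPE | 97.32 | 96.38 | 98.59 | 46.82 | 56.78 | 63.93 | 2.00 |
| + Deep Forcing | 97.04 | 96.02 | 98.57 | 38.85 | 56.44 | 61.91 | 3.50 |
| + LongLive-RAG (Ours) | **97.57** | **96.56** | **98.76** | 42.24 | **57.17** | **65.43** | **1.33** |

LongLive backbone

| Method | Subject Consist. ↑ | Background Consist. ↑ | Motion Smooth. ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Avg. Rank ↓ |
|---|---|---|---|---|---|---|---|
| LongLive | 97.35 | 96.15 | 98.70 | 44.74 | **59.38** | 68.15 | 2.67 |
| + ∞-RoPE | 97.27 | 96.19 | 98.68 | **48.18** | 58.69 | 67.99 | 3.17 |
| + Deep Forcing | 97.52 | **96.43** | **98.82** | 41.46 | 59.00 | 67.61 | 2.50 |
| + LongLive-RAG (Ours) | **97.53** | 96.39 | 98.77 | 44.84 | 59.24 | **68.42** | **1.67** |

Causal-Forcing backbone

| Method | Subject Consist. ↑ | Background Consist. ↑ | Motion Smooth. ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Avg. Rank ↓ |
|---|---|---|---|---|---|---|---|
| Causal-Forcing | 94.60 | 94.68 | 96.56 | 73.96 | 54.58 | 65.53 | 3.00 |
| + ∞-RoPE | 93.93 | 94.11 | 96.21 | **90.83** | 55.42 | 68.26 | 2.33 |
| + Deep Forcing | 93.52 | 93.86 | 95.84 | 84.79 | 55.03 | 66.07 | 3.33 |
| + LongLive-RAG (Ours) | **95.43** | **94.79** | **97.16** | 82.29 | **57.31** | **70.07** | **1.33** |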

Across three AR backbones and 30s / 60s / 120s horizons, LongLive-RAG achieves the best average VBench-Long rank and consistently improves subject consistency, background consistency, motion smoothness, and imaging quality.
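
As a worked example of the Avg. Rank column, the sketch below ranks the four methods of the Self-Forcing block on each of the six metrics (rank 1 is the highest score) and averages the ranks. The ranking convention and the absence of tie handling are assumptions, but under them the computation reproduces the values reported for this block.

```python
# Self-Forcing block of the results table; all six metrics are higher-is-better.
# Column order: subject, background, motion, dynamic, aesthetic, imaging.
scores = {
    "Self-Forcing":   [96.21, 95.39, 98.39, 52.03, 56.69, 63.31],
    "+ ∞-RoPE":       [97.32, 96.38, 98.59, 46.82, 56.78, 63.93],
    "+ Deep Forcing": [97.04, 96.02, 98.57, 38.85, 56.44, 61.91],
    "+ LongLive-RAG": [97.57, 96.56, 98.76, 42.24, 57.17, 65.43],
}

def average_ranks(scores):
    """Average per-metric rank of each method (1 = best). Assumes no ties."""
    methods = list(scores)
    n_metrics = len(next(iter(scores.values())))
    totals = dict.fromkeys(methods, 0)
    for m in range(n_metrics):
        # Sort methods from best to worst on metric m.
        ordered = sorted(methods, key=lambda name: scores[name][m], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: round(totals[name] / n_metrics, 2) for name in methods}

avg_rank = average_ranks(scores)
print(avg_rank)
# {'Self-Forcing': 3.17, '+ ∞-RoPE': 2.0, '+ Deep Forcing': 3.5, '+ LongLive-RAG': 1.33}
```

LongLive-RAG ranks first on five of the six metrics in this block and third on Dynamic Degree, which gives (5 × 1 + 3) / 6 ≈ 1.33.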

05 BibTeX

@article{hu2026longliverag,
  title   = {LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation},
  author  = {Hu, Qixin and Yang, Shuai and Huang, Wei and Chen, Yukang and Han, Song},
  journal = {arXiv preprint arXiv:2606.02553},
  year    = {2026},
  eprint  = {2606.02553},
  archivePrefix = {arXiv}
}