We reformulate open-ended long video generation as a retrieval-augmented generation (RAG) problem: treat previously generated latents as a dynamic, searchable memory, and let each new block retrieve the history it actually needs, instead of being trapped inside a drifting sliding window.
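Backbones: Causal-Forcing · Self-Forcing · LongLive
Horizon: evaluated up to 120s
Best average VBench-Long rank across all settings
Training-free with respect to the backbone: the base generator is left untouched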
01 Abstract
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away.
We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes.
Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory.
02 How it works
Self-generated latents become memory
As the AR transformer denoises each block, its completed latents are stored in a growing history pool: a searchable archive of everything the model has produced so far, reaching well beyond the recent sliding window.
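As a rough illustration of the history pool described above, the sketch below keeps an append-only list of completed block latents alongside their embeddings. The class name `HistoryPool`, its fields, and the `outside_window` helper are hypothetical names chosen for this example; latents are stood in for by plain lists of floats, and nothing here is taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HistoryPool:
    """Append-only store of completed block latents and their embeddings.

    Hypothetical structure for illustration; real latents would be tensors.
    """
    latents: List[List[float]] = field(default_factory=list)
    embeddings: List[List[float]] = field(default_factory=list)

    def append(self, latent, embedding) -> int:
        """Store one completed block and return its index in the pool."""
        self.latents.append(list(latent))
        self.embeddings.append(list(embedding))
        return len(self.latents) - 1

    def outside_window(self, window: int) -> List[int]:
        """Indices of blocks older than the most recent `window` blocks."""
        return list(range(max(0, len(self.latents) - window)))


pool = HistoryPool()
for step in range(5):
    pool.append([float(step)], [1.0, float(step)])

older_blocks = pool.outside_window(2)  # blocks a sliding window of 2 no longer sees
```

Only the indices outside the recent window are candidates for retrieval in this sketch, since the local window is already part of the attention context.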
Adaptive selection via latent similarity
A lightweight latent encoder maps the latest latent to a compact 1024-dim query embedding, then retrieves the top-K most relevant historical latents by cosine similarity, adding only minor overhead relative to attention.
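The selection step above amounts to a top-K search by cosine similarity. The following is a minimal sketch under simplifying assumptions: embeddings are plain Python lists (2-dimensional here rather than 1024-dimensional), the search is a linear scan, and the function names `cosine` and `retrieve_top_k` are made up for this example. The latent encoder that would produce the query is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, history, k):
    """Return indices of the k history embeddings most similar to `query`.

    Ties are broken in favor of the earlier index (an arbitrary choice).
    """
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(history)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
top2 = retrieve_top_k(query, history, k=2)
```

A linear scan costs one dot product per stored block, which is consistent with the claim that retrieval is cheap relative to attention, although the page does not state how the search is actually implemented.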
Implicit error correction
Retrieved latents are concatenated into the attention context in the order [sink ‖ retrieved ‖ local]. When the recent window drifts off the video manifold, retrieval pulls the trajectory back, correcting errors at every step while the base generator stays untouched.
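To make the [sink ‖ retrieved ‖ local] layout concrete, here is a small sketch that assembles the block indices attended to at one step. The function `build_context`, its parameters, the choice to sort retrieved blocks by time, and the de-duplication against the sink and local window are all assumptions made for this example; the page does not specify these details.

```python
def build_context(num_blocks, sink, window, retrieved):
    """Return block indices ordered as [sink | retrieved | local].

    num_blocks: number of blocks generated so far
    sink:       number of earliest blocks always kept (attention sink)
    window:     number of most recent blocks kept (local window)
    retrieved:  indices proposed by the retrieval step
    """
    sink_ids = list(range(min(sink, num_blocks)))
    sink_set = set(sink_ids)
    # Recent blocks, excluding any that are already covered by the sink.
    local_ids = [i for i in range(max(0, num_blocks - window), num_blocks)
                 if i not in sink_set]
    taken = sink_set | set(local_ids)
    # Keep valid, non-duplicate retrieved blocks in temporal order (assumption).
    retrieved_ids = sorted(i for i in set(retrieved)
                           if 0 <= i < num_blocks and i not in taken)
    return sink_ids + retrieved_ids + local_ids


context = build_context(num_blocks=20, sink=1, window=3, retrieved=[12, 7, 18, 0])
```

In this example blocks 18 and 0 are dropped from the retrieved set because the local window and the sink already cover them, so the context length stays bounded by `sink + K + window`.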
Window Temporal Delta Loss
Adjacent video latents are nearly identical, so a reconstruction-only encoder collapses neighbors to the same embedding and top-K retrieval returns near-duplicates. Our delta loss suppresses redundant local similarity while a second-order smoothness term keeps the embedding trajectory stable, yielding a search geometry tuned for non-local context selection, not generic compression.
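The page does not give the formula for the Window Temporal Delta Loss, so the following is only a toy objective with the two ingredients named above: a term that penalizes similarity between embeddings inside a short temporal window, and a second-order smoothness term on the embedding trajectory. The function name `toy_delta_loss`, the use of mean cosine similarity, the second-difference penalty, and the default weights are all assumptions for illustration, not the authors' loss.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def toy_delta_loss(embs, window=2, smooth_weight=0.1):
    """Illustrative loss over time-ordered embeddings (NOT the paper's formula).

    redundancy: mean cosine similarity between embeddings at most `window`
                steps apart; high when neighbors collapse together.
    smoothness: mean squared second difference e[t-1] - 2 e[t] + e[t+1];
                high when the trajectory changes direction abruptly.
    """
    n = len(embs)
    sims = [_cosine(embs[i], embs[j])
            for i in range(n)
            for j in range(i + 1, min(n, i + window + 1))]
    redundancy = sum(sims) / len(sims) if sims else 0.0

    second_diffs = [
        sum((a - 2.0 * b + c) ** 2
            for a, b, c in zip(embs[t - 1], embs[t], embs[t + 1]))
        for t in range(1, n - 1)
    ]
    smoothness = sum(second_diffs) / len(second_diffs) if second_diffs else 0.0
    return redundancy + smooth_weight * smoothness


collapsed = [[1.0, 0.0]] * 4  # every neighbor maps to the same embedding
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

loss_collapsed = toy_delta_loss(collapsed, window=1)
loss_spread = toy_delta_loss(spread, window=1)
```

The collapsed sequence is penalized by the redundancy term alone, while the spread sequence pays only the smoothness cost, which shows how the two terms pull in opposite directions and need to be balanced.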
03 Qualitative comparisons
Same prompt, matched horizon. In each column a competing long-video method (the backbone itself, ∞-RoPE, or Deep Forcing) sits above LongLive-RAG. Across all three AR backbones, retrieval keeps subjects and scenes consistent while baselines drift, shift color, or duplicate details.
[Video comparison panels: Causal-Forcing backbone · Self-Forcing backbone · LongLive backbone]
04 Results
VBench-Long evaluation. Each block fixes a base model and compares it with ∞-RoPE (positional extrapolation), Deep Forcing (compressed-history tokens), and LongLive-RAG; the three backbone blocks repeat once per evaluated horizon. Bold marks the best value; Avg. Rank is each method's rank within its block, averaged over the six metrics (lower is better).
| Method | Subject Consist. ↑ | Background Consist. ↑ | Motion Smooth. ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Avg. Rank ↓ |
|---|---|---|---|---|---|---|---|
| Self-Forcing backbone | |||||||
| Self-Forcing | 96.21 | 95.39 | 98.39 | 52.03 | 56.69 | 63.31 | 3.17 |
| + ∞-RoPE | 97.32 | 96.38 | 98.59 | 46.82 | 56.78 | 63.93 | 2.00 |
| + Deep Forcing | 97.04 | 96.02 | 98.57 | 38.85 | 56.44 | 61.91 | 3.50 |
| + LongLive-RAG (Ours) | 97.57 | 96.56 | 98.76 | 42.24 | 57.17 | 65.43 | 1.33 |
| LongLive backbone | |||||||
| LongLive | 97.35 | 96.15 | 98.70 | 44.74 | 59.38 | 68.15 | 2.67 |
| + ∞-RoPE | 97.27 | 96.19 | 98.68 | 48.18 | 58.69 | 67.99 | 3.17 |
| + Deep Forcing | 97.52 | 96.43 | 98.82 | 41.46 | 59.00 | 67.61 | 2.50 |
| + LongLive-RAG (Ours) | 97.53 | 96.39 | 98.77 | 44.84 | 59.24 | 68.42 | 1.67 |
| Causal-Forcing backbone | |||||||
| Causal-Forcing | 94.60 | 94.68 | 96.56 | 73.96 | 54.58 | 65.53 | 3.00 |
| + ∞-RoPE | 93.93 | 94.11 | 96.21 | 90.83 | 55.42 | 68.26 | 2.33 |
| + Deep Forcing | 93.52 | 93.86 | 95.84 | 84.79 | 55.03 | 66.07 | 3.33 |
| + LongLive-RAG (Ours) | 95.43 | 94.79 | 97.16 | 82.29 | 57.31 | 70.07 | 1.33 |
| Self-Forcing backbone | |||||||
| Self-Forcing | 95.84 | 95.27 | 98.20 | 51.72 | 56.05 | 62.22 | 3.33 |
| + ∞-RoPE | 97.24 | 96.24 | 98.58 | 46.64 | 56.09 | 63.28 | 2.17 |
| + Deep Forcing | 96.08 | 95.38 | 98.24 | 41.44 | 56.68 | 60.81 | 3.17 |
| + LongLive-RAG (Ours) | 97.60 | 96.51 | 98.70 | 44.69 | 57.19 | 64.97 | 1.33 |
| LongLive backbone | |||||||
| LongLive | 97.13 | 95.89 | 98.61 | 44.56 | 58.17 | 67.56 | 2.83 |
| + ∞-RoPE | 97.00 | 95.85 | 98.53 | 53.36 | 57.48 | 66.94 | 3.33 |
| + Deep Forcing | 97.17 | 96.04 | 98.73 | 45.13 | 57.48 | 67.27 | 2.50 |
| + LongLive-RAG (Ours) | 97.32 | 96.08 | 98.62 | 49.90 | 58.30 | 67.79 | 1.33 |
| Causal-Forcing backbone | |||||||
| Causal-Forcing | 93.52 | 94.12 | 95.74 | 72.32 | 51.24 | 62.30 | 3.83 |
| + ∞-RoPE | 93.81 | 93.78 | 96.09 | 92.47 | 54.42 | 67.50 | 2.50 |
| + Deep Forcing | 94.27 | 94.18 | 96.62 | 78.59 | 52.12 | 64.25 | 2.33 |
| + LongLive-RAG (Ours) | 94.29 | 94.24 | 96.48 | 88.20 | 54.95 | 68.16 | 1.33 |
| Self-Forcing backbone | |||||||
| Self-Forcing | 96.12 | 95.32 | 98.27 | 43.39 | 55.64 | 61.57 | 3.33 |
| + ∞-RoPE | 97.15 | 96.09 | 98.55 | 46.29 | 55.11 | 61.81 | 2.33 |
| + Deep Forcing | 96.92 | 96.83 | 98.97 | 15.23 | 52.84 | 57.93 | 2.83 |
| + LongLive-RAG (Ours) | 97.64 | 96.40 | 98.75 | 44.10 | 56.30 | 64.16 | 1.50 |
| LongLive backbone | |||||||
| LongLive | 96.93 | 95.64 | 98.58 | 47.12 | 57.90 | 66.95 | 2.50 |
| + ∞-RoPE | 96.81 | 95.65 | 98.48 | 53.59 | 56.73 | 66.19 | 3.17 |
| + Deep Forcing | 97.17 | 95.72 | 98.57 | 46.03 | 56.98 | 66.03 | 3.00 |
| + LongLive-RAG (Ours) | 97.22 | 95.88 | 98.62 | 50.25 | 57.57 | 66.95 | 1.33 |
| Causal-Forcing backbone | |||||||
| Causal-Forcing | 92.98 | 94.66 | 95.41 | 63.79 | 47.31 | 58.23 | 3.33 |
| + ∞-RoPE | 93.45 | 93.56 | 95.95 | 93.44 | 53.47 | 66.81 | 2.17 |
| + Deep Forcing | 93.19 | 94.04 | 95.60 | 76.45 | 46.86 | 61.58 | 3.17 |
| + LongLive-RAG (Ours) | 94.38 | 94.08 | 96.56 | 90.21 | 54.82 | 68.23 | 1.33 |
Across three AR backbones and 30s / 60s / 120s horizons, LongLive-RAG achieves the best average VBench-Long rank in every block and improves subject consistency, background consistency, motion smoothness, and imaging quality over the base model in nearly every setting.
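The Avg. Rank column can be reproduced from the six metric columns. The sketch below does this for the first Self-Forcing block of the table, ranking the four methods on each metric (higher is better for all six) and averaging the ranks. The function name `average_ranks` and the tie rule (a method's rank is one plus the number of strictly better methods) are assumptions for this example; this block contains no ties, so the tie rule does not affect the result here.

```python
# Metric order: Subject Consist., Background Consist., Motion Smooth.,
# Dynamic Degree, Aesthetic Quality, Imaging Quality (all higher-is-better).
rows = {
    "Self-Forcing":   [96.21, 95.39, 98.39, 52.03, 56.69, 63.31],
    "+ ∞-RoPE":       [97.32, 96.38, 98.59, 46.82, 56.78, 63.93],
    "+ Deep Forcing": [97.04, 96.02, 98.57, 38.85, 56.44, 61.91],
    "+ LongLive-RAG": [97.57, 96.56, 98.76, 42.24, 57.17, 65.43],
}


def average_ranks(rows):
    """Average per-metric rank of each method (1 = best), rounded to 2 places."""
    names = list(rows)
    n_metrics = len(next(iter(rows.values())))
    totals = {name: 0 for name in names}
    for m in range(n_metrics):
        for name in names:
            # Rank = 1 + number of methods with a strictly higher score.
            better = sum(1 for other in names if rows[other][m] > rows[name][m])
            totals[name] += 1 + better
    return {name: round(totals[name] / n_metrics, 2) for name in names}


avg_rank = average_ranks(rows)
# avg_rank == {"Self-Forcing": 3.17, "+ ∞-RoPE": 2.0,
#              "+ Deep Forcing": 3.5, "+ LongLive-RAG": 1.33}
```

These values match the Avg. Rank entries reported for that block (3.17, 2.00, 3.50, 1.33), with LongLive-RAG ranked first on five of the six metrics and third on Dynamic Degree.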
05 BibTeX
@article{hu2026longliverag,
title = {LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation},
author = {Hu, Qixin and Yang, Shuai and Huang, Wei and Chen, Yukang and Han, Song},
journal = {arXiv preprint arXiv:2606.02553},
year = {2026},
eprint = {2606.02553},
archivePrefix = {arXiv}
}