LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

The one-liner

We reformulate open-ended long video generation as a retrieval-augmented generation (RAG) problem: treat previously generated latents as a dynamic, searchable memory, and let each new block retrieve the history it actually needs, instead of being trapped inside a drifting sliding window.

3 AR backbones improved: Causal-Forcing · Self-Forcing · LongLive

Generation horizon: evaluated up to 120s

Best average VBench-Long rank across all settings

0 changes to the base generator: fully training-free backbone

01 Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away.

We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss, which suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes.

Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory.

02 How it works

1 Self-generated latents become memory

As the AR transformer denoises each block, its completed latents are stored in a growing history pool, a searchable archive of everything the model has produced so far, far beyond the recent sliding window.
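
A minimal sketch of such a history pool, assuming each completed block contributes one search key and one latent payload. The class name `HistoryPool`, its fields, and the use of plain Python lists in place of tensors are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryPool:
    """Append-only store of completed block latents (hypothetical structure)."""
    embeddings: list = field(default_factory=list)  # one search key per block
    latents: list = field(default_factory=list)     # the stored latent payloads

    def add(self, embedding, latent):
        """Store a finished block and return its block index."""
        self.embeddings.append(embedding)
        self.latents.append(latent)
        return len(self.latents) - 1

    def __len__(self):
        return len(self.latents)


pool = HistoryPool()
for block_id in range(5):
    # Toy stand-ins: a 2-dim key and a string payload per block.
    pool.add([float(block_id), 1.0], f"latent-{block_id}")
```

The pool only grows, so a block that has slid out of the recent window stays addressable by its index.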

2 Adaptive selection via latent similarity

A lightweight latent encoder maps the latest latent to a compact 1024-dim query embedding, then retrieves the top-K most relevant historical latents by cosine similarity, adding only minor overhead relative to attention.
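
The top-K cosine lookup itself can be sketched in a few lines. The function names `cosine` and `retrieve_top_k` are hypothetical, the latent encoder is omitted, and toy 3-dim vectors stand in for the 1024-dim embeddings; a real system would likely run this as a batched matrix product on the GPU.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query, keys, k):
    """Return indices of the k keys most similar to the query, best first."""
    order = sorted(range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True)
    return order[:k]


# Toy 3-dim embeddings standing in for the 1024-dim ones.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]
top = retrieve_top_k(query, keys, k=2)  # indices of the two closest keys
```

Here the query matches key 0 exactly and key 2 closely, so those two indices are returned while the orthogonal keys are ignored.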

3 Implicit error correction

Retrieved latents are concatenated into the attention context as [sink ‖ retrieved ‖ local]. When the recent window drifts off the video manifold, retrieval pulls the trajectory back, correcting errors at every step while leaving the base generator untouched.
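
A small sketch of how the [sink ‖ retrieved ‖ local] context might be assembled. The helper names, the choice of the first block as the sink, the three-block local window, and the temporal ordering of retrieved blocks are all assumptions made for illustration.

```python
def select_retrieved(pool_latents, top_indices):
    """Gather retrieved latents in temporal order (an assumed ordering)."""
    return [pool_latents[i] for i in sorted(top_indices)]


def build_context(sink, retrieved, local):
    """Concatenate token groups in the order [sink | retrieved | local]."""
    return list(sink) + list(retrieved) + list(local)


pool_latents = [f"z{i}" for i in range(10)]  # toy stand-ins for block latents
sink = pool_latents[:1]      # assumed: the first block acts as the attention sink
local = pool_latents[-3:]    # assumed: a recent window of three blocks
retrieved = select_retrieved(pool_latents, [5, 2])
context = build_context(sink, retrieved, local)
```

Because only the attention context changes, the generator weights are never touched in this sketch, which matches the training-free framing of the base generator.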

Overview of LongLive-RAG. (1) The AR video diffusion transformer produces a generated latent and assembles context tokens. (2) A latent encoder maps the latent to a compact query embedding and searches the retrieval pool. (3) Retrieved latents provide context; when drift enters the trajectory, retrieval provides implicit correction at each step.

Δ

Window Temporal Delta Loss

Adjacent video latents are nearly identical, so a reconstruction-only encoder collapses neighbors to the same embedding and top-K retrieval returns near-duplicates. Our delta loss suppresses redundant local similarity while a second-order smoothness term keeps the embedding trajectory stable, yielding a search geometry tuned for non-local context selection, not generic compression.
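
The exact form of the loss is not given on this page, so the following is only one plausible reading of the description: a penalty on positive cosine similarity between embeddings inside a temporal window, plus a weighted second-difference smoothness term. The function names, the clamp at zero, the window size, and the weight `smooth_weight` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def local_similarity_penalty(embs, window):
    """Mean clamped cosine similarity between each embedding and its next `window` neighbors."""
    sims = [max(0.0, cosine(embs[t], embs[t + j]))
            for t in range(len(embs))
            for j in range(1, window + 1)
            if t + j < len(embs)]
    return sum(sims) / len(sims) if sims else 0.0


def second_order_smoothness(embs):
    """Mean squared second difference e[t+1] - 2*e[t] + e[t-1] along the trajectory."""
    terms = [sum((c - 2 * b + a) ** 2 for a, b, c in zip(embs[t - 1], embs[t], embs[t + 1]))
             for t in range(1, len(embs) - 1)]
    return sum(terms) / len(terms) if terms else 0.0


def delta_loss_sketch(embs, window=2, smooth_weight=0.1):
    """Assumed combination: local redundancy penalty plus weighted smoothness."""
    return local_similarity_penalty(embs, window) + smooth_weight * second_order_smoothness(embs)


collapsed = [[1.0, 0.0]] * 4  # every neighbor shares one embedding
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # neighbors differ
```

Under this sketch the collapsed trajectory scores worse than the spread one, which is the behavior the paragraph above asks of the loss: near-duplicate neighbors are discouraged, while the smoothness term limits how abruptly the embedding may move.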

03 Qualitative comparisons

Same prompt, matched horizon. In each column a competing long-video method (the backbone itself, ∞-RoPE, or Deep Forcing) sits above LongLive-RAG. Across all three AR backbones, retrieval keeps subjects and scenes consistent while baselines drift, shift color, or duplicate details.

[Video grid: for each of the Causal-Forcing, Self-Forcing, and LongLive backbones, three paired clips compare the backbone itself, ∞-RoPE, and Deep Forcing against LongLive-RAG.]

04 Results

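VBench-Long evaluation. Each block fixes a base model and compares it with ∞-RoPE (positional extrapolation), Deep Forcing (compressed-history tokens), and LongLive-RAG. The three backbone blocks appear three times, once for each evaluated horizon (30s, 60s, and 120s). Bold marks the best value; Avg. Rank is the mean of a method's per-metric ranks over the six metrics (lower is better).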

Method	Subject Consist. ↑	Background Consist. ↑	Motion Smooth. ↑	Dynamic Degree ↑	Aesthetic Quality ↑	Imaging Quality ↑	Avg. Rank ↓
Self-Forcing backbone
Self-Forcing	96.21	95.39	98.39	52.03	56.69	63.31	3.17
+ ∞-RoPE	97.32	96.38	98.59	46.82	56.78	63.93	2.00
+ Deep Forcing	97.04	96.02	98.57	38.85	56.44	61.91	3.50
+ LongLive-RAG Ours	97.57	96.56	98.76	42.24	57.17	65.43	1.33
LongLive backbone
LongLive	97.35	96.15	98.70	44.74	59.38	68.15	2.67
+ ∞-RoPE	97.27	96.19	98.68	48.18	58.69	67.99	3.17
+ Deep Forcing	97.52	96.43	98.82	41.46	59.00	67.61	2.50
+ LongLive-RAG Ours	97.53	96.39	98.77	44.84	59.24	68.42	1.67
Causal-Forcing backbone
Causal-Forcing	94.60	94.68	96.56	73.96	54.58	65.53	3.00
+ ∞-RoPE	93.93	94.11	96.21	90.83	55.42	68.26	2.33
+ Deep Forcing	93.52	93.86	95.84	84.79	55.03	66.07	3.33
+ LongLive-RAG Ours	95.43	94.79	97.16	82.29	57.31	70.07	1.33
Self-Forcing backbone
Self-Forcing	95.84	95.27	98.20	51.72	56.05	62.22	3.33
+ ∞-RoPE	97.24	96.24	98.58	46.64	56.09	63.28	2.17
+ Deep Forcing	96.08	95.38	98.24	41.44	56.68	60.81	3.17
+ LongLive-RAG Ours	97.60	96.51	98.70	44.69	57.19	64.97	1.33
LongLive backbone
LongLive	97.13	95.89	98.61	44.56	58.17	67.56	2.83
+ ∞-RoPE	97.00	95.85	98.53	53.36	57.48	66.94	3.33
+ Deep Forcing	97.17	96.04	98.73	45.13	57.48	67.27	2.50
+ LongLive-RAG Ours	97.32	96.08	98.62	49.90	58.30	67.79	1.33
Causal-Forcing backbone
Causal-Forcing	93.52	94.12	95.74	72.32	51.24	62.30	3.83
+ ∞-RoPE	93.81	93.78	96.09	92.47	54.42	67.50	2.50
+ Deep Forcing	94.27	94.18	96.62	78.59	52.12	64.25	2.33
+ LongLive-RAG Ours	94.29	94.24	96.48	88.20	54.95	68.16	1.33
Self-Forcing backbone
Self-Forcing	96.12	95.32	98.27	43.39	55.64	61.57	3.33
+ ∞-RoPE	97.15	96.09	98.55	46.29	55.11	61.81	2.33
+ Deep Forcing	96.92	96.83	98.97	15.23	52.84	57.93	2.83
+ LongLive-RAG Ours	97.64	96.40	98.75	44.10	56.30	64.16	1.50
LongLive backbone
LongLive	96.93	95.64	98.58	47.12	57.90	66.95	2.50
+ ∞-RoPE	96.81	95.65	98.48	53.59	56.73	66.19	3.17
+ Deep Forcing	97.17	95.72	98.57	46.03	56.98	66.03	3.00
+ LongLive-RAG Ours	97.22	95.88	98.62	50.25	57.57	66.95	1.33
Causal-Forcing backbone
Causal-Forcing	92.98	94.66	95.41	63.79	47.31	58.23	3.33
+ ∞-RoPE	93.45	93.56	95.95	93.44	53.47	66.81	2.17
+ Deep Forcing	93.19	94.04	95.60	76.45	46.86	61.58	3.17
+ LongLive-RAG Ours	94.38	94.08	96.56	90.21	54.82	68.23	1.33

Across three AR backbones and 30s / 60s / 120s horizons, LongLive-RAG achieves the best average VBench-Long rank in every block and improves subject consistency, background consistency, motion smoothness, and imaging quality over the base model in nearly every setting.
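
The Avg. Rank column can be checked with a short script. The sketch below assumes that each metric is ranked within a block (rank 1 for the highest value, since all six metrics are higher-is-better) and that the ranks are then averaged; this reading reproduces the values listed for the first Self-Forcing block. The function name `average_ranks` is hypothetical.

```python
def average_ranks(scores):
    """scores: {method: [metric values]}, higher is better for every metric.

    Returns {method: mean rank across metrics}, with rank 1 as the best.
    Ties are not handled specially; tied methods keep their insertion order.
    """
    methods = list(scores)
    n_metrics = len(next(iter(scores.values())))
    totals = {m: 0 for m in methods}
    for j in range(n_metrics):
        ordered = sorted(methods, key=lambda m: scores[m][j], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_metrics for m in methods}


# First Self-Forcing block of the results table.
rows = {
    "Self-Forcing":   [96.21, 95.39, 98.39, 52.03, 56.69, 63.31],
    "+ ∞-RoPE":       [97.32, 96.38, 98.59, 46.82, 56.78, 63.93],
    "+ Deep Forcing": [97.04, 96.02, 98.57, 38.85, 56.44, 61.91],
    "+ LongLive-RAG": [97.57, 96.56, 98.76, 42.24, 57.17, 65.43],
}
avg = average_ranks(rows)
for method, value in avg.items():
    print(f"{method}: {value:.2f}")
```

LongLive-RAG ranks first on five of the six metrics in this block and third on Dynamic Degree, which gives (5 × 1 + 3) / 6 ≈ 1.33.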

05 BibTeX

@article{hu2026longliverag,
  title   = {LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation},
  author  = {Hu, Qixin and Yang, Shuai and Huang, Wei and Chen, Yukang and Han, Song},
  journal = {arXiv preprint arXiv:2606.02553},
  year    = {2026},
  eprint  = {2606.02553},
  archivePrefix = {arXiv}
}