Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Reduce inference cost for hour-scale video by roughly two orders of magnitude, enabling long-video features on single GPUs and lowering hosting and latency costs.
Summary TLDR
The paper introduces HiCo, a two-stage hierarchical compression scheme that turns long videos into compact token sequences (about 16 tokens/frame, roughly 1/50 of common dense representations) with little performance loss. Paired with a short-to-long training schedule and LongVid (a long-video corpus reported at 300k hours), the resulting VideoChat-Flash models (2B/7B) are far cheaper to run and can perform inference over 10,000 frames on a single A100. The authors also introduce a harder Multi-Hop Needle-in-a-Video-Haystack benchmark. Key wins: extreme token savings, large FLOPs and memory reductions, 99.1% single-hop retrieval at 10k frames, and improved general benchmark scores.
Problem Statement
Long videos produce huge, redundant token sequences that blow up compute and memory. Existing fixes either naively enlarge context windows (very costly) or over-compress frames (lose detail). The field needs an approach that reduces cost enough to handle hour-scale video while keeping the fine-grained information needed for reasoning and retrieval.
Main Contribution
HiCo: a hierarchical two-stage video compression (clip-level token merging + video-level progressive dropout) that compresses to ~16 tokens/frame.
LongVid: a long-video instruction dataset assembled from public sources (reported 300k hours and 2B words) for long-form training.
Short-to-long multi-stage training recipe that mixes image, short-video, and long-video stages.
Multi-Hop Needle-in-a-Video-Haystack (MH-NIAH): a more robust, multi-step retrieval+reasoning benchmark.
VideoChat-Flash models (2B and 7B) that deliver SOTA open-source performance and much lower FLOPs for long-video inference.
Key Findings
HiCo compresses each frame to about 16 tokens (≈2% of dense tokenization) with almost no performance loss.
The resulting FLOPs and memory reduction enables single-A100 inference on videos of about 10,000 frames.
Single-hop NIAH retrieval reaches 99.1% at 10k frames, state of the art among open-source models.
Multi-hop long-video reasoning is still hard; MH-NIAH reveals large gaps.
Short-to-long training and duration-based sampling materially boost performance.
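To make the token savings concrete, the arithmetic behind the figures above can be sketched as follows. This is a back-of-the-envelope calculation, not a number taken from the paper: the dense baseline of 800 tokens/frame is an assumption derived by treating the "~1/50" ratio as exact, and the function name is illustrative.

```python
def visual_token_budget(num_frames: int, tokens_per_frame: int) -> int:
    """Total number of visual tokens passed to the LLM for one video."""
    return num_frames * tokens_per_frame


COMPRESSED_TOKENS_PER_FRAME = 16  # figure reported in the summary
COMPRESSION_RATIO = 50            # "~1/50" in the summary, treated as exact here
# Assumption: implied dense baseline, derived from the two constants above.
DENSE_TOKENS_PER_FRAME = COMPRESSED_TOKENS_PER_FRAME * COMPRESSION_RATIO

frames = 10_000
compressed = visual_token_budget(frames, COMPRESSED_TOKENS_PER_FRAME)
dense = visual_token_budget(frames, DENSE_TOKENS_PER_FRAME)

print(f"compressed: {compressed:,} tokens")
print(f"dense (assumed baseline): {dense:,} tokens")
```

Under these assumptions, a 10,000-frame video needs 160,000 visual tokens after compression instead of about 8 million. Because self-attention cost grows faster than linearly with sequence length, the compute saving can exceed the 50x token saving.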
Results
Accuracy
Multi-Hop NIAH CAP/QA
Tokens per frame used by VideoChat-Flash
FLOPs for extreme long inference (10,000 frames)
MVBench average (7B VideoChat-Flash @448)
Who Should Care
What To Try In 7 Days
Prototype clip-level token merging (ToMe) on your video encoder and measure tokens/frame and downstream QA.
Apply duration-based sampling + timestamp prompts to your inference pipeline to balance short/long video detail.
Benchmark chained retrieval with MH-NIAH-style multi-hop probes before shipping long-video workflows.
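For the first suggestion, a minimal, dependency-free sketch of ToMe-style merging is shown here. It is a simplification and not the paper's or the ToMe library's implementation: tokens are plain lists of floats, the alternating A/B split and the parameter `r` follow the general bipartite-matching idea, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_tokens(tokens, r):
    """Merge up to r tokens from set A (even positions) into their most
    similar partner in set B (odd positions) by averaging."""
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx:
        return [list(t) for t in tokens]
    r = max(0, min(r, len(a_idx)))

    # For every A token, find its best match in B.
    matches = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        matches.append((cosine(tokens[i], tokens[j]), i, j))

    # Only the r most similar pairs are merged.
    matches.sort(reverse=True)
    merged_away = {i: j for _, i, j in matches[:r]}

    # Accumulate sums and counts so merged tokens become a plain average.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for i, j in merged_away.items():
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    merged = []
    for k, tok in enumerate(tokens):
        if k in merged_away:
            continue  # this token was absorbed into a B token
        if k in sums:
            merged.append([s / counts[k] for s in sums[k]])
        else:
            merged.append(list(tok))
    return merged


clip_tokens = [[1, 0], [1, 0], [0, 1], [1, 1]]
reduced = merge_tokens(clip_tokens, r=1)
print(len(clip_tokens), "->", len(reduced))
```

Measuring tokens/frame before and after such a step, alongside downstream QA accuracy, gives a quick read on how much merging a given encoder tolerates.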
Agent Features
Memory
- long-context handling via compressed tokens
Architectures
- hierarchical compression
- spatio-temporal encoder + connector
- LLM layer-wise visual dropout
Optimization Features
Token Efficiency
- 16 tokens/frame (~1/50)
- duration-based sampling (dense short, sparse long)
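Duration-based sampling can be sketched as a frame budget that follows video length until it hits a cap, so short videos are sampled densely and long ones sparsely. The rate, the frame limits, and the prompt wording below are illustrative assumptions, not values from the paper.

```python
def sample_timestamps(duration_s, dense_fps=1.0, min_frames=8, max_frames=512):
    """Return frame timestamps (seconds) for a video of the given duration.

    Short videos are sampled at roughly `dense_fps`; once the frame count
    would exceed `max_frames`, the sampling becomes sparser instead.
    All default values are assumptions made for this sketch.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    target = int(duration_s * dense_fps)
    n = max(min_frames, min(target, max_frames))
    step = duration_s / n
    # Take the midpoint of each of the n equal-length segments.
    return [(i + 0.5) * step for i in range(n)]


def timestamp_prompt(duration_s, timestamps):
    """Hypothetical prompt text telling the model how the video was sampled."""
    return (
        f"The video lasts {duration_s:.1f}s and "
        f"{len(timestamps)} frames were sampled from it."
    )


short_clip = sample_timestamps(60.0)     # one-minute clip
long_video = sample_timestamps(7200.0)   # two-hour video
print(len(short_clip), len(long_video))
print(timestamp_prompt(60.0, short_clip))
```

With these defaults the one-minute clip keeps one frame per second, while the two-hour video is capped at 512 frames, roughly one frame every 14 seconds.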
Infra Optimization
- enables single-A100 inference on 10k frames; large FLOPs reductions
Model Optimization
- clip-level token merging (ToMe)
- spatio-temporal attention in video encoder (UMT-L)
System Optimization
- high-resolution post-finetuning (224→448) with the LLM frozen
Training Optimization
- multi-stage short-to-long curriculum
- mix of short and long instruction tuning
Inference Optimization
- progressive visual dropout (uniform shallow, attention-based deep)
- video-level token selection only at inference
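The progressive dropout idea (uniform in shallow layers, attention-based in deep layers) can be illustrated with a small sketch. The per-layer schedule, the `keep_ratio` parameter, and the format of `layer_scores` (one score per visual token per layer, standing in for attention received from text tokens) are all assumptions for illustration, not the paper's actual configuration.

```python
def uniform_keep(indices, keep):
    """Keep `keep` evenly spaced token indices."""
    if keep >= len(indices):
        return list(indices)
    stride = len(indices) / keep
    return [indices[int(i * stride)] for i in range(keep)]


def attention_keep(indices, scores, keep):
    """Keep the `keep` highest-scoring token indices, in original order."""
    ranked = sorted(indices, key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(ranked)


def progressive_dropout(num_tokens, layer_scores, shallow_layers, keep_ratio):
    """Return the indices of visual tokens that survive all layers.

    Layers below `shallow_layers` drop tokens uniformly; later layers drop
    the tokens with the lowest scores. Dropping at every layer is a
    simplification made for this sketch.
    """
    kept = list(range(num_tokens))
    for layer, scores in enumerate(layer_scores):
        keep = max(1, int(len(kept) * keep_ratio))
        if layer < shallow_layers:
            kept = uniform_keep(kept, keep)
        else:
            kept = attention_keep(kept, scores, keep)
    return kept


scores_per_layer = [
    [0.0] * 8,                                   # shallow layer: scores unused
    [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],    # deep layer: made-up scores
]
survivors = progressive_dropout(
    num_tokens=8, layer_scores=scores_per_layer, shallow_layers=1, keep_ratio=0.5
)
print(survivors)
```

In this toy run the shallow layer keeps every second token regardless of score, so high-scoring tokens 1 and 3 are already gone before the deep layer ranks the remainder. That is the same trade-off listed under failure modes: uniform dropping is cheap and unbiased but can discard a crucial token.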
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Extreme compression can still lose rare fine-grained visual cues in complex multi-hop tasks.
- Video-level compression is used only at inference due to training-compatibility issues (sequence parallelism).
- Multi-hop reasoning accuracy remains low in absolute terms despite improvements.
When Not To Use
- When exact per-frame pixel reconstruction or dense frame-level outputs are required.
- When your training infra depends on sequence-parallel optimizations incompatible with video-level dropout.
Failure Modes
- Dropping tokens that contain the single crucial frame for a multi-hop chain.
- Biased token selection if attention-based dropout is applied in shallow layers, where attention scores are not yet reliable.
- Evaluation leakage if benchmark images/questions overlap training sources (paper notes this risk for NIAH-style tests).
Core Entities
Models
- VideoChat-Flash
- VideoChat-Flash @224
- VideoChat-Flash @448
- Qwen2-7B
- UMT-L
- InternVideo2-1B
- LongVA
- LLaMA-VID
- LongVILA
- GPT-4o
- Gemini-1.5-Pro
Metrics
- tokens per frame
- FLOPs (TFLOPs)
- inference memory (GB)
- Accuracy
- CAP/QA (MH-NIAH)
- MVBench average
- mIoU (Charades-STA)
Datasets
- LongVid (constructed, reported 300k hours)
- Ego4D
- HowTo100M
- HD-VILA
- MiraData
- WebVid2M
- MS-COCO (images used in NIAH)
Benchmarks
- MVBench
- Perception Test
- LongVideoBench
- MLVU
- VideoMME
- LVBench
- Charades-STA
- AuroraCap
- Needle-in-a-Video-Haystack (NIAH)
- Multi-Hop NIAH (MH-NIAH)

