HiCo compresses hour-long videos to ~1/50 of the usual token count so MLLMs can efficiently reason over 10k+ frames

December 31, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang

Links

Abstract / PDF

Why It Matters For Business

Reduce inference cost for hour-scale video by roughly two orders of magnitude, enabling long-video features on single GPUs and lowering hosting and latency costs.

Summary TLDR

The paper introduces HiCo, a two-stage hierarchical compression that turns long videos into compact token sequences (about 16 tokens/frame, ~1/50 of common dense representations) with little performance loss. Combined with a short-to-long training schedule and LongVid (a 300k-hour long-video corpus), HiCo yields the VideoChat-Flash models (2B/7B), which are far cheaper to run and can perform inference over 10,000 frames on a single A100. The authors also introduce a harder benchmark, Multi-Hop Needle-in-a-Video-Haystack (MH-NIAH). Key wins: extreme token savings, large FLOPs/memory reduction, 99.1% single-hop retrieval at 10k frames, and improved general benchmark scores.

Problem Statement

Long videos produce huge, redundant token sequences that blow up compute and memory. Existing fixes either naively enlarge context windows (very costly) or over-compress frames (lose detail). The field needs an approach that reduces cost enough to handle hour-scale video while keeping the fine-grained information needed for reasoning and retrieval.

Main Contribution

HiCo: a hierarchical two-stage video compression (clip-level token merging + video-level progressive dropout) that compresses to ~16 tokens/frame.

LongVid: a long-video instruction dataset assembled from public sources (reported 300k hours and 2B words) for long-form training.

Short-to-long multi-stage training recipe that mixes image, short-video, and long-video stages.

Multi-Hop Needle-in-a-Video-Haystack (MH-NIAH): a more robust, multi-step retrieval+reasoning benchmark.

VideoChat-Flash models (2B and 7B) that deliver SOTA open-source performance and much lower FLOPs for long-video inference.

Key Findings

HiCo compresses each frame to about 16 tokens (≈2% of dense tokenization) with almost no performance loss.

Numbers: 16 tokens/frame; compression ratio ≈2% (1/50)

Huge compute reduction enables single-GPU inference on very long videos.

Numbers: at 10,000 frames, VideoChat-Flash 9,969.5 TFLOPs vs LongVILA 1,184,250 TFLOPs

State-of-the-art open-source long-video retrieval on single-hop NIAH over 10k frames.

Numbers: Single-Hop NIAH retrieval accuracy 99.1% (10,000 frames)

Multi-hop long-video reasoning is still hard; MH-NIAH reveals large gaps.

Numbers: MH-NIAH CAP 31.3%, QA 25.4% (VideoChat-Flash), ~8 points above LongVA

Short-to-long training and duration-based sampling materially boost performance.

Numbers: in the ablation, MVBench average rises from 60.2 to 74.0 after joint training and high-resolution fine-tuning
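The token figures in the first finding can be checked with a few lines of arithmetic. The sketch below assumes a 10,000-frame video and uses the dense range of 196–729 tokens/frame that this summary quotes as the common baseline; the function name `token_budget` is illustrative, not from the paper.

```python
def token_budget(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens passed to the LLM for one video."""
    return num_frames * tokens_per_frame


FRAMES = 10_000                      # assumed video length in sampled frames
HICO_TOKENS_PER_FRAME = 16           # figure quoted in this summary
DENSE_TOKENS_PER_FRAME = (196, 729)  # dense baseline range quoted in this summary

hico_total = token_budget(FRAMES, HICO_TOKENS_PER_FRAME)
for dense in DENSE_TOKENS_PER_FRAME:
    dense_total = token_budget(FRAMES, dense)
    kept = HICO_TOKENS_PER_FRAME / dense
    print(f"dense {dense}/frame: {dense_total:,} tokens vs {hico_total:,} ({kept:.1%} kept)")
```

Under these assumptions the ≈2% ratio corresponds to the 729-token end of the dense range; against a 196-token baseline the same 16 tokens/frame is closer to 8%.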

Results

Single-Hop NIAH retrieval accuracy

Value: 99.1% over 10,000 frames

Baseline: LongVA ~92% (within 3k frames); LLaMA-VID 55%

Multi-Hop NIAH CAP/QA

Value: CAP 31.3%; QA 25.4% (average)

Baseline: LongVA ~8 points lower

Tokens per frame used by VideoChat-Flash

Value: 16 tokens/frame (avg)

Baseline: common dense ~196–729 tokens/frame

FLOPs for extreme long inference (10,000 frames)

Value: 9,969.5 TFLOPs (VideoChat-Flash)

Baseline: 1,184,250 TFLOPs (LongVILA)

MVBench average (7B VideoChat-Flash @448)

Value: 74.0

Baseline: InternVL2-76B 69.6; GPT-4o 64.6
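As a quick sanity check on the FLOPs row, the reduction factor can be computed directly from the two reported numbers. This is plain arithmetic on the figures quoted in this summary, with no assumptions about how the paper measured them.

```python
import math

VIDEOCHAT_FLASH_TFLOPS = 9_969.5   # 10,000 frames, as reported above
LONGVILA_TFLOPS = 1_184_250.0      # 10,000 frames, as reported above

reduction = LONGVILA_TFLOPS / VIDEOCHAT_FLASH_TFLOPS
orders_of_magnitude = math.log10(reduction)

print(f"{reduction:.1f}x fewer FLOPs ({orders_of_magnitude:.2f} orders of magnitude)")
```

The ratio comes out at roughly 119x, which is consistent with the "two orders of magnitude" cost claim made earlier in the summary.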

Who Should Care

What To Try In 7 Days

Prototype clip-level token merging (ToMe) on your video encoder and measure tokens/frame and downstream QA.

Apply duration-based sampling + timestamp prompts to your inference pipeline to balance short/long video detail.

Benchmark chained retrieval with MH-NIAH-style multi-hop probes before shipping long-video workflows.
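For the third item, a chained probe can be prototyped before adopting the real benchmark. The sketch below is a hypothetical construction, not the paper's MH-NIAH protocol: `Needle`, `build_chain`, and `chain_score` are made-up names, the clue wording is invented, and the scoring rule (credit only for hops answered correctly in order) is an assumption rather than the paper's CAP/QA definition.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Needle:
    frame_index: int  # where the needle frame is spliced into the haystack video
    clue: str         # text rendered on the needle frame


def build_chain(num_frames: int, hops: int, seed: int = 0) -> List[Needle]:
    """Place `hops` needles at distinct random frames; each clue names the next needle."""
    rng = random.Random(seed)
    positions = rng.sample(range(num_frames), hops)  # unsorted, so the chain jumps around
    needles = []
    for i, pos in enumerate(positions):
        if i + 1 < hops:
            clue = f"needle-{i}: next, look for needle-{i + 1}"
        else:
            clue = f"needle-{i}: the answer is on this frame"
        needles.append(Needle(frame_index=pos, clue=clue))
    return needles


def chain_score(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of hops located correctly, stopping at the first miss."""
    if not gold:
        return 0.0
    correct = 0
    for p, g in zip(predicted, gold):
        if p != g:
            break
        correct += 1
    return correct / len(gold)


chain = build_chain(num_frames=10_000, hops=3)
gold_frames = [n.frame_index for n in chain]
print(chain_score(gold_frames, gold_frames))  # a perfect prediction scores 1.0
```

Stopping at the first miss is a deliberate choice here: it penalizes a model that guesses the final needle without following the chain, which is the behavior a multi-hop probe is meant to expose.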

Agent Features

Memory

  • long-context handling via compressed tokens

Architectures

  • hierarchical compression
  • spatio-temporal encoder + connector
  • LLM layer-wise visual dropout

Optimization Features

Token Efficiency

  • 16 tokens/frame (~1/50)
  • duration-based sampling (dense short, sparse long)
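Duration-based sampling can be sketched as "sample densely until a frame cap is hit, then spread the cap evenly". The code below is a minimal illustration under assumed values: the 1 fps dense rate, the 512-frame cap, and the prompt wording are all hypothetical and not taken from the paper.

```python
from typing import List


def sample_timestamps(duration_s: float, dense_fps: float = 1.0,
                      max_frames: int = 512) -> List[float]:
    """Dense sampling for short videos, uniform sparse sampling once the cap is reached."""
    if duration_s <= 0:
        return []
    num_frames = min(max(1, int(duration_s * dense_fps)), max_frames)
    segment = duration_s / num_frames
    # Take the midpoint of each of the num_frames equal-length segments.
    return [round((i + 0.5) * segment, 2) for i in range(num_frames)]


def timestamp_prompt(duration_s: float, timestamps: List[float]) -> str:
    """Tell the model how long the video is and how many frames it is seeing."""
    return (f"The video lasts {duration_s:.0f} seconds; "
            f"{len(timestamps)} frames were sampled from it.")


for duration in (30, 3600):
    ts = sample_timestamps(duration)
    print(timestamp_prompt(duration, ts), f"spacing ~{duration / len(ts):.2f}s")
```

With these defaults a 30-second clip keeps one frame per second, while an hour-long video is capped at 512 frames spaced about 7 seconds apart. The timestamp prompt gives the model the information it needs to map frame positions back to wall-clock time.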

Infra Optimization

  • enables single-A100 inference on 10k frames; large FLOPs reductions

Model Optimization

  • clip-level token merging (ToMe)
  • spatio-temporal attention in video encoder (UMT-L)
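The clip-level merging step can be illustrated with one round of ToMe-style bipartite matching. This is a simplified pure-Python sketch, not the paper's implementation: it matches on the token vectors themselves (a simplifying assumption), handles a single clip, and the name `merge_tokens` is illustrative.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens: List[Vector], r: int) -> List[Vector]:
    """Remove up to r tokens by averaging the r most similar A->B pairs."""
    set_a, set_b = tokens[0::2], tokens[1::2]  # alternating bipartite split
    if not set_b:
        return list(tokens)
    r = max(0, min(r, len(set_a)))

    # Each token in A proposes its most similar partner in B.
    proposals = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        proposals.append((sims[j], i, j))

    # Only the r strongest proposals are merged.
    to_merge = sorted(proposals, reverse=True)[:r]
    merged_a = {i for _, i, _ in to_merge}

    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for _, i, j in to_merge:
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1

    kept_a = [a for i, a in enumerate(set_a) if i not in merged_a]
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    # Note: output is unmerged A tokens followed by B tokens, not the original order.
    return kept_a + merged_b


clip_tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 0.9]]
print(len(merge_tokens(clip_tokens, r=1)))  # one token fewer than the input
```

A production version would run this on batched tensors and typically repeat the merge across encoder layers; the sketch only shows the matching and averaging logic on a toy input.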

System Optimization

  • high-resolution post-finetune (224→448) while freezing LLM

Training Optimization

  • multi-stage short-to-long curriculum
  • mix of short and long instruction tuning

Inference Optimization

  • progressive visual dropout (uniform shallow, attention-based deep)
  • video-level token selection only at inference
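The progressive dropout idea (uniform in shallow layers, attention-guided in deep layers) can be sketched on token indices alone. Everything concrete here is an assumption for illustration: the four-layer schedule, the 0.5 keep ratio, and the single fixed `scores` list standing in for attention weights, which a real model would recompute at each layer.

```python
from typing import List, Sequence


def uniform_keep(indices: Sequence[int], keep_ratio: float) -> List[int]:
    """Keep an evenly spaced subset of the surviving token indices."""
    n_keep = max(1, round(len(indices) * keep_ratio))
    step = len(indices) / n_keep
    return [indices[int(i * step)] for i in range(n_keep)]


def attention_keep(indices: Sequence[int], scores: Sequence[float],
                   keep_ratio: float) -> List[int]:
    """Keep the highest-scoring token indices, restored to temporal order."""
    n_keep = max(1, round(len(indices) * keep_ratio))
    ranked = sorted(indices, key=lambda idx: scores[idx], reverse=True)[:n_keep]
    return sorted(ranked)


def progressive_dropout(num_tokens: int, scores: Sequence[float],
                        num_layers: int = 4, shallow_layers: int = 2,
                        keep_ratio: float = 0.5) -> List[int]:
    """Return the visual-token indices that survive every layer."""
    kept = list(range(num_tokens))
    for layer in range(num_layers):
        if layer < shallow_layers:
            kept = uniform_keep(kept, keep_ratio)            # attention not trusted yet
        else:
            kept = attention_keep(kept, scores, keep_ratio)  # attention-guided pruning
    return kept


toy_scores = [float(i % 7) for i in range(16)]  # stand-in for text-to-visual attention
print(progressive_dropout(16, toy_scores))
```

Because selection happens only at inference in the approach described above, a function like this can be evaluated as a drop-in filter without retraining.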

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Extreme compression can still lose rare fine-grained visual cues in complex multi-hop tasks.
  • Video-level compression is used only at inference due to training-compatibility issues (sequence parallelism).
  • Multi-hop reasoning accuracy remains low in absolute terms despite improvements.

When Not To Use

  • When exact per-frame pixel reconstruction or dense frame-level outputs are required.
  • When your training infra depends on sequence-parallel optimizations incompatible with video-level dropout.

Failure Modes

  • Dropping tokens that contain the single crucial frame for a multi-hop chain.
  • Bias from attention-based selection if it is applied in shallow layers, where attention is not yet reliable.
  • Evaluation leakage if benchmark images/questions overlap training sources (paper notes this risk for NIAH-style tests).
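The leakage risk in the last item can be screened with a simple overlap check before evaluation. The sketch below is a hypothetical helper, not something the paper provides: it only catches questions that match after lowercasing and whitespace normalization, so paraphrases and overlapping images would need stronger methods.

```python
import hashlib
from typing import Iterable, List


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_overlap(train_items: Iterable[str], benchmark_items: Iterable[str]) -> List[str]:
    """Return benchmark items whose fingerprint also appears in the training set."""
    train_prints = {fingerprint(item) for item in train_items}
    return [item for item in benchmark_items if fingerprint(item) in train_prints]


train_questions = ["What color is  the CAR?"]
benchmark_questions = ["what color is the car?", "Who is speaking?"]
print(find_overlap(train_questions, benchmark_questions))
```

Any non-empty result is a signal to remove the overlapping items from the benchmark split or to report scores with and without them.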

Core Entities

Models

  • VideoChat-Flash
  • VideoChat-Flash @224
  • VideoChat-Flash @448
  • Qwen2-7B
  • UMT-L
  • InternVideo2-1B
  • LongVA
  • LLaMA-VID
  • LongVILA
  • GPT-4o
  • Gemini-1.5-Pro

Metrics

  • tokens per frame
  • FLOPs (TFLOPs)
  • inference memory (GB)
  • Accuracy
  • CAP/QA (MH-NIAH)
  • MVBench average
  • mIoU (Charades-STA)

Datasets

  • LongVid (constructed, reported 300k hours)
  • Ego4D
  • HowTo100M
  • HD-VILA
  • MiraData
  • WebVid2M
  • MS-COCO (images used in NIAH)

Benchmarks

  • MVBench
  • Perception Test
  • LongVideoBench
  • MLVU
  • VideoMME
  • LVBench
  • Charades-STA
  • AuroraCap
  • Needle-in-a-Video-Haystack (NIAH)
  • Multi-Hop NIAH (MH-NIAH)