StreamChat: real-time streaming video QA with hierarchical memory and sub-second latency

January 23, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

1

Authors

Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu

Why It Matters For Business

StreamChat makes interactive video assistants and robotics feasible by cutting latency below 1s and improving streaming QA accuracy, reducing user wait and raising answer quality in live settings.

Summary TLDR

StreamChat is a training-free Video-LLM pipeline that adds a three-tier hierarchical memory (short-term, long-term, dialogue) and parallel system scheduling to handle long streaming videos and multi-turn Q&A in real time. On the new STREAMBENCH dataset its Slow preset reaches 64.7% accuracy in online mode, 8.3 percentage points above the Video-online baseline (56.4%). Its Fast preset runs at up to 32 FPS and keeps request processing delay (RPD) under 0.9s. The system trades VRAM and retrieval precision for speed and supports end-to-end interactive use cases such as robotics and video assistants.

Problem Statement

Existing Video-LLMs were built for offline clips or single-turn QA. They struggle with long streams, multi-round conversation, real-time latency, and memory-efficient storage of past visual context while staying accurate.

Main Contribution

STREAMCHAT: a training-free streaming pipeline combining selective frame stacking, hierarchical memory (short/long/dialogue), and parallel system scheduling for real-time multi-turn video QA.

STREAMBENCH: a new streaming benchmark (306 videos, 1.8K QA pairs) that mixes egocentric, web, work, and movie videos and includes six question types plus latency metrics.

Empirical analysis showing that each memory component maps to gains on a specific task type, together with the practical trade-offs between speed, VRAM, and accuracy.

Key Findings

STREAMCHAT (Slow) achieves 64.7% accuracy on STREAMBENCH in the online setting.

Numbers: 64.7% acc on STREAMBENCH (Slow)

STREAMCHAT (Fast) processes video at 32 FPS and keeps request processing delay below 0.9s.

Numbers: 32 FPS; RPD ≈ 0.85–0.90s

Memory components yield task-specific gains: long-term memory (M_l) improves long-term-memory (LM) tasks by 6.2%, short-term memory (M_s) improves short-term-memory (SM) tasks by 3.2%, and dialogue memory (M_d) improves conversational interaction (CI) by 4.1%.

Numbers: LM +6.2%; SM +3.2%; CI +4.1%

Raising the motion threshold skips more frames and increases throughput, which saturates at 32 FPS, but it can lower accuracy (64.0% → 60.7%).

Numbers: speed ↑ to 32 FPS; acc drops 64.0% → 60.7%

Across offline benchmarks, STREAMCHAT improves average accuracy by about 2.5 percentage points over LongVA (50.6% vs 48.1%) when long-term memory is used.

Numbers: +2.5 points avg accuracy (offline benchmarks)

Results

Accuracy (STREAMBENCH, online)

Value: 64.7%

Baseline: Video-online 56.4%

Processing throughput and latency (Fast)

Value: 32 FPS; RPD 0.85–0.90s; text gen latency <0.9s

Baseline: Video-online 5 FPS; RPD 1.07s

Accuracy (offline benchmarks)

Value: average acc 50.6% (across ActNet/NExT/MSVD/MSRVTT)

Baseline: LongVA average acc 48.1%
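
The gaps quoted in this summary follow directly from the values listed in this section. A quick check of the arithmetic, using only those reported numbers (the variable names are illustrative):

```python
# Reported values copied from the Results entries above.
online_acc, online_baseline = 64.7, 56.4      # STREAMBENCH accuracy, %
fast_fps, baseline_fps = 32, 5                # frames per second
offline_acc, offline_baseline = 50.6, 48.1    # average offline accuracy, %

# Accuracy gaps are differences in percentage points, not relative percentages.
online_gain = round(online_acc - online_baseline, 1)
offline_gain = round(offline_acc - offline_baseline, 1)

# Throughput is compared as a ratio.
speedup = round(fast_fps / baseline_fps, 1)

print(online_gain, speedup, offline_gain)  # 8.3 6.4 2.5
```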

What To Try In 7 Days

Clone the StreamChat repo and run the Fast preset on a short live stream to measure FPS and RPD.

Benchmark the system on a small set of your videos and compare Fast/Base/Slow to balance accuracy vs latency.

Tune the optical-flow threshold and clustering chunk length to find your app's speed vs recall sweet spot.
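
For the measurement steps above, a small timing harness is usually enough. The sketch below is not part of the StreamChat code: `measure_fps`, `measure_rpd`, and the two stub functions are hypothetical names, and RPD is taken here to mean the wall-clock time between submitting a question and getting the reply back, which may differ from the paper's exact definition.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_fps(process_frame: Callable[[object], None],
                frames: Iterable[object]) -> float:
    """Return frames processed per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def measure_rpd(answer: Callable[[str], str], question: str) -> Tuple[str, float]:
    """Return the reply and the seconds elapsed while waiting for it."""
    start = time.perf_counter()
    reply = answer(question)
    return reply, time.perf_counter() - start


# Stand-ins for the real pipeline (assumptions); swap in calls to your deployment.
def fake_process_frame(frame: object) -> None:
    time.sleep(0.001)


def fake_answer(question: str) -> str:
    time.sleep(0.01)
    return "stub answer to: " + question


fps = measure_fps(fake_process_frame, range(50))
reply, rpd = measure_rpd(fake_answer, "What is on the table?")
print(f"FPS={fps:.1f}  RPD={rpd:.3f}s")
```

Running the same harness against each preset gives comparable FPS and RPD figures for your own videos.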

Agent Features

Memory

  • short-term memory (recent embeddings)
  • long-term memory tree (clustered chunks)
  • dialogue memory (pre-encoded QA pairs)
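
The three tiers listed above can be pictured with a toy data structure. This is a minimal sketch under loud assumptions, not the authors' implementation: the class and method names are invented, long-term memory is a flat list rather than a tree, chunks are compressed by plain mean pooling, and retrieval is brute-force cosine similarity instead of FAISS.

```python
import math
from collections import deque
from typing import Deque, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class ThreeTierMemory:
    """Toy short-term / long-term / dialogue memory (hypothetical design)."""

    def __init__(self, short_capacity: int = 4, chunk_len: int = 2) -> None:
        assert chunk_len <= short_capacity + 1
        self.short: Deque[Vector] = deque()                 # recent frame embeddings
        self.long: List[Vector] = []                        # one pooled vector per chunk
        self.dialogue: List[Tuple[Vector, str, str]] = []   # (embedding, question, answer)
        self.short_capacity = short_capacity
        self.chunk_len = chunk_len

    def add_frame(self, embedding: Vector) -> None:
        self.short.append(embedding)
        # On overflow, compress the oldest chunk into a single long-term vector.
        if len(self.short) > self.short_capacity:
            chunk = [self.short.popleft() for _ in range(self.chunk_len)]
            self.long.append(mean_pool(chunk))

    def add_turn(self, embedding: Vector, question: str, answer: str) -> None:
        self.dialogue.append((embedding, question, answer))

    def retrieve_long(self, query: Vector, k: int = 1) -> List[Vector]:
        return sorted(self.long, key=lambda v: cosine(query, v), reverse=True)[:k]

    def retrieve_dialogue(self, query: Vector, k: int = 1) -> List[Tuple[str, str]]:
        ranked = sorted(self.dialogue, key=lambda t: cosine(query, t[0]), reverse=True)
        return [(q, a) for _, q, a in ranked[:k]]


memory = ThreeTierMemory(short_capacity=2, chunk_len=2)
for emb in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]):
    memory.add_frame(emb)
memory.add_turn([1.0, 0.0], "What colour is the cup?", "Red.")

print(memory.retrieve_long([0.1, 0.9]))      # [[0.0, 1.0]]
print(memory.retrieve_dialogue([0.9, 0.1]))  # [('What colour is the cup?', 'Red.')]
```

The point of the sketch is the division of labour: recent frames stay uncompressed, older frames survive only as pooled chunk vectors, and past question-answer pairs are retrieved separately from visual context.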

Planning

  • parallel system scheduling across threads
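
One common way to run pipeline stages in parallel threads is a producer-consumer pair joined by a bounded queue. The sketch below illustrates that general pattern only; the function names, the placeholder "feature" computation, and the buffer size of 8 are assumptions, not details from the paper.

```python
import queue
import threading
from typing import Iterable, List

SENTINEL = None  # marks the end of the stream


def extract_features(frames: Iterable[int], buffer: "queue.Queue") -> None:
    """Producer: turn frames into placeholder features and enqueue them."""
    for frame in frames:
        feature = frame * 2      # placeholder for a vision-encoder call
        buffer.put(feature)      # blocks while the buffer is full
    buffer.put(SENTINEL)


def update_memory(buffer: "queue.Queue", memory: List[int]) -> None:
    """Consumer: drain features into memory until the stream ends."""
    while True:
        feature = buffer.get()
        if feature is SENTINEL:
            break
        memory.append(feature)


buffer: "queue.Queue" = queue.Queue(maxsize=8)  # bounded, so backlog cannot grow
memory: List[int] = []

producer = threading.Thread(target=extract_features, args=(range(100), buffer))
consumer = threading.Thread(target=update_memory, args=(buffer, memory))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(len(memory), memory[:3])  # 100 [0, 2, 4]
```

Because `put` blocks when the queue is full, a slow memory-update stage throttles feature extraction instead of letting pending frames pile up without limit.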

Tool Use

  • FAISS similarity search
  • CLIP vision encoder
  • Lucas-Kanade optical flow

Frameworks

  • LongVA (base)
  • CLIP-L-P14
  • MiniLM-L6

Is Agentic

true

Architectures

  • training-free Video-LLM pipeline
  • hierarchical memory tree

Optimization Features

Token Efficiency

  • clustered chunk tokens (v_i) compress visual tokens for retrieval

Infra Optimization

  • two-GPU setup with separate threads to reduce latency

Model Optimization

  • training-free design (no fine-tuning required)

System Optimization

  • decoupling feature extraction and memory updates to bound buffer size
  • tensor-parallel deployment across GPUs

Inference Optimization

  • parallel threads for frame stacking, memory formation, and summarization
  • selective frame stacking via optical-flow to drop redundant frames
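
Selective frame stacking amounts to a motion gate: a frame is kept only if it differs enough from what was last kept. The sketch below swaps in a much cruder motion measure than the Lucas-Kanade optical flow named in this summary, namely the mean absolute pixel difference, and compares each frame against the last kept frame. Both choices, and all function names, are assumptions made for illustration.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference; a crude proxy for optical-flow magnitude."""
    total = 0.0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def select_frames(frames: Sequence[Frame], threshold: float) -> List[int]:
    """Return indices of frames whose motion versus the last kept frame exceeds threshold."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        # Always keep the first frame; afterwards keep only sufficiently changed ones.
        if not kept or motion_score(frames[kept[-1]], frame) > threshold:
            kept.append(i)
    return kept


still = [[0.0, 0.0], [0.0, 0.0]]
moved = [[1.0, 1.0], [0.0, 0.0]]
frames = [still, still, moved, moved, still]

print(select_frames(frames, threshold=0.25))  # [0, 2, 4]
print(select_frames(frames, threshold=0.75))  # [0]
```

The second call shows the trade-off in miniature: a higher threshold leaves fewer frames to encode, but the change at frames 2 and 4 is dropped entirely.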

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Current retrieval uses basic similarity matching and can return wrong or irrelevant memory items.
  • Tree-structured long-term memory increases VRAM use and may not scale to very long streams.
  • Selective frame skipping can remove brief but important events, hurting recall of small objects and short-lived events.
  • Deployment shown on two GPUs; single-GPU or serverless deployment will need engineering work.

When Not To Use

  • When strict, lossless frame-level recall is required (e.g., forensic analysis).
  • On extremely long unbounded streams where VRAM and retrieval cost cannot scale.
  • When you cannot afford multi-GPU deployment or need deterministic exactness.

Failure Modes

  • Retrieval mismatch: wrong memory item leads to incorrect answers.
  • Information loss from aggressive frame skipping or clustering causes missed short events.
  • Hallucination or incorrect synthesis when dialogue history contains prior errors.
  • Reduced accuracy when tuning thresholds for higher FPS.

Core Entities

Models

  • STREAMCHAT (Slow/Base/Fast)
  • LongVA
  • CLIP-L-P14
  • MiniLM-L6
  • LLaMA-3 (scorer)

Metrics

  • Accuracy
  • Score (0–5 semantic correctness via LLaMA-3)
  • Coherence (score fluctuation)
  • Request Processing Delay (RPD)
  • FPS

Datasets

  • STREAMBENCH
  • EgoSchema
  • YouTube-8M
  • MSRVTT
  • MSVD
  • ActivityNet
  • NExT-QA

Benchmarks

  • STREAMBENCH (this paper)
  • MSRVTT-QA
  • ActivityNet-QA
  • NExT-QA
  • MSVD-QA

Context Entities

Models

  • Video-LLaVA
  • LLaMA-VID
  • GPT-4o
  • MovieChat
  • Flash-VStream
  • Video-online

Metrics

  • Human benchmark scores

Datasets

  • YouTube-8M (source)
  • EgoSchema (source)

Benchmarks

  • Prior streaming/offline video QA benchmarks