StreamChat: real-time streaming video QA with hierarchical memory and sub-second latency

January 23, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach is practical: hierarchical memory and parallel scheduling show consistent accuracy and latency gains on the evaluated benchmarks, but retrieval quality and VRAM limits constrain generality.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu

Links

Abstract / PDF / Code

Why It Matters For Business

StreamChat makes interactive video assistants and robotics feasible by cutting latency below 1s and improving streaming QA accuracy, reducing user wait and raising answer quality in live settings.

Who Should Care

Summary TLDR

StreamChat is a training-free Video-LLM pipeline that adds a three-tier hierarchical memory (short-term, long-term, dialogue) and parallel system scheduling to handle long streaming videos and multi-turn Q&A in real time. On the new STREAMBENCH dataset, the Slow variant reaches 64.7% accuracy in online mode (+8.3 percentage points over Video-online at 56.4%), the Fast variant runs at up to 32 FPS, and request processing delay stays under 0.9s. The system trades VRAM and retrieval precision for speed and supports end-to-end interactive use cases such as robotics and video assistants.

Problem Statement

Existing Video-LLMs were built for offline clips or single-turn QA. They struggle with long streams, multi-round conversation, real-time latency, and memory-efficient storage of past visual context while staying accurate.

Main Contribution

STREAMCHAT: a training-free streaming pipeline combining selective frame stacking, hierarchical memory (short/long/dialogue), and parallel system scheduling for real-time multi-turn video QA.

STREAMBENCH: a new streaming benchmark (306 videos, 1.8K QA pairs) that mixes egocentric, web, work, and movie videos and includes six question types plus latency metrics.
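The three memory tiers can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the class name `HierarchicalMemory`, the `short_capacity` and `chunk_len` parameters, and the use of a plain mean as a stand-in for the clustering step are all hypothetical, and a brute-force cosine search replaces the FAISS index listed under Tool Use.

```python
from collections import deque
from dataclasses import dataclass, field
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean; a simple stand-in for clustering a chunk."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class HierarchicalMemory:
    short_capacity: int = 4   # hypothetical: recent frames kept uncompressed
    chunk_len: int = 4        # hypothetical: evicted frames merged per chunk
    short_term: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)   # compressed chunk vectors
    dialogue: list = field(default_factory=list)    # (query_vec, question, answer)
    _pending: list = field(default_factory=list)    # evicted frames awaiting merge

    def add_frame(self, embedding):
        """Store a frame embedding; frames evicted from short-term are chunked."""
        self.short_term.append(embedding)
        if len(self.short_term) > self.short_capacity:
            self._pending.append(self.short_term.popleft())
        if len(self._pending) >= self.chunk_len:
            self.long_term.append(mean_vector(self._pending))
            self._pending = []

    def add_turn(self, query_vec, question, answer):
        """Record one question/answer pair with its pre-encoded query vector."""
        self.dialogue.append((query_vec, question, answer))

    def retrieve(self, query_vec, k=1):
        """Return the k long-term chunks most similar to the query vector."""
        ranked = sorted(self.long_term,
                        key=lambda v: cosine(query_vec, v), reverse=True)
        return ranked[:k]
```

Because retrieval in this sketch is pure similarity ranking, a query that sits between two chunks can pull back the wrong one, which is the retrieval-mismatch failure mode listed under Risks & Boundaries.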

Key Findings

STREAMCHAT (Slow) achieves 64.7% accuracy on STREAMBENCH in the online setting.

Numbers: 64.7% acc on STREAMBENCH (Slow)

Practical Use: Use hierarchical memory and parallel scheduling to get better online QA accuracy than prior streaming methods.

Evidence Ref: Tab.4

STREAMCHAT (Fast) processes video at 32 FPS and keeps request processing delay below 0.9s.

Numbers: 32 FPS; RPD ≈ 0.85–0.90s

Practical Use: Deploy this pipeline for real-time applications (robotics, live assistants) where sub-second interactive latency matters.

Evidence Ref: Tab.4, §3.2
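To check throughput and delay on your own hardware, a small timing harness is enough. The sketch below assumes RPD means the wall-clock time between submitting a question and receiving the full answer; the paper's exact definition may differ. `process_frame` and `answer_question` are hypothetical callables standing in for whatever entry points the real pipeline exposes.

```python
import time


def measure_fps(process_frame, frames):
    """Frames processed per second of wall-clock time."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def measure_rpd(answer_question, question):
    """Return (seconds from request to full answer, the answer itself)."""
    start = time.perf_counter()
    answer = answer_question(question)
    return time.perf_counter() - start, answer
```

Averaging several RPD samples taken while frames are still being ingested gives a more realistic number than a single measurement on an idle system.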

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 64.7% | Video-online 56.4% | +8.3 pts | STREAMBENCH online | Table 4 shows Slow model Acc 64.7 vs Video-online 56.4 | Tab.4 |
| Processing throughput and latency (Fast) | 32 FPS; RPD 0.85–0.90s; text gen latency <0.9s | Video-online 5 FPS; RPD 1.07s | ~6x FPS; RPD −0.17s to −0.22s | STREAMBENCH online | Table 4 reports Fast 32 FPS and RPD 0.85; Video-online 5 FPS, RPD 1.07 | Tab.4, §3.2 |
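The deltas can be recomputed directly from the reported values. The helper name `deltas` is hypothetical. Note that an RPD of 0.85s against the 1.07s baseline gives −0.22s, while a delta of −0.17s corresponds to the 0.90s end of the reported range.

```python
def deltas(value, baseline):
    """Absolute difference and ratio of a metric against its baseline."""
    return round(value - baseline, 2), round(value / baseline, 2)


acc_delta, _ = deltas(64.7, 56.4)      # accuracy gain in percentage points
_, fps_ratio = deltas(32, 5)           # throughput multiple over baseline
rpd_fast, _ = deltas(0.85, 1.07)       # RPD change at the 0.85s figure
rpd_upper, _ = deltas(0.90, 1.07)      # RPD change at the 0.90s figure

print(acc_delta, fps_ratio, rpd_fast, rpd_upper)
```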

What To Try In 7 Days

Clone the StreamChat repo and run the Fast preset on a short live stream to measure FPS and RPD.

Benchmark the system on a small set of your videos and compare Fast/Base/Slow to balance accuracy vs latency.

Tune the optical-flow threshold and clustering chunk length to find your app's speed vs recall sweet spot.
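The threshold tuning in the last step can be prototyped offline before touching the real pipeline. This sketch assumes a per-frame motion magnitude has already been computed (for example, the mean Lucas-Kanade flow magnitude against the previous frame) and keeps a frame once accumulated motion since the last kept frame exceeds the threshold. That accumulation rule, and the names `select_frames` and `sweep`, are assumptions for illustration; the paper's selection rule may differ.

```python
def select_frames(motion_scores, threshold):
    """Indices of frames kept: the first frame, then each frame where
    motion accumulated since the last kept frame exceeds the threshold."""
    kept = [0] if motion_scores else []
    accumulated = 0.0
    for i in range(1, len(motion_scores)):
        accumulated += motion_scores[i]
        if accumulated > threshold:
            kept.append(i)
            accumulated = 0.0
    return kept


def sweep(motion_scores, thresholds):
    """Map each threshold to the fraction of frames retained."""
    total = len(motion_scores)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: len(select_frames(motion_scores, t)) / total for t in thresholds}
```

A higher threshold retains fewer frames, which raises speed but risks dropping short events, so the sweep is best read alongside accuracy on questions about brief actions.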

Agent Features

Memory
short-term memory (recent embeddings); long-term memory tree (clustered chunks); dialogue memory (pre-encoded QA pairs)
Planning
parallel system scheduling across threads
Tool Use
FAISS similarity search; CLIP vision encoder; Lucas-Kanade optical flow
Frameworks
LongVA (base); CLIP-L-P14; MiniLM-L6
Is Agentic

Yes

Architectures
training-free Video-LLM pipeline; hierarchical memory tree

Optimization Features

Token Efficiency
clustered chunk tokens (v_i) compress visual tokens for retrieval
Infra Optimization
two-GPU setup with separated threads to reduce latency
Model Optimization
training-free design (no heavy finetuning)
System Optimization
decoupling feature extraction and memory updates to bound buffer size; tensor-parallel deployment across GPUs
Inference Optimization
parallel threads for frame stacking, memory formation, and summarization; selective frame stacking via optical-flow to drop redundant frames
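The idea of decoupling feature extraction from memory updates while bounding the buffer can be sketched with two threads and a fixed-size queue. This is a simplified illustration, not the authors' scheduler: `run_pipeline`, `extract`, `update_memory`, and `buffer_size` are hypothetical names, and the source describes a two-GPU setup that this CPU-only example does not model.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def run_pipeline(frames, extract, update_memory, buffer_size=8):
    """Run extraction and memory updates in separate threads,
    connected by a bounded queue so the buffer cannot grow without limit."""
    buffer = queue.Queue(maxsize=buffer_size)  # put() blocks when full

    def producer():
        for frame in frames:
            buffer.put(extract(frame))
        buffer.put(SENTINEL)

    def consumer():
        while True:
            item = buffer.get()
            if item is SENTINEL:
                break
            update_memory(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Here a full buffer makes the producer wait. A live system might instead drop the oldest features to keep latency flat, which trades recall for responsiveness.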

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Current retrieval uses basic similarity matching and can return wrong or irrelevant memory items.

Tree-structured long-term memory increases VRAM use and may not scale to very long streams.

When Not To Use

When strict, lossless frame-level recall is required (e.g., forensic analysis).

On extremely long unbounded streams where VRAM and retrieval cost cannot scale.

Failure Modes

Retrieval mismatch: wrong memory item leads to incorrect answers.

Information loss from aggressive frame skipping or clustering causes missed short events.

Core Entities

Models

STREAMCHAT (Slow/Base/Fast); LongVA; CLIP-L-P14; MiniLM-L6; LLaMA-3 (scorer)

Metrics

Accuracy; Score (0–5 semantic correctness via LLaMA-3); Coherence (score fluctuation); Request Processing Delay (RPD); FPS

Datasets

STREAMBENCH; EgoSchema; YouTube-8M; MSRVTT; MSVD; ActivityNet; NExT-QA

Benchmarks

STREAMBENCH (this paper); MSRVTT-QA; ActivityNet-QA; NExT-QA; MSVD-QA

Context Entities

Models

Video-LLaVA; LLaMA-VID; GPT-4o; MovieChat; Flash-VStream; Video-online

Metrics

Human benchmark scores

Datasets

YouTube-8M (source); EgoSchema (source)

Benchmarks

Prior streaming/offline video QA benchmarks