Overview
The approach is practical: hierarchical memory combined with parallel scheduling shows consistent accuracy and latency gains on the evaluated benchmarks, but retrieval quality and VRAM limits constrain its generality.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
StreamChat makes interactive video assistants and robotics more feasible by keeping request processing delay under 0.9s and improving streaming QA accuracy, which reduces user wait and raises answer quality in live settings.
Summary TLDR
StreamChat is a training-free Video-LLM pipeline that adds a three-tier hierarchical memory (short-term, long-term, dialogue) and parallel system scheduling to handle long streaming videos and multi-turn Q&A in real time. On the new STREAMBENCH dataset, the Slow configuration reaches 64.7% accuracy in online mode (+8.3 percentage points over Video-online at 56.4%), and the Fast configuration runs at up to 32 FPS with request processing delay (RPD) under 0.9s. The system trades VRAM and retrieval precision for speed and targets end-to-end interactive use cases such as robotics and video assistants.
Problem Statement
Existing Video-LLMs were built for offline clips or single-turn QA. They struggle with long streams, multi-round conversation, real-time latency, and memory-efficient storage of past visual context while staying accurate.
Main Contribution
STREAMCHAT: a training-free streaming pipeline combining selective frame stacking, hierarchical memory (short/long/dialogue), and parallel system scheduling for real-time multi-turn video QA.
STREAMBENCH: a new streaming benchmark (306 videos, 1.8K QA pairs) that mixes egocentric, web, work, and movie videos and includes six question types plus latency metrics.
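The three memory tiers named above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the class name, the buffer sizes, and the averaging step are all hypothetical, and the long-term tier is kept as a flat list of averaged chunks, whereas the source describes it as tree-structured.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """Toy three-tier memory (short-term, long-term, dialogue). All names are hypothetical."""

    short_capacity: int = 4   # assumed size of the short-term buffer
    chunk_length: int = 2     # assumed number of evicted frames merged per long-term entry
    short_term: deque = field(default_factory=deque)
    pending: list = field(default_factory=list)    # evicted frames waiting to be merged
    long_term: list = field(default_factory=list)  # flat list here; the source uses a tree
    dialogue: list = field(default_factory=list)   # (question, answer) pairs

    def add_frame(self, feature: list) -> None:
        """Store a frame feature; the oldest frames spill into long-term memory."""
        self.short_term.append(feature)
        if len(self.short_term) > self.short_capacity:
            self.pending.append(self.short_term.popleft())
        if len(self.pending) == self.chunk_length:
            # Assumed compression step: element-wise mean of the chunk.
            dim = len(self.pending[0])
            summary = [
                sum(f[i] for f in self.pending) / self.chunk_length
                for i in range(dim)
            ]
            self.long_term.append(summary)
            self.pending.clear()

    def add_turn(self, question: str, answer: str) -> None:
        """Record one round of conversation in dialogue memory."""
        self.dialogue.append((question, answer))
```

Feeding eight frames into a memory with `short_capacity=4` and `chunk_length=2` leaves the four most recent frames in `short_term` and two averaged entries in `long_term`, which shows how older context is kept only in compressed form.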
Key Findings
STREAMCHAT (Slow) achieves 64.7% accuracy on STREAMBENCH in the online setting, 8.3 percentage points above Video-online (56.4%).
STREAMCHAT (Fast) processes video at 32 FPS and keeps request processing delay below 0.9s.
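To make the two speed metrics concrete, here is a minimal timing sketch. It assumes FPS means frames processed per wall-clock second and RPD means wall-clock time from submitting a question to receiving the answer; the benchmark's exact definitions are not given in this summary, and `process_frame` and `answer_fn` are hypothetical stand-ins for a real pipeline.

```python
import time


def measure_fps(process_frame, frames):
    """Return frames processed per wall-clock second (assumed FPS definition)."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def measure_rpd(answer_fn, question):
    """Return (answer, seconds from request to response), an assumed RPD definition."""
    start = time.perf_counter()
    answer = answer_fn(question)
    return answer, time.perf_counter() - start
```

Averaging `measure_rpd` over many questions, rather than reporting a single call, gives a figure less sensitive to warm-up effects.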
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 64.7% | Video-online 56.4% | +8.3 pp | STREAMBENCH online | Table 4 shows Slow model Acc 64.7 vs Video-online 56.4 | Tab.4 |
| Processing throughput and latency (Fast) | 32 FPS; RPD 0.85–0.90s; text gen latency <0.9s | Video-online 5 FPS; RPD 1.07s | ~6.4x FPS; RPD −0.17s to −0.22s | STREAMBENCH online | Table 4 reports Fast 32 FPS and RPD 0.85; Video-online 5 FPS, RPD 1.07 | Tab.4, §3.2 |
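The deltas can be rechecked from the table's own numbers. The short calculation below uses only values stated in the rows above; the variable names are illustrative.

```python
# Values copied from the Results table.
acc_streamchat, acc_baseline = 64.7, 56.4   # accuracy, percent
fps_fast, fps_baseline = 32, 5              # frames per second
rpd_fast_range = (0.85, 0.90)               # seconds
rpd_baseline = 1.07                         # seconds

# Absolute accuracy gain in percentage points.
acc_delta = round(acc_streamchat - acc_baseline, 1)

# The same gain expressed relative to the baseline, in percent.
acc_relative = round(100 * acc_delta / acc_baseline, 1)

# Throughput ratio and RPD change at both ends of the reported range.
fps_ratio = fps_fast / fps_baseline
rpd_delta = tuple(round(r - rpd_baseline, 2) for r in rpd_fast_range)

print(acc_delta, acc_relative, fps_ratio, rpd_delta)
```

The accuracy gain is 8.3 percentage points, or about 14.7% relative to the baseline. The RPD reduction is 0.17s at the upper end of the reported range and 0.22s at the lower end.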
What To Try In 7 Days
Clone the StreamChat repo and run the Fast preset on a short live stream to measure FPS and RPD.
Benchmark the system on a small set of your videos and compare Fast/Base/Slow to balance accuracy vs latency.
Tune the optical-flow threshold and clustering chunk length to find your app's speed vs recall sweet spot.
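The threshold tuning in the last step can be prototyped before touching the real pipeline. The sketch below is a stand-in, not StreamChat's frame selector: it replaces optical flow with the mean absolute difference between flattened frames, and the function names are hypothetical. It keeps a frame only when it differs enough from the last kept frame, then reports the fraction of frames kept at each threshold.

```python
def motion_score(prev, curr):
    """Mean absolute difference between two flattened frames (a proxy, not optical flow)."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def select_frames(frames, threshold):
    """Return indices of frames whose motion versus the last kept frame exceeds threshold."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if motion_score(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept


def sweep(frames, thresholds):
    """Map each threshold to the fraction of frames kept."""
    return {t: len(select_frames(frames, t)) / len(frames) for t in thresholds}


# Tiny synthetic stream: two static frames, a small change, then a large change.
frames = [[0, 0], [0, 0], [1, 1], [1, 1], [5, 5]]
print(sweep(frames, [0.5, 2, 10]))
```

A higher threshold keeps fewer frames, which raises speed but risks dropping short events. Plotting the kept fraction against answer accuracy on your own videos is one way to locate the trade-off point.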
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Current retrieval uses basic similarity matching and can return wrong or irrelevant memory items.
Tree-structured long-term memory increases VRAM use and may not scale to very long streams.
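The retrieval limitation is easier to see in a small example. The sketch below assumes cosine similarity over feature vectors, since the summary says only "basic similarity matching", and it adds two optional guards (`min_score`, `min_margin`) that are hypothetical and not part of the described system.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, memory, min_score=0.0, min_margin=0.0):
    """Return (key, score) of the best match, or None if it is weak or ambiguous.

    min_score and min_margin are hypothetical guards, not from the source.
    """
    scored = sorted(
        ((cosine(query, vec), key) for key, vec in memory.items()),
        reverse=True,
    )
    if not scored:
        return None
    best_score, best_key = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else -1.0
    if best_score < min_score or best_score - runner_up < min_margin:
        return None
    return best_key, best_score


memory = {"door": [1.0, 0.0], "cup": [0.0, 1.0]}
```

With no guards, a query such as `[0.7, 0.7]` scores both items equally and still returns one of them, which is the kind of mismatch described above. Setting `min_margin=0.1` makes the same call return `None`, so the caller can fall back to another strategy instead of answering from the wrong memory item.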
When Not To Use
When strict, lossless frame-level recall is required (e.g., forensic analysis).
On extremely long unbounded streams where VRAM and retrieval cost cannot scale.
Failure Modes
Retrieval mismatch: wrong memory item leads to incorrect answers.
Information loss from aggressive frame skipping or clustering causes missed short events.

