Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
StreamChat makes interactive video assistants and robotics feasible by cutting latency below 1s and improving streaming QA accuracy, reducing user wait and raising answer quality in live settings.
Summary TLDR
StreamChat is a training-free Video-LLM pipeline that adds a three-tier hierarchical memory (short-term, long-term, dialogue) plus parallel system scheduling to handle long streaming videos and multi-turn Q&A in real time. On the new STREAMBENCH dataset it reaches 64.7% accuracy in online mode (+8.3% vs prior work), runs up to 32 FPS, and keeps request processing delay under 0.9s. The system trades VRAM and retrieval precision for speed and supports end-to-end interactive use cases like robotics and video assistants.
Problem Statement
Existing Video-LLMs were built for offline clips or single-turn QA. They struggle with long streams, multi-round conversation, real-time latency, and memory-efficient storage of past visual context while staying accurate.
Main Contribution
STREAMCHAT: a training-free streaming pipeline combining selective frame stacking, hierarchical memory (short/long/dialogue), and parallel system scheduling for real-time multi-turn video QA.
STREAMBENCH: a new streaming benchmark (306 videos, 1.8K QA pairs) that mixes egocentric, web, work, and movie videos and includes six question types plus latency metrics.
Empirical analysis showing that each memory component maps to specific task improvements, along with the practical trade-offs between speed, VRAM, and accuracy.
Key Findings
STREAMCHAT (Slow) achieves 64.7% accuracy on STREAMBENCH in online setting.
STREAMCHAT (Fast) processes video at 32 FPS and keeps request processing delay below 0.9s.
Memory components yield task-specific gains: long-term memory (M_l) improves long-term-memory tasks by 6.2%, short-term memory (M_s) improves short-term tasks by 3.2%, and dialogue memory (M_d) improves conversational interaction by 4.1%.
Raising the motion threshold speeds up processing until throughput saturates at 32 FPS, but it can lower accuracy (64.0% → 60.7%).
Across offline benchmarks, STREAMCHAT improves average accuracy by ~2.5% over LongVA when long-term memory is used.
Results
[Figure panels: Accuracy; Processing throughput and latency (Fast); Accuracy]
Who Should Care
What To Try In 7 Days
Clone the StreamChat repo and run the Fast preset on a short live stream to measure FPS and RPD.
Benchmark the system on a small set of your videos and compare Fast/Base/Slow to balance accuracy vs latency.
Tune the optical-flow threshold and clustering chunk length to find your app's speed vs recall sweet spot.
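The FPS and RPD measurements suggested above can be sketched with a small timing harness. This is a minimal sketch, not part of the StreamChat repo: `measure_fps`, `measure_rpd`, and the stub callables are hypothetical names, and RPD is assumed here to mean the wall-clock time between submitting a question and receiving the answer.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_fps(process_frame: Callable[[object], None], frames: Iterable[object]) -> float:
    """Return frames processed per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def measure_rpd(answer: Callable[[str], str], questions: List[str]) -> List[Tuple[str, float]]:
    """Return (question, delay_seconds) pairs, one per request."""
    delays = []
    for question in questions:
        start = time.perf_counter()
        answer(question)
        delays.append((question, time.perf_counter() - start))
    return delays


if __name__ == "__main__":
    # Stand-ins for the real pipeline (assumption): replace with calls into your deployment.
    fps = measure_fps(lambda frame: time.sleep(0.001), range(50))
    rpd = measure_rpd(lambda q: "stub answer", ["What is on the table?"])
    print(f"FPS: {fps:.1f}")
    print(f"max RPD: {max(delay for _, delay in rpd):.4f}s")
```

Running the same harness against the Fast, Base, and Slow presets gives comparable numbers for the accuracy versus latency comparison.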
Agent Features
Memory
- short-term memory (recent embeddings)
- long-term memory tree (clustered chunks)
- dialogue memory (pre-encoded QA pairs)
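The three tiers listed above can be pictured as one small container. This is a toy sketch under stated assumptions: the class name, the capacity parameter, and the rule that the oldest short-term entry moves into long-term storage are illustrative choices, not the method described in the paper (which clusters chunks into a tree).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Vector = List[float]


@dataclass
class HierarchicalMemory:
    """Toy three-tier store: short-term, long-term, and dialogue memory."""

    short_capacity: int = 4
    short_term: Deque[Vector] = field(default_factory=deque)
    long_term: List[Vector] = field(default_factory=list)
    dialogue: List[Tuple[str, str]] = field(default_factory=list)

    def add_frame(self, embedding: Vector) -> None:
        """Append to short-term memory; move the oldest entry to long-term when full."""
        self.short_term.append(embedding)
        if len(self.short_term) > self.short_capacity:
            # Assumption: simple FIFO hand-off instead of clustering into a tree.
            self.long_term.append(self.short_term.popleft())

    def add_turn(self, question: str, answer: str) -> None:
        """Record one question/answer pair in dialogue memory."""
        self.dialogue.append((question, answer))
```

Keeping short-term memory bounded is what lets recent context stay cheap to read while older context is pushed to a separate store.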
Planning
- parallel system scheduling across threads
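One way to picture scheduling across threads is a producer and a consumer joined by a bounded queue. The sketch below is an illustration only: `run_pipeline`, the `feat-` strings, and the two-stage split are hypothetical stand-ins for the real feature extraction and memory update stages.

```python
import queue
import threading
from typing import List

_STOP = object()  # sentinel that tells the consumer to finish


def run_pipeline(frames: List[int], buffer_size: int = 8) -> List[str]:
    """Run extraction and memory update in separate threads, joined by a bounded queue."""
    features = queue.Queue(maxsize=buffer_size)
    memory: List[str] = []

    def extract() -> None:
        for frame in frames:
            # put() blocks when the buffer is full, which bounds its size.
            features.put(f"feat-{frame}")
        features.put(_STOP)

    def update_memory() -> None:
        while True:
            item = features.get()
            if item is _STOP:
                break
            memory.append(item)

    workers = [threading.Thread(target=extract), threading.Thread(target=update_memory)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return memory
```

With a single producer and a single consumer, the queue preserves frame order, so memory entries arrive in the order frames were seen.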
Tool Use
- FAISS similarity search
- CLIP vision encoder
- Lucas-Kanade optical flow
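Similarity search over stored embeddings can be illustrated with a brute-force cosine ranking. This is a plain-Python stand-in for what a vector index does at scale; it does not use or imitate the FAISS API, and `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float], memory: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the highest similarity to the query."""
    scored = [(i, cosine(query, item)) for i, item in enumerate(memory)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, `top_k([1, 0], [[1, 0], [0, 1], [1, 1]], k=2)` ranks item 0 first and item 2 second. Because the ranking depends only on embedding similarity, a near match that is semantically wrong can still be returned.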
Frameworks
- LongVA (base)
- CLIP-L-P14
- MiniLM-L6
Is Agentic
true
Architectures
- training-free Video-LLM pipeline
- hierarchical memory tree
Optimization Features
Token Efficiency
- clustered chunk tokens (v_i) compress visual tokens for retrieval
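The idea of replacing a chunk of frame embeddings with one representative vector v_i can be sketched as follows. Mean pooling over fixed-length consecutive chunks is an assumption made for illustration; the source does not specify the clustering method, and `compress_chunks` and `build_tree` are hypothetical names.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compress_chunks(embeddings: List[Vector], chunk_len: int) -> List[Vector]:
    """Split consecutive embeddings into chunks and keep one mean vector per chunk."""
    if chunk_len < 1:
        raise ValueError("chunk_len must be at least 1")
    return [
        mean_vector(embeddings[i:i + chunk_len])
        for i in range(0, len(embeddings), chunk_len)
    ]


def build_tree(embeddings: List[Vector], chunk_len: int) -> List[List[Vector]]:
    """Stack compression levels until a single root vector remains."""
    if chunk_len < 2:
        raise ValueError("chunk_len must be at least 2 to shrink each level")
    levels = [embeddings]
    while len(levels[-1]) > 1:
        levels.append(compress_chunks(levels[-1], chunk_len))
    return levels
```

A longer chunk length gives fewer stored vectors per level and cheaper retrieval, at the cost of blurring short events inside each chunk.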
Infra Optimization
- two-GPU setup with separate threads to reduce latency
Model Optimization
- training-free design (no heavy finetuning)
System Optimization
- decoupling feature extraction and memory updates to bound buffer size
- tensor-parallel deployment across GPUs
Inference Optimization
- parallel threads for frame stacking, memory formation, and summarization
- selective frame stacking via optical-flow to drop redundant frames
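Dropping redundant frames with a motion threshold can be sketched as below. Mean absolute pixel change is used here as a crude stand-in for optical-flow magnitude, and comparing each frame against the last kept frame is an assumption; neither is claimed to match the paper's Lucas-Kanade procedure. Frames are flattened grayscale lists for simplicity.

```python
from typing import List

Frame = List[float]  # flattened grayscale pixels in [0, 1]


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel change (stand-in for optical-flow magnitude)."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def select_frames(frames: List[Frame], threshold: float) -> List[int]:
    """Keep the first frame, then any frame that moved enough since the last kept one."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if motion_score(frames[kept[-1]], frames[i]) >= threshold:
            kept.append(i)
    return kept
```

Raising `threshold` keeps fewer frames, which speeds up downstream processing but can discard brief events.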
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Current retrieval uses basic similarity matching and can return wrong or irrelevant memory items.
- Tree-structured long-term memory increases VRAM use and may not scale to very long streams.
- Selective frame skipping can remove brief but important events, hurting recall of small objects and short-lived events.
- Deployment shown on two GPUs; single-GPU or serverless deployment will need engineering work.
When Not To Use
- When strict, lossless frame-level recall is required (e.g., forensic analysis).
- On extremely long unbounded streams where VRAM and retrieval cost cannot scale.
- When you cannot afford multi-GPU deployment or need deterministic, exact answers.
Failure Modes
- Retrieval mismatch: wrong memory item leads to incorrect answers.
- Information loss from aggressive frame skipping or clustering causes missed short events.
- Hallucination or incorrect synthesis when dialogue history contains prior errors.
- Reduced accuracy when tuning thresholds for higher FPS.
Core Entities
Models
- STREAMCHAT (Slow/Base/Fast)
- LongVA
- CLIP-L-P14
- MiniLM-L6
- LLaMA-3 (scorer)
Metrics
- Accuracy
- Score (0–5 semantic correctness via LLaMA-3)
- Coherence (score fluctuation)
- Request Processing Delay (RPD)
- FPS
Datasets
- STREAMBENCH
- EgoSchema
- YouTube-8M
- MSRVTT
- MSVD
- ActivityNet
- NExT-QA
Benchmarks
- STREAMBENCH (this paper)
- MSRVTT-QA
- ActivityNet-QA
- NExT-QA
- MSVD-QA
Context Entities
Models
- Video-LLaVA
- LLaMA-VID
- GPT-4o
- MovieChat
- Flash-VStream
- Video-online
Metrics
- Human benchmark scores
Datasets
- YouTube-8M (source)
- EgoSchema (source)
Benchmarks
- Prior streaming/offline video QA benchmarks

