Overview
Production Readiness
0.45
Novelty Score
0.7
Cost Impact Score
0.72
Citation Count
0
Why It Matters For Business
SEG cuts LLM input tokens for long-video QA by about 91% (roughly 11.6×, from 40.39k to 3.47k tokens per query) while matching full-log accuracy, letting companies add long-horizon video reasoning without expensive model or GPU scaling.
Summary TLDR
The paper introduces Semantic Event Graphs (SEG): a pipeline that converts long videos into START/END human-object interaction events, builds a Temporal Scene Graph (TSG), prunes the graph per query, verbalizes the resulting subgraph, and sends it to a large multimodal LLM (Gemini 2.5 Flash). On five Creative Commons YouTube videos and 120 auto-generated long-horizon questions, SEG reduced input tokens by 91.4% (40.39k → 3.47k tokens) while matching or slightly improving QA accuracy (Full Log 62.5% vs TSG 65.0%). The results are promising but limited by a small dataset and automatic LLM-based evaluation.
Problem Statement
Long videos have thousands of frames, making direct VLM prompting costly or impossible. Existing frame sampling or dense embeddings either miss important interactions or blow up token costs. The paper asks: can we compress long videos into lightweight symbolic events that preserve order, duration, and causality so off-the-shelf LLMs can answer long-horizon questions with far fewer tokens?
Main Contribution
Semantic Event Graphs (SEG): a pipeline that extracts START/END human-object interaction events from tracked detections and assembles them into a Temporal Scene Graph (TSG).
Query-aware pruning that selects a small, relevant subgraph and verbalizes it as a compact narrative for an LLM.
Empirical demonstration: 91.4% average token reduction with equal-or-better QA accuracy on a 5-video, 120-question benchmark using Gemini 2.5 Flash.
A small public dataset of five long-form Creative Commons YouTube videos plus 120 auto-generated QA pairs; authors state code and tools will be released.
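The first contribution, turning tracked detections into START/END events, can be illustrated with a minimal sketch. It is not the authors' code: the `Event` record, the `extract_events` helper, the per-frame input format, and the `min_gap` debounce are all assumptions made for illustration, and the decision that a person and an object are "interacting" in a frame is assumed to come from an upstream detector and tracker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "START" or "END"
    person: str  # track id of the human, e.g. "person_1"
    obj: str     # track id of the object, e.g. "cup_3"
    t: float     # timestamp in seconds


def extract_events(frames, fps=1.0, min_gap=2):
    """Turn per-frame interaction pairs into START/END events.

    frames: list of sets of (person, obj) pairs judged to be interacting
    in that frame (hypothetical input format). An interaction ends once
    the pair has been absent for `min_gap` consecutive frames, which
    absorbs short detector flicker.
    """
    active = {}  # pair -> index of the last frame where it was seen
    events = []
    for i, pairs in enumerate(frames):
        for pair in sorted(pairs):
            if pair not in active:
                events.append(Event("START", pair[0], pair[1], i / fps))
            active[pair] = i
        for pair, last in sorted(active.items()):
            if i - last >= min_gap:
                events.append(Event("END", pair[0], pair[1], (last + 1) / fps))
                del active[pair]
    # Close any interaction still open when the video ends.
    for pair, last in sorted(active.items()):
        events.append(Event("END", pair[0], pair[1], (last + 1) / fps))
    events.sort(key=lambda e: (e.t, e.kind))
    return events


frames = [
    {("person_1", "cup_3")},
    {("person_1", "cup_3")},
    set(),
    set(),
    {("person_1", "cup_3")},
]
for event in extract_events(frames):
    print(event)
```

With this toy input the pair is seen for two frames, disappears for two, then reappears, so the sketch emits two START/END pairs. A shorter gap than `min_gap` would be treated as one continuous interaction.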
Key Findings
SEG cuts token input by 91.4% on average.
TSG matches or slightly improves QA accuracy compared to full unpruned logs (65.0% vs 62.5%).
Short-context (last 30s) baseline fails for long-horizon queries.
Results
Average tokens per query: 40.39k (Full Log) vs 3.47k (TSG)
Accuracy: 62.5% (Full Log) vs 65.0% (TSG)
Who Should Care
What To Try In 7 Days
Run YOLOv11 + tracking on a small set of long videos to extract START/END events.
Build a simple temporal graph (NetworkX) and convert a pruned subgraph to text for an off-the-shelf LLM.
Compare cost and accuracy vs sending dense frames or embeddings to your existing VLM.
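As a starting point for the graph and pruning steps above, here is a rough sketch using only the Python standard library. A plain adjacency map stands in for a NetworkX `MultiDiGraph`, and the edge tuples, verbs, timestamps, and the `build_graph`, `prune`, and `verbalize` helpers are hypothetical names and data chosen for illustration, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical interaction intervals: (person, verb, object, start_s, end_s),
# as they might look after pairing START/END events.
EDGES = [
    ("person_1", "holds", "cup", 12.0, 40.0),
    ("person_1", "opens", "fridge", 45.0, 50.0),
    ("person_2", "holds", "knife", 60.0, 95.0),
    ("person_1", "holds", "knife", 100.0, 130.0),
]


def build_graph(edges):
    """Adjacency map: node -> list of edges touching it (parallel edges allowed)."""
    graph = defaultdict(list)
    for edge in edges:
        person, _verb, obj, _start, _end = edge
        graph[person].append(edge)
        graph[obj].append(edge)
    return graph


def prune(graph, query):
    """Keep edges within one hop of any node whose name appears in the query."""
    words = set(query.lower().replace("?", " ").split())
    anchors = [node for node in graph if node.lower() in words]
    kept = {edge for node in anchors for edge in graph[node]}
    return sorted(kept, key=lambda e: e[3])  # chronological order


def verbalize(edges):
    """Render the pruned subgraph as a compact, time-stamped narrative."""
    return "\n".join(
        f"[{start:.0f}s-{end:.0f}s] {person} {verb} {obj}"
        for person, verb, obj, start, end in edges
    )


graph = build_graph(EDGES)
subgraph = prune(graph, "Who held the knife after the fridge was opened?")
print(verbalize(subgraph))
```

The query anchors on "knife" and "fridge" by exact string match, so the unrelated cup interaction is dropped and only three lines reach the LLM prompt. The printed text is what you would then compare, in token count and answer accuracy, against sending dense frames or embeddings.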
Agent Features
Memory
- Temporal Scene Graph (symbolic event memory)
Optimization Features
Token Efficiency
- 91.4% token compression
- 3.47k tokens per query vs 40.39k
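The two figures above are consistent with each other, as a quick check shows. The token counts are the reported averages; the variable names are only for illustration.

```python
full_log_tokens = 40_390  # 40.39k, reported average for the unpruned log
pruned_tokens = 3_470     # 3.47k, reported average for the pruned TSG

reduction = 1 - pruned_tokens / full_log_tokens
ratio = full_log_tokens / pruned_tokens

print(f"reduction: {reduction:.1%}")  # reduction: 91.4%
print(f"ratio: {ratio:.1f}x")         # ratio: 11.6x
```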
System Optimization
- Offloads visual processing to lightweight detection+tracking
- Symbolic bottleneck reduces LLM attention burden
Inference Optimization
- Token reduction via query pruning
- Reduced LLM context size per query
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small evaluation: only five videos and 120 auto-generated QAs, so results may not generalize.
- Query grounding is string-match based and brittle to synonyms, paraphrases, and pronouns.
- No visual grounding for appearance attributes (color, fine identity, pose).
- Off-camera actions are unrecorded, causing gaps in the event log.
- Evaluation uses the same model family (Gemini) as answerer and judge, introducing potential bias.
When Not To Use
- Tasks that require fine-grained appearance recognition (color, texture, exact identity).
- Scenarios with frequent off-camera actions or missing detection coverage.
- Applications needing verified human-evaluated correctness at scale without model-based auto-judging.
Failure Modes
- Lexical brittleness: anchor string matches miss synonyms (e.g., 'mug' vs 'cup').
- Missing events when actors leave frame, leading to incomplete answers.
- Complex chains involving many entities can be hard to reconstruct from 1-hop pruning.
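The lexical brittleness failure mode is easy to reproduce. The sketch below assumes exact word matching between the query and node labels, which is one plausible reading of "string-match based" grounding. The `ALIASES` table and both helper functions are hypothetical, and the alias lookup is shown only as one possible mitigation, not something the paper proposes.

```python
def match_anchors(query, node_labels):
    """Exact string match between query words and graph node labels."""
    words = set(query.lower().replace("?", " ").split())
    return sorted(label for label in node_labels if label in words)


# Hypothetical alias table mapping query vocabulary to detector labels.
ALIASES = {"mug": "cup", "refrigerator": "fridge"}


def match_anchors_with_aliases(query, node_labels, aliases=ALIASES):
    """Same matching, after normalizing query words through the alias table."""
    words = query.lower().replace("?", " ").split()
    normalized = {aliases.get(word, word) for word in words}
    return sorted(label for label in node_labels if label in normalized)


labels = {"cup", "fridge", "person_1"}
query = "When did she pick up the mug?"

print(match_anchors(query, labels))               # []
print(match_anchors_with_aliases(query, labels))  # ['cup']
```

With no anchor found, pruning returns an empty subgraph and the LLM has nothing to answer from. The alias table recovers 'cup' here, but the pronoun "she" is still unresolved, so paraphrases and coreference remain open problems for this kind of grounding.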
Core Entities
Models
- YOLOv11
- Gemini 2.5 Flash
- NetworkX MultiDiGraph
Metrics
- Accuracy
- Token count
- Compression ratio
Datasets
- Five Creative Commons YouTube videos (10–20 min each)
Benchmarks
- 120 auto-generated long-horizon QA pairs (authors' evaluation)
Context Entities
Models
- LLM-guided question generation (unspecified LLM)
Metrics
- Accuracy
Datasets
- YouTube video URLs included in paper
Benchmarks
- Category breakdown: Temporal Ordering, Object Interaction, Duration Reasoning, Causal Chains

