Overview
Semantic Event Graphs (SEG) offer a practical, low-cost interface for long-video QA, but the evidence is limited to five videos and automatic LLM judging, so expect engineering work and broader evaluation before production deployment.
Citations: 0
Evidence Strength: 0.50
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 72%
Production readiness: 45%
Novelty: 70%
Why It Matters For Business
SEG cuts LLM input tokens by 91.4% (roughly 11.6×, from 40.39k to 3.47k tokens per query) for long-video QA while matching the accuracy of the unpruned log, letting companies add long-horizon video reasoning without expensive model or GPU scaling.
Summary TLDR
The paper introduces Semantic Event Graphs (SEG): a pipeline that turns long videos into START/END human-object events, builds a Temporal Scene Graph (TSG), prunes the graph per query, verbalizes the subgraph, and sends it to a multimodal LLM (Gemini 2.5 Flash). On five Creative Commons YouTube videos and 120 auto-generated long-horizon questions, SEG reduced input tokens by 91.4% (40.39k → 3.47k tokens) while matching or slightly improving QA accuracy (Full Log 62.5% vs TSG 65.0%, a gap of about 3 questions out of 120). Results are promising but limited by a small dataset and automatic LLM-based evaluation.
Problem Statement
Long videos have thousands of frames, making direct VLM prompting costly or impossible. Existing frame sampling or dense embeddings either miss important interactions or blow up token costs. The paper asks: can we compress long videos into lightweight symbolic events that preserve order, duration, and causality so off-the-shelf LLMs can answer long-horizon questions with far fewer tokens?
Main Contribution
Semantic Event Graphs (SEG): a pipeline that extracts START/END human-object interaction events from tracked detections and assembles them into a Temporal Scene Graph (TSG).
Query-aware pruning that selects a small, relevant subgraph and verbalizes it as a compact narrative for an LLM.
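The first contribution can be illustrated with a minimal sketch of START/END event extraction. It is not the paper's implementation: it assumes an upstream detector and tracker already emit one `(frame, person_id, object_label)` tuple for every frame in which an interaction is observed, and the `Event` class, the `extract_events` function, and the `max_gap` threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "START" or "END"
    frame: int   # frame index at which the event is placed
    person: str  # tracker ID of the person (assumed input format)
    obj: str     # detector label of the object


def extract_events(observations, max_gap=5):
    """Collapse per-frame interaction observations into START/END events.

    An interaction is closed once it has not been observed for more than
    `max_gap` frames (an illustrative threshold, not a value from the paper).
    """
    events = []
    last_seen = {}  # (person, obj) -> last frame in which the pair was observed
    for frame, person, obj in sorted(observations):
        # Close interactions that have been silent for too long.
        for key, last in list(last_seen.items()):
            if frame - last > max_gap:
                events.append(Event("END", last, *key))
                del last_seen[key]
        key = (person, obj)
        if key not in last_seen:
            events.append(Event("START", frame, person, obj))
        last_seen[key] = frame
    # Close whatever is still open when the observations run out.
    for key, last in last_seen.items():
        events.append(Event("END", last, *key))
    # Stable sort keeps each START ahead of its own END on the same frame.
    return sorted(events, key=lambda e: e.frame)


observations = [(0, "p1", "cup"), (1, "p1", "cup"), (2, "p1", "cup"),
                (20, "p1", "cup"), (21, "p1", "cup")]
events = extract_events(observations, max_gap=5)
```

In this toy input, five frame-level observations collapse into four events (two START/END pairs), which is the kind of compression that keeps the downstream prompt small.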
Key Findings
SEG cuts input tokens by 91.4% on average (40.39k → 3.47k tokens per query).
TSG matches or slightly improves QA accuracy compared to full unpruned logs (65.0% vs 62.5%).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average tokens per query | 3.47k | Full Log 40.39k | -91.4% | Five videos, 120 QA pairs | Table 1; Table 2 | Section 4; Tables 1–2 |
| Accuracy | 65.0% | Full Log 62.5% | +2.5 pp | Five videos, 120 QA pairs | Table 1; Section 4.1 | Table 1 |
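
The deltas in the table can be re-derived from the reported values. The short check below uses only the numbers above; the function names are illustrative, and converting accuracy into whole question counts assumes both systems were scored on the same 120 questions.

```python
def token_reduction(baseline_tokens: float, pruned_tokens: float) -> tuple[float, float]:
    """Return (percent reduction, compression ratio) for two token counts."""
    reduction_pct = (1 - pruned_tokens / baseline_tokens) * 100
    ratio = baseline_tokens / pruned_tokens
    return reduction_pct, ratio


def correct_answers(accuracy_pct: float, n_questions: int) -> int:
    """Convert an accuracy percentage into a whole number of correct answers."""
    return round(accuracy_pct / 100 * n_questions)


reduction_pct, ratio = token_reduction(40.39e3, 3.47e3)
extra = correct_answers(65.0, 120) - correct_answers(62.5, 120)
print(f"{reduction_pct:.1f}% fewer tokens, {ratio:.1f}x compression, "
      f"{extra} more correct answers")
# prints: 91.4% fewer tokens, 11.6x compression, 3 more correct answers
```

The +2.5 pp accuracy gain therefore corresponds to about three questions, which is why the result is better read as "accuracy preserved" than as a clear improvement.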
What To Try In 7 Days
Run YOLOv11 + tracking on a small set of long videos to extract START/END events.
Build a simple temporal graph (NetworkX) and convert a pruned subgraph to text for an off-the-shelf LLM.
Compare cost and accuracy vs sending dense frames or embeddings to your existing VLM.
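As a starting point for the second step, here is a rough sketch of building a small temporal graph, pruning it against a query, and turning the result into text. It uses plain Python containers rather than NetworkX so it runs without dependencies; the same edges could be stored in a NetworkX graph. The interval format, the function names, and the substring-based anchor matching are assumptions for illustration, not the paper's code.

```python
from collections import defaultdict


def build_tsg(intervals):
    """Index (person, obj, start_s, end_s) intervals by the nodes they touch."""
    graph = defaultdict(list)
    for person, obj, start, end in intervals:
        edge = (person, obj, start, end)
        graph[person].append(edge)
        graph[obj].append(edge)
    return graph


def prune(graph, query):
    """Keep only edges touching a node whose label appears in the query.

    Plain substring matching is used here, so "cup" would also match
    "cupboard"; a real system would need stricter grounding.
    """
    q = query.lower()
    anchors = [node for node in graph if node.lower() in q]
    kept = {edge for node in anchors for edge in graph[node]}
    return sorted(kept, key=lambda e: e[2])  # chronological order by start time


def verbalize(edges):
    """Render the pruned edges as a compact narrative for the LLM prompt."""
    return "\n".join(
        f"{person} interacts with {obj} from {start}s to {end}s."
        for person, obj, start, end in edges
    )


intervals = [
    ("person_1", "cup", 12.0, 18.5),
    ("person_1", "laptop", 30.0, 95.0),
    ("person_2", "cup", 100.0, 104.0),
]
tsg = build_tsg(intervals)
narrative = verbalize(prune(tsg, "Who picked up the cup?"))
print(narrative)
```

Only the two `cup` interactions survive pruning for this query, so the prompt carries two short sentences instead of the full log.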
Agent Features
Memory
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Small evaluation: only five videos and 120 auto-generated QA pairs, so results may not generalize.
Query grounding is string-match based and brittle to synonyms, paraphrases, and pronouns.
When Not To Use
Tasks that require fine-grained appearance recognition (color, texture, exact identity).
Scenarios with frequent off-camera actions or missing detection coverage.
Failure Modes
Lexical brittleness: anchor string matches miss synonyms (e.g., 'mug' vs 'cup').
Missing events when actors leave frame, leading to incomplete answers.
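The lexical brittleness failure is easy to reproduce. The sketch below grounds a query against graph labels by exact token match, then shows one possible mitigation with a hand-written alias table. The `ground` function and the alias table are hypothetical and not part of SEG; an alias table also does nothing for pronouns or freer paraphrases.

```python
def ground(query, labels, aliases=None):
    """Return the graph labels mentioned in the query.

    `aliases` maps surface words to canonical labels, e.g. {"mug": "cup"}.
    """
    aliases = aliases or {}
    tokens = {t.strip(".,?!'\"").lower() for t in query.split()}
    canonical = {aliases.get(t, t) for t in tokens}
    return sorted(label for label in labels if label in canonical)


labels = ["cup", "laptop"]
query = "When did she put down the mug?"

no_alias = ground(query, labels)                     # nothing is anchored
with_alias = ground(query, labels, {"mug": "cup"})   # "mug" resolves to "cup"
print(no_alias, with_alias)
# prints: [] ['cup']
```

Without an anchor, query-aware pruning has nothing to select, so the answer quality depends on how well query words line up with detector labels.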

