Compress long videos into timestamped event graphs so LLMs can answer long-horizon questions cheaply

January 2, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

SEG is a practical, low-cost interface for long-video QA, but evidence is limited to five videos and automatic LLM judging, so expect engineering work and broader evaluation before production deployment.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 72%

Production readiness: 45%

Novelty: 70%

Authors

Aradhya Dixit, Tianxi Liang

Links

Abstract / PDF / Data

Why It Matters For Business

SEG cuts LLM input tokens by roughly 10× for long-video QA while holding accuracy steady in the paper's five-video evaluation, which lets companies add long-horizon video reasoning without scaling to larger models or more GPUs.

Who Should Care

Summary TLDR

The paper introduces Semantic Event Graphs (SEG): a pipeline that turns the frames of a long video into START/END human-object interaction events, builds a Temporal Scene Graph (TSG) from them, prunes the graph per query, verbalizes the resulting subgraph, and sends that text to a large multimodal LLM (Gemini 2.5 Flash). On five Creative Commons YouTube videos and 120 auto-generated long-horizon questions, SEG reduced input tokens by 91.4% (40.39k → 3.47k tokens per query) while matching or slightly improving QA accuracy (62.5% for the Full Log baseline vs 65.0% for TSG). The results are promising but limited by the small dataset and by automatic LLM-based evaluation.

Problem Statement

Long videos have thousands of frames, making direct VLM prompting costly or impossible. Existing frame sampling or dense embeddings either miss important interactions or blow up token costs. The paper asks: can we compress long videos into lightweight symbolic events that preserve order, duration, and causality so off-the-shelf LLMs can answer long-horizon questions with far fewer tokens?

Main Contribution

Semantic Event Graphs (SEG): a pipeline that extracts START/END human-object interaction events from tracked detections and assembles them into a Temporal Scene Graph (TSG).

Query-aware pruning that selects a small, relevant subgraph and verbalizes it as a compact narrative for an LLM.
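
The first contribution can be illustrated with a minimal sketch. It is not the authors' code: the per-frame input format, the `gap` tolerance, and the names `extract_events` and `build_tsg` are assumptions made for illustration, and a plain dict of edge lists stands in for the NetworkX MultiDiGraph named elsewhere in this digest. How a frame decides that a person and an object are "interacting" (for example, by box overlap) is left outside the sketch.

```python
from collections import defaultdict


def extract_events(frames, fps=1.0, gap=1):
    """Turn per-frame interaction observations into START/END events.

    frames: list where frames[i] is a set of (person_id, object_label)
            pairs judged to be interacting in frame i (assumed input format).
    gap:    number of consecutive missing frames tolerated before an
            interaction is closed (hypothetical smoothing parameter).
    """
    open_since = {}  # pair -> frame index where the interaction started
    last_seen = {}   # pair -> last frame index where it was observed
    events = []

    def close(pair):
        start = open_since.pop(pair)
        end = last_seen.pop(pair)
        events.append({"person": pair[0], "object": pair[1],
                       "start": start / fps, "end": end / fps})

    for i, active in enumerate(frames):
        for pair in active:
            open_since.setdefault(pair, i)
            last_seen[pair] = i
        # Close interactions that have been missing for more than `gap` frames.
        for pair in [p for p in open_since if i - last_seen[p] > gap]:
            close(pair)
    for pair in list(open_since):  # close whatever is still open at the end
        close(pair)
    return sorted(events, key=lambda e: (e["start"], e["person"], e["object"]))


def build_tsg(events):
    """Dict-based multigraph: (person, object) -> list of (start, end) edges."""
    graph = defaultdict(list)
    for e in events:
        graph[(e["person"], e["object"])].append((e["start"], e["end"]))
    return dict(graph)


frames = [
    {("person_1", "cup")},     # frame 0
    {("person_1", "cup")},     # frame 1
    set(),                     # frame 2: one dropped detection, tolerated
    {("person_1", "cup")},     # frame 3
    set(), set(),              # frames 4-5: interaction ends
    {("person_1", "laptop")},  # frame 6
]
events = extract_events(frames, fps=1.0, gap=1)
tsg = build_tsg(events)
print(events)
print(tsg)
```

Keeping a list of timed edges per (person, object) pair is what makes this a multigraph: the same person can pick up the same object several times, and each occurrence keeps its own start and end time.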

Key Findings

SEG cuts token input by 91.4% on average.

Numbers: Full Log 40.39k tokens → TSG 3.47k tokens (91.4% reduction)

Practical Use: Use SEG to reduce LLM token cost by an order of magnitude when answering long-video questions.

Evidence Ref: Abstract; Table 1; Table 2

TSG matches or slightly improves QA accuracy compared to full unpruned logs.

Numbers: Accuracy 62.5% (Full Log) → 65.0% (TSG)

Practical Use: Compression via symbolic events can preserve reasoning quality while saving cost; try TSG before scaling model context.

Evidence Ref: Table 1; Section 4.1
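
The headline numbers in both findings can be re-derived with a few lines of arithmetic. The sketch below uses only the figures quoted above; converting the accuracies to question counts assumes that both conditions were scored on all 120 questions, which this digest implies but does not state outright.

```python
full_log_tokens = 40_390  # 40.39k tokens per query, Full Log baseline
tsg_tokens = 3_470        # 3.47k tokens per query, pruned TSG

reduction = 1 - tsg_tokens / full_log_tokens
ratio = full_log_tokens / tsg_tokens
print(f"token reduction: {reduction:.1%}")
print(f"compression ratio: {ratio:.1f}x")

# Assumption: both conditions were evaluated on the same 120 questions.
n_questions = 120
full_correct = round(0.625 * n_questions)
tsg_correct = round(0.650 * n_questions)
print(f"correct answers: Full Log {full_correct}, TSG {tsg_correct}, "
      f"difference {tsg_correct - full_correct}")
```

Under that assumption, the 2.5-point accuracy gain corresponds to three additional correct answers out of 120.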

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average tokens per query | 3.47k | Full Log 40.39k | -91.4% | Five videos, 120 QA pairs | Table 1; Table 2 | Section 4; Tables 1–2
Accuracy | 65.0% | Full Log 62.5% | +2.5 pp | Five videos, 120 QA pairs | Table 1; Section 4.1 | Table 1

What To Try In 7 Days

Run YOLOv11 + tracking on a small set of long videos to extract START/END events.

Build a simple temporal graph (NetworkX) and convert a pruned subgraph to text for an off-the-shelf LLM.

Compare cost and accuracy vs sending dense frames or embeddings to your existing VLM.
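
For the second and third steps above, a rough starting point might look like the following. Everything here is an assumption for illustration rather than the paper's procedure: the event dict layout, the `prune` rule (substring anchors plus a time window), the `window` default, and the whitespace word count used as a crude stand-in for a real tokenizer.

```python
def verbalize(event):
    """Render one event as a short timestamped sentence."""
    return (f"[{event['start']:.0f}s-{event['end']:.0f}s] "
            f"{event['person']} interacts with {event['object']}")


def prune(events, query, window=30.0):
    """Keep events whose object or person label appears in the query, plus
    events that start within `window` seconds of any such anchor event."""
    q = query.lower()
    anchors = [e for e in events
               if e["object"].lower() in q or e["person"].lower() in q]
    keep = [e for e in events
            if any(abs(e["start"] - a["start"]) <= window for a in anchors)]
    return sorted(keep, key=lambda e: e["start"])


def rough_tokens(text):
    # Whitespace word count; swap in your LLM provider's token counter.
    return len(text.split())


events = [
    {"person": "person_1", "object": "cup", "start": 12.0, "end": 40.0},
    {"person": "person_1", "object": "laptop", "start": 35.0, "end": 300.0},
    {"person": "person_2", "object": "book", "start": 410.0, "end": 520.0},
    {"person": "person_1", "object": "phone", "start": 600.0, "end": 615.0},
]
query = "What did the person do right after picking up the cup?"

full_log = "\n".join(verbalize(e) for e in events)
pruned_events = prune(events, query)
pruned = "\n".join(verbalize(e) for e in pruned_events)

print(pruned)
print(rough_tokens(full_log), "->", rough_tokens(pruned))
```

Logging both counts for every query gives the per-query cost comparison directly; accuracy still has to be judged separately for each condition on the same question set.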

Agent Features

Memory
Temporal Scene Graph (symbolic event memory)

Optimization Features

Token Efficiency
91.4% token compression; 3.47k tokens per query vs 40.39k
System Optimization
Offloads visual processing to lightweight detection+tracking; symbolic bottleneck reduces LLM attention burden
Inference Optimization
Token reduction via query pruning; reduced LLM context size per query

Risks & Boundaries

Limitations

Small evaluation: only five videos and 120 auto-generated QAs, so results may not generalize.

Query grounding is string-match based and brittle to synonyms, paraphrases, and pronouns.
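
To get a feel for how much the small evaluation limits the accuracy comparison, a back-of-the-envelope interval helps. The sketch below is not from the paper: it uses a normal approximation, treats the 120 questions as independent, ignores that both conditions share the same questions, and ignores any noise added by the LLM judge.

```python
import math


def accuracy_interval(p, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy p measured on n items."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


n = 120
intervals = {}
for name, acc in [("Full Log", 0.625), ("TSG", 0.650)]:
    low, high = accuracy_interval(acc, n)
    intervals[name] = (low, high)
    print(f"{name}: {acc:.1%} (about {low:.1%} to {high:.1%})")
```

Each interval spans roughly 8 to 9 percentage points on either side of the point estimate, so the 2.5-point gap between the two conditions sits well inside the sampling noise. That is consistent with reading the result as "accuracy preserved" rather than "accuracy improved".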

When Not To Use

Tasks that require fine-grained appearance recognition (color, texture, exact identity).

Scenarios with frequent off-camera actions or missing detection coverage.

Failure Modes

Lexical brittleness: anchor string matches miss synonyms (e.g., 'mug' vs 'cup').

Missing events when actors leave frame, leading to incomplete answers.
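
The lexical failure mode is easy to reproduce. In the sketch below, `string_match_anchors` mimics substring-based grounding as described above, and `synonym_anchors` shows one possible mitigation. The `SYNONYMS` table and both function names are hypothetical; an embedding-based matcher would be another option.

```python
def string_match_anchors(query, labels):
    """Substring grounding: a label is an anchor only if it appears verbatim."""
    q = query.lower()
    return [label for label in labels if label.lower() in q]


# Hypothetical synonym table, for illustration only.
SYNONYMS = {"cup": {"mug", "glass"}, "laptop": {"computer", "notebook"}}


def synonym_anchors(query, labels, synonyms=SYNONYMS):
    """Match a label if the query contains it or any listed synonym."""
    words = set(query.lower().replace("?", " ").split())
    hits = []
    for label in labels:
        variants = {label.lower()} | synonyms.get(label.lower(), set())
        if words & variants:
            hits.append(label)
    return hits


labels = ["cup", "laptop", "book"]
query = "When did she put down the mug?"
plain = string_match_anchors(query, labels)
expanded = synonym_anchors(query, labels)
print(plain)     # []
print(expanded)  # ['cup']
```

A synonym table only addresses the vocabulary gap. The pronoun "she" in the query still has no link to a tracked person, so pronoun resolution needs separate handling.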

Core Entities

Models

YOLOv11, Gemini 2.5 Flash, NetworkX MultiDiGraph

Metrics

Accuracy, Token count, Compression ratio

Datasets

Five Creative Commons YouTube videos (10–20 min each)

Benchmarks

120 auto-generated long-horizon QA pairs (authors' evaluation)

Context Entities

Models

LLM-guided question generation (unspecified LLM)

Metrics

Accuracy

Datasets

YouTube video URLs included in paper

Benchmarks

Category breakdown: Temporal Ordering, Object Interaction, Duration Reasoning, Causal Chains