Compress long videos into timestamped event graphs so LLMs can answer long-horizon questions cheaply

Overview

Decision Snapshot: Needs Validation

SEG is a practical, low-cost interface for long-video QA, but evidence is limited to five videos and automatic LLM judging, so expect engineering work and broader evaluation before production deployment.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 72%

Production readiness: 45%

Novelty: 70%

Authors

Aradhya Dixit, Tianxi Liang


Why It Matters For Business

SEG cuts LLM input tokens by roughly 10× for long-video QA (91.4% fewer tokens in the paper's evaluation) without losing accuracy on the five test videos, letting companies add long-horizon video reasoning without expensive model or GPU scaling.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The paper introduces Semantic Event Graphs (SEG): a pipeline that converts the frames of a long video into START/END human-object events, builds a Temporal Scene Graph (TSG), prunes the graph per query, verbalizes the resulting subgraph, and sends that text to a large multimodal LLM (Gemini 2.5 Flash). On five Creative Commons YouTube videos and 120 auto-generated long-horizon questions, SEG reduced input tokens by 91.4% (40.39k → 3.47k tokens per query) while matching or slightly improving QA accuracy (Full Log 62.5% vs TSG 65.0%). Results are promising but limited by a small dataset and automatic LLM-based evaluation.

Problem Statement

Long videos have thousands of frames, making direct VLM prompting costly or impossible. Existing frame sampling or dense embeddings either miss important interactions or blow up token costs. The paper asks: can we compress long videos into lightweight symbolic events that preserve order, duration, and causality so off-the-shelf LLMs can answer long-horizon questions with far fewer tokens?

Main Contribution

Semantic Event Graphs (SEG): a pipeline that extracts START/END human-object interaction events from tracked detections and assembles them into a Temporal Scene Graph (TSG).

Query-aware pruning that selects a small, relevant subgraph and verbalizes it as a compact narrative for an LLM.
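
The first contribution can be illustrated with a minimal sketch of START/END event extraction. It assumes detection and tracking have already run upstream and that each frame arrives as a set of (actor, object) pairs judged to be interacting; this summary does not give the paper's actual interaction test. The `Event` class, `extract_events` function, and `max_gap` debounce parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str   # "START" or "END"
    actor: str  # track id of the person, e.g. "person_1"
    obj: str    # label of the object involved
    t: float    # timestamp in seconds


def extract_events(frames, fps=30.0, max_gap=15):
    """Turn per-frame interaction sets into START/END events.

    frames: iterable of (frame_index, set of (actor, obj) pairs active
    in that frame). An interaction is closed once it has been absent
    for more than max_gap frames (an assumed debounce rule).
    """
    events = []
    last_seen = {}  # (actor, obj) -> last frame index where it was active
    for idx, active in frames:
        # Close interactions that have been absent for too long.
        stale = [p for p, last in last_seen.items() if idx - last > max_gap]
        for pair in stale:
            events.append(Event("END", pair[0], pair[1], last_seen.pop(pair) / fps))
        # Open new interactions and refresh the ones still active.
        for pair in sorted(active):
            if pair not in last_seen:
                events.append(Event("START", pair[0], pair[1], idx / fps))
            last_seen[pair] = idx
    # Close whatever is still open when the video ends.
    for pair, last in last_seen.items():
        events.append(Event("END", pair[0], pair[1], last / fps))
    events.sort(key=lambda e: e.t)
    return events


# Toy input: one frame per second, made-up interactions.
cup = ("person_1", "cup")
laptop = ("person_1", "laptop")
frames = [(0, {cup}), (1, {cup}), (2, {cup}), (3, set()),
          (4, set()), (5, {laptop}), (6, {laptop})]
events = extract_events(frames, fps=1.0, max_gap=1)
for e in events:
    print(e.kind, e.actor, e.obj, e.t)
```

Seven frames collapse into four events here, which is the source of the compression: the event count grows with the number of interactions, not with the number of frames.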

Key Findings

SEG cuts token input by 91.4% on average.

Numbers: Tokens: Full Log 40.39k → TSG 3.47k (91.4% reduction)

Practical Use: Use SEG to reduce LLM token cost by an order of magnitude when answering long-video questions.

Evidence Ref: Abstract; Table 1; Table 2

TSG matches or slightly improves QA accuracy compared to full unpruned logs.

Numbers: Accuracy: Full Log 62.5% → TSG 65.0%

Practical Use: Compression via symbolic events can preserve reasoning quality while saving cost; try TSG before scaling model context.

Evidence Ref: Table 1; Section 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average tokens per query	3.47k	Full Log 40.39k	-91.4%	Five videos, 120 QA pairs	Table 1; Table 2	Section 4; Tables 1–2
Accuracy	65.0%	Full Log 62.5%	+2.5 pp	Five videos, 120 QA pairs	Table 1; Section 4.1	Table 1
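
The two deltas in the table can be re-derived from the reported figures. The short check below uses only the numbers quoted above; the conversion of accuracy into question counts assumes both conditions were scored on all 120 QA pairs, which this summary implies but does not state outright.

```python
# Reported averages (tokens per query) from the table above.
full_log_tokens = 40_390  # 40.39k
tsg_tokens = 3_470        # 3.47k

reduction = 1 - tsg_tokens / full_log_tokens  # fraction of tokens removed
ratio = full_log_tokens / tsg_tokens          # compression factor

# Accuracy expressed as question counts, assuming n = 120 for both systems.
n_questions = 120
correct_full = round(0.625 * n_questions)
correct_tsg = round(0.650 * n_questions)

print(f"token reduction: {reduction:.1%}, about {ratio:.1f}x fewer tokens")
print(f"correct answers: {correct_full} vs {correct_tsg} of {n_questions}")
```

Under that assumption the +2.5 pp accuracy gain corresponds to three additional correct answers out of 120, which is why the result is better read as "no loss" than as a clear improvement.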

What To Try In 7 Days

Run YOLOv11 + tracking on a small set of long videos to extract START/END events.

Build a simple temporal graph (NetworkX) and convert a pruned subgraph to text for an off-the-shelf LLM.

Compare cost and accuracy vs sending dense frames or embeddings to your existing VLM.
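
The second step above can be sketched as follows. To stay dependency-free, this version uses a plain dictionary as a stand-in for the NetworkX graph named in the text; the interval tuples, function names, and output sentence template are all hypothetical. Pruning is done by exact word match against entity names, mirroring the string-match grounding described under Limitations.

```python
import re
from collections import defaultdict


def build_graph(intervals):
    """Map each entity to the interaction edges that touch it.

    intervals: list of (actor, obj, start_s, end_s) tuples.
    """
    graph = defaultdict(list)
    for edge in intervals:
        actor, obj, _, _ = edge
        graph[actor].append(edge)
        graph[obj].append(edge)
    return graph


def prune(graph, question):
    """Keep only edges whose actor or object is named in the question."""
    words = set(re.findall(r"[a-z0-9_]+", question.lower()))
    anchors = [node for node in graph if node.lower() in words]
    kept = {edge for node in anchors for edge in graph[node]}
    return sorted(kept, key=lambda e: e[2])  # chronological order


def fmt(seconds):
    """Format seconds as m:ss."""
    return f"{int(seconds) // 60}:{int(seconds) % 60:02d}"


def verbalize(edges):
    """Render the pruned edges as one short sentence per interaction."""
    return "\n".join(
        f"{fmt(s)}-{fmt(e)}: {a} interacts with {o}." for a, o, s, e in edges
    )


# Made-up intervals for illustration only.
intervals = [
    ("person_1", "cup", 12.0, 47.0),
    ("person_1", "laptop", 60.0, 600.0),
    ("person_2", "cup", 610.0, 640.0),
    ("person_2", "phone", 700.0, 720.0),
]
graph = build_graph(intervals)
kept = prune(graph, "How long did someone hold the cup?")
narrative = verbalize(kept)
print(narrative)
```

The `narrative` string is what would be placed in the LLM prompt alongside the question. Comparing its token count with that of the full verbalized log gives the cost side of the comparison in the third step.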

Agent Features

Memory

Temporal Scene Graph (symbolic event memory)

Optimization Features

Token Efficiency

91.4% token compression

3.47k tokens per query vs 40.39k

System Optimization

Offloads visual processing to lightweight detection+tracking

Symbolic bottleneck reduces LLM attention burden

Inference Optimization

Token reduction via query pruning

Reduced LLM context size per query

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.youtube.com/watch?v=Wteauo6RlpE

https://www.youtube.com/watch?v=9muGWhn1shw

https://www.youtube.com/watch?v=35vY_c6h23I

https://www.youtube.com/watch?v=NGDjqka3MAw

https://www.youtube.com/watch?v=phLHLJISaoE

Risks & Boundaries

Limitations

Small evaluation: only five videos and 120 auto-generated QA pairs, so results may not generalize.

Query grounding is string-match based and brittle to synonyms, paraphrases, and pronouns.

When Not To Use

Tasks that require fine-grained appearance recognition (color, texture, exact identity).

Scenarios with frequent off-camera actions or missing detection coverage.

Failure Modes

Lexical brittleness: anchor string matches miss synonyms (e.g., 'mug' vs 'cup').

Missing events when actors leave frame, leading to incomplete answers.
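
The first failure mode is easy to reproduce. The sketch below contrasts exact string matching with a lookup through a small alias table. The alias table is a hypothetical mitigation added here for illustration, not something this summary attributes to the paper, and the label set and question are made up.

```python
import re

# Hypothetical, hand-written synonym table mapping query words to detector labels.
ALIASES = {"mug": "cup", "notebook": "laptop"}


def anchors_exact(question, labels):
    """Ground a question by exact word match against detector labels."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if w in labels}


def anchors_with_aliases(question, labels, aliases=ALIASES):
    """Same grounding, but map known synonyms to their label first."""
    words = re.findall(r"[a-z]+", question.lower())
    return {aliases.get(w, w) for w in words} & set(labels)


labels = {"cup", "laptop", "person"}
question = "When did she put down the mug?"

exact = anchors_exact(question, labels)
aliased = anchors_with_aliases(question, labels)
print(exact, aliased)
```

Exact matching finds no anchor for this question, so pruning would return an empty subgraph. The alias lookup recovers "cup", but the pronoun "she" still resolves to nothing, so a synonym table addresses only part of the brittleness.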

Core Entities

Models

YOLOv11

Gemini 2.5 Flash

NetworkX MultiDiGraph

Metrics

Accuracy

Token count

Compression ratio

Datasets

Five Creative Commons YouTube videos (10–20 min each)

Benchmarks

120 auto-generated long-horizon QA pairs (authors' evaluation)

Context Entities

Models

LLM-guided question generation (unspecified LLM)

Metrics

Accuracy

Datasets

YouTube video URLs included in paper

Benchmarks

Category breakdown: Temporal Ordering, Object Interaction, Duration Reasoning, Causal Chains

