Compress long videos into timestamped event graphs so LLMs can answer long-horizon questions cheaply

January 2, 2026 · 7 min

Overview

Production Readiness

0.45

Novelty Score

0.7

Cost Impact Score

0.72

Citation Count

0

Authors

Aradhya Dixit, Tianxi Liang

Why It Matters For Business

SEG cuts LLM input tokens by roughly 10× (91.4%) for long-video QA while holding accuracy steady on the paper's small benchmark, so companies can add long-horizon video reasoning without scaling up models or GPUs.

Summary TLDR

The paper introduces Semantic Event Graphs (SEG): a pipeline that turns long video frames into START/END human-object events, builds a Temporal Scene Graph (TSG), prunes the graph per query, verbalizes the subgraph, and sends it to a large multimodal LLM (Gemini 2.5 Flash). On five Creative Commons YouTube videos and 120 auto-generated long-horizon questions, SEG reduced input tokens by 91.4% (40.39k → 3.47k tokens) while matching or slightly improving QA accuracy (Full Log 62.5% vs TSG 65.0%). Results are promising but limited by a small dataset and automatic LLM-based evaluation.

Problem Statement

Long videos have thousands of frames, making direct VLM prompting costly or impossible. Existing frame sampling or dense embeddings either miss important interactions or blow up token costs. The paper asks: can we compress long videos into lightweight symbolic events that preserve order, duration, and causality so off-the-shelf LLMs can answer long-horizon questions with far fewer tokens?

Main Contribution

Semantic Event Graphs (SEG): a pipeline that extracts START/END human-object interaction events from tracked detections and assembles them into a Temporal Scene Graph (TSG).
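The START/END extraction step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an upstream detector and tracker have already produced, for each sampled frame, the set of (person, object) pairs judged to be interacting. The `Event` type, `extract_events`, and the `min_gap` debouncing parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "START" or "END"
    person: str  # tracked person id (hypothetical naming)
    obj: str     # detected object label
    t: float     # timestamp in seconds


def extract_events(frames, fps=1.0, min_gap=2):
    """Turn per-frame interaction pairs into START/END events.

    frames: list of sets of (person_id, object_label) pairs, one set per frame.
    min_gap: an interaction is closed once the pair has been missing for more
             than this many frames (assumed debouncing rule, not from the paper).
    """
    events = []
    active = {}  # pair -> index of the last frame in which the pair was seen
    for i, pairs in enumerate(frames):
        for pair in pairs:
            if pair not in active:
                events.append(Event("START", pair[0], pair[1], i / fps))
            active[pair] = i
        # Close interactions that have been absent for more than min_gap frames.
        for pair, last in list(active.items()):
            if i - last > min_gap:
                events.append(Event("END", pair[0], pair[1], (last + 1) / fps))
                del active[pair]
    # Close anything still open when the video ends.
    for pair, last in active.items():
        events.append(Event("END", pair[0], pair[1], (last + 1) / fps))
    return sorted(events, key=lambda e: (e.t, e.kind, e.person, e.obj))


frames = [
    set(),
    {("person_1", "cup")},
    {("person_1", "cup")},
    set(),
    set(),
    set(),
    {("person_1", "laptop")},
]
events = extract_events(frames, fps=1.0, min_gap=2)
for e in events:
    print(f"{e.t:6.1f}s {e.kind:5s} {e.person} {e.obj}")
```

Seven frames collapse into four events here, which is the source of the compression: the log grows with the number of interactions, not with the number of frames.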

Query-aware pruning that selects a small, relevant subgraph and verbalizes it as a compact narrative for an LLM.
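A rough sketch of pruning and verbalization, under stated assumptions: the graph is stored as a plain list of timestamped edges (a standard-library stand-in for the NetworkX MultiDiGraph the paper lists), anchors are found by substring match against the question, and pruning keeps only edges that touch an anchor (1-hop). The edge data, `prune`, and `verbalize` are illustrative names, not the authors' API.

```python
# Each edge: (subject, relation, object, start_s, end_s).
# Parallel edges between the same two nodes are allowed, as in a multigraph.
EDGES = [
    ("person_1", "holds", "cup", 12.0, 45.0),
    ("person_1", "uses", "laptop", 60.0, 300.0),
    ("person_2", "holds", "phone", 90.0, 120.0),
    ("person_1", "holds", "cup", 310.0, 330.0),
]


def prune(edges, query):
    """Keep edges touching any node whose label appears in the query text."""
    q = query.lower()
    nodes = {n for s, _, o, _, _ in edges for n in (s, o)}
    anchors = {n for n in nodes if n.lower() in q}
    return [e for e in edges if e[0] in anchors or e[2] in anchors]


def verbalize(edges):
    """Render a subgraph as a short, time-ordered narrative for the LLM prompt."""
    ordered = sorted(edges, key=lambda e: e[3])
    return "\n".join(
        f"[{t0:.0f}s-{t1:.0f}s] {s} {rel} {o}" for s, rel, o, t0, t1 in ordered
    )


narrative = verbalize(prune(EDGES, "How long did someone hold the cup in total?"))
print(narrative)
```

Only the two cup intervals reach the prompt; the laptop and phone edges are dropped because no node they touch is mentioned in the question.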

Empirical demonstration: 91.4% average token reduction with equal-or-better QA accuracy on a 5-video, 120-question benchmark using Gemini 2.5 Flash.

A small public dataset of five long-form Creative Commons YouTube videos plus 120 auto-generated QA pairs; authors state code and tools will be released.

Key Findings

SEG cuts token input by 91.4% on average.

Numbers: Full Log 40.39k tokens → TSG 3.47k tokens (91.4% reduction)
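The reduction figure can be checked from the two reported averages. The snippet below only redoes the arithmetic; the variable names are illustrative.

```python
full_log_tokens = 40_390  # 40.39k average input tokens, as reported
tsg_tokens = 3_470        # 3.47k average input tokens, as reported

reduction = 1 - tsg_tokens / full_log_tokens  # fraction of tokens removed
ratio = full_log_tokens / tsg_tokens          # compression factor

print(f"reduction: {reduction:.1%}")
print(f"compression: {ratio:.1f}x")
```

The two averages give a 91.4% reduction, or about 11.6× fewer input tokens, which is consistent with the "~10×" headline figure.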

TSG matches or slightly improves QA accuracy compared to full unpruned logs.

Numbers: Full Log 62.5% accuracy → TSG 65.0% accuracy

Short-context (last 30s) baseline fails for long-horizon queries.

Numbers: Short-context accuracy 2.5% vs TSG 65.0%
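To see why a short-context baseline struggles, consider a sketch that keeps only interactions overlapping the final 30 seconds. This assumes "last 30s" means the last 30 seconds of the video's event log; the interval data and the `last_window` helper are made up for illustration.

```python
def last_window(intervals, video_end_s, window_s=30.0):
    """Keep only interaction intervals that overlap the final window of the video."""
    cutoff = video_end_s - window_s
    return [iv for iv in intervals if iv[4] >= cutoff]


# (subject, relation, object, start_s, end_s) for a hypothetical 10-minute video
intervals = [
    ("person_1", "holds", "cup", 12.0, 45.0),
    ("person_1", "uses", "laptop", 60.0, 300.0),
    ("person_2", "holds", "phone", 580.0, 595.0),
]
kept = last_window(intervals, video_end_s=600.0)
print(kept)
```

Anything that happened before the cutoff is simply absent from the context, so a question about the cup at 12s cannot be answered from what remains.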

Results

Average tokens per query: 3.47k (baseline: Full Log 40.39k)

Accuracy: 65.0% (baseline: Full Log 62.5%)

Accuracy: 2.5% (baseline: TSG 65.0%)

Accuracy: 69% (TSG) (baseline: Full Log 65%)

Who Should Care

What To Try In 7 Days

Run YOLOv11 + tracking on a small set of long videos to extract START/END events.

Build a simple temporal graph (NetworkX) and convert a pruned subgraph to text for an off-the-shelf LLM.

Compare cost and accuracy vs sending dense frames or embeddings to your existing VLM.
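For the comparison step, a small harness like the one below may be enough. It is a sketch under assumptions: `RunResult` and `summarize` are hypothetical names, the per-token price is a parameter you supply, and the numbers in the example are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    name: str
    input_tokens: list  # input tokens sent to the model, one entry per question
    correct: list       # True/False judgement, one entry per question


def summarize(run, usd_per_million_input_tokens):
    """Aggregate accuracy, mean input tokens, and input cost per question."""
    n = len(run.correct)
    mean_tokens = sum(run.input_tokens) / n
    return {
        "name": run.name,
        "accuracy": sum(run.correct) / n,
        "mean_tokens": mean_tokens,
        "usd_per_question": mean_tokens * usd_per_million_input_tokens / 1_000_000,
    }


# Placeholder inputs only, NOT results from the paper or from any real run.
dense = RunResult("dense", [40000, 42000, 38000, 40000], [True, False, True, False])
graph = RunResult("graph", [3000, 4000, 3500, 3500], [True, True, False, False])

price = 1.00  # hypothetical USD per million input tokens; substitute your model's rate
dense_summary = summarize(dense, price)
graph_summary = summarize(graph, price)
for row in (dense_summary, graph_summary):
    print(row)
```

Logging tokens and correctness per question, rather than only the averages, makes it easier to spot which question types lose accuracy when the context is pruned.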

Agent Features

Memory

  • Temporal Scene Graph (symbolic event memory)

Optimization Features

Token Efficiency

  • 91.4% token compression
  • 3.47k tokens per query vs 40.39k

System Optimization

  • Offloads visual processing to lightweight detection+tracking
  • Symbolic bottleneck reduces LLM attention burden

Inference Optimization

  • Token reduction via query pruning
  • Reduced LLM context size per query

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small evaluation: only five videos and 120 auto-generated QAs, so results may not generalize.
  • Query grounding is string-match based and brittle to synonyms, paraphrases, and pronouns.
  • No visual grounding for appearance attributes (color, fine identity, pose).
  • Off-camera actions are unrecorded, causing gaps in the event log.
  • Evaluation uses the same model family (Gemini) as answerer and judge, introducing potential bias.

When Not To Use

  • Tasks that require fine-grained appearance recognition (color, texture, exact identity).
  • Scenarios with frequent off-camera actions or missing detection coverage.
  • Applications needing verified human-evaluated correctness at scale without model-based auto-judging.

Failure Modes

  • Lexical brittleness: anchor string matches miss synonyms (e.g., 'mug' vs 'cup').
  • Missing events when actors leave frame, leading to incomplete answers.
  • Complex chains involving many entities can be hard to reconstruct from 1-hop pruning.
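The lexical brittleness can be reproduced in a few lines. The sketch below assumes anchors are found by plain substring match, as the limitation describes; the optional `synonyms` map is a hypothetical mitigation added here for illustration and is not something the paper proposes.

```python
def anchors(query, node_labels, synonyms=None):
    """Return graph node labels mentioned in the query by substring match."""
    q = query.lower()
    synonyms = synonyms or {}
    found = set()
    for label in node_labels:
        # A label matches if it, or any listed synonym, appears in the query.
        names = [label] + synonyms.get(label, [])
        if any(name.lower() in q for name in names):
            found.add(label)
    return found


labels = ["cup", "laptop", "person_1"]
question = "Who picked up the mug?"

strict = anchors(question, labels)                     # plain string match
relaxed = anchors(question, labels, {"cup": ["mug"]})  # with a hand-written synonym map

print(strict, relaxed)
```

With plain matching no anchor is found, so pruning would return an empty subgraph even though the graph contains the relevant cup events. Substring matching also has the opposite problem: a label such as "cup" would match inside "cupboard".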

Core Entities

Models

  • YOLOv11
  • Gemini 2.5 Flash
  • NetworkX MultiDiGraph

Metrics

  • Accuracy
  • Token count
  • Compression ratio

Datasets

  • Five Creative Commons YouTube videos (10–20 min each)

Benchmarks

  • 120 auto-generated long-horizon QA pairs (authors' evaluation)

Context Entities

Models

  • LLM-guided question generation (unspecified LLM)

Metrics

  • Accuracy

Datasets

  • YouTube video URLs included in paper

Benchmarks

  • Category breakdown: Temporal Ordering, Object Interaction, Duration Reasoning, Causal Chains