ContextAgent: a proactive LLM agent that uses wearable sensors to reason and call tools automatically

May 20, 20257 min

Overview

Decision SnapshotNeeds Validation

The system is practically promising: improvements are clear on the released benchmark and ablations validate design choices, but real-world deployment needs more tools, privacy controls, and broader scenario coverage.

Citations1

Evidence Strength0.75

Confidence0.90

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 55%

Novelty: 70%

Authors

Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Context-aware proactive assistants can act without prompts, reducing user friction and automating multi-step tasks by calling real tools; this lowers manual work and enables new hands-free services for wearables.

Who Should Care

Summary TLDR

This paper introduces ContextAgent, a system that reads egocentric wearable sensor data (video, audio, notifications), extracts proactive-oriented context and user persona, then uses an LLM fine-tuned with explicit thought traces to decide whether to proactively act and to call external tools in sequence. The authors release ContextAgentBench (1,000 samples + 300-lite with raw sensors) and show ContextAgent improves proactive decision accuracy and tool-calling quality versus multiple baselines, with ablations on modalities, persona, and out-of-domain splits.

Problem Statement

Current proactive LLM agents either only see closed environments (e.g., desktop UIs) or use rule-based triggers. They lack open-world sensory perception and automatic tool-augmented actions, which limits real-world proactive assistance from wearable devices.

Main Contribution

Define context-aware proactive agent task that uses multi-modal wearable data and persona context to trigger tool-based services.

Propose ContextAgent: proactive-oriented context extraction + context-aware reasoner that is fine-tuned with distilled thought traces (think-before-act).

Key Findings

ContextAgent raises proactive-decision accuracy and tool-calling correctness over baselines on the main benchmark.

NumbersAcc-P +8.5%, F1 +7.0%, Acc-Args +6.0% (Llama3.1-8B base)

Practical UseIf you fine-tune a medium-sized LLM with ContextAgent components, expect single-digit to low-double-digit gains in deciding when to act and in correct tool calls on their benchmark.

Evidence RefAbstract; Sec.5.2; Tab.1

Vision and audio both matter; losing vision hurts most.

Numbersw/o vision: Acc-P −17.9%, F1 −23.3% (max reported drops)

Practical UseFor reliable proactive behavior keep egocentric video; audio helps but vision loss produces the largest performance drop.

Evidence RefSec.5.3; Table 3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
AccuracyContextAgent 0.874 vs best baseline 0.813 (Llama3.1-8B)Vanilla SFT 0.813+0.061ContextAgentBenchTab.1; Sec.5.2Tab.1
Tool calling F1ContextAgent 0.660 vs best baseline 0.580 (Llama3.1-8B)Vanilla SFT 0.580+0.080ContextAgentBenchTab.1; Sec.5.2Tab.1

What To Try In 7 Days

Run a small prototype: collect short egocentric video+audio snippets and extract contexts with a VLM + speech recognizer.

Fine-tune a 7B instruction LLM with a few dozen distilled thought-trace examples (CoT SFT) and test proactive score thresholding.

Integrate one or two external APIs (e.g., weather, GPS, calendar) and test tool-chain correctness on a small scenario set.

Agent Features

Memory
persona context (short-term / historical summaries)
Planning
think-before-act CoT reasoningsequential tool-chain planning
Tool Use
function callingmulti-tool chains (20 tool types)
Frameworks
SFTin-context learning (ICL) for data gen and baselines
Is Agentic

Yes

Architectures
LLM-based reasoner (fine-tuned LLM)

Optimization Features

Token Efficiency
Few-shot ICL baselines use 10-shot demos
Infra Optimization
Experiments run on 8 A6000 GPUs
Model Optimization
LoRA
Training Optimization
SFTAdamW optimizer, cosine scheduler, 5 epochs

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Tool set is limited to 20 predefined APIs and may not cover every real-world need.

Benchmarks cover nine scenarios; more diversity is needed for broad deployment.

When Not To Use

When users cannot give informed consent to wearable sensor collection.

In safety-critical domains where automated external actions may cause harm.

Failure Modes

False-positive proactive triggers that annoy or interrupt users.

Missed detections that fail to offer timely help.

Core Entities

Models

Llama-3.1-8B-InstructLlama-3.1-70B-InstructQwen2.5-7B-InstructQwen2.5-72B-InstructDeepSeek-R1-7BGPT-4oClaude Sonnet 4

Metrics

Acc-PMDFDRMSEPrecisionRecallF1-scoreAcc-Args

Datasets

ContextAgentBenchContextAgentBench-Lite

Benchmarks

ContextAgentBench