ContextAgent: a proactive LLM agent that uses wearable sensors to reason and call tools automatically

May 20, 2025 · 7 min

Overview

Production Readiness

0.55

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

1

Authors

Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan

Why It Matters For Business

Context-aware proactive assistants can act without prompts, reducing user friction and automating multi-step tasks by calling real tools; this lowers manual work and enables new hands-free services for wearables.

Summary TLDR

This paper introduces ContextAgent, a system that reads egocentric wearable sensor data (video, audio, notifications), extracts proactive-oriented context and user persona, then uses an LLM fine-tuned with explicit thought traces to decide whether to proactively act and to call external tools in sequence. The authors release ContextAgentBench (1,000 samples + 300-lite with raw sensors) and show ContextAgent improves proactive decision accuracy and tool-calling quality versus multiple baselines, with ablations on modalities, persona, and out-of-domain splits.
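The flow described above (sensor input, context extraction, a reasoning step that decides whether to act, then sequential tool calls) can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: `SensorSnapshot`, `Decision`, `extract_context`, `run_agent`, `keyword_reasoner`, and the `bus_schedule` tool are all hypothetical names, and the keyword rule merely stands in for the fine-tuned LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SensorSnapshot:
    """Hypothetical container for one moment of wearable input."""
    video_caption: str                      # e.g. text produced by a VLM
    transcript: str                         # e.g. text from a speech recognizer
    notifications: List[str] = field(default_factory=list)


@dataclass
class Decision:
    """Hypothetical reasoner output: a thought trace, then the action."""
    thought: str
    proactive: bool
    tool_calls: List[dict] = field(default_factory=list)


def extract_context(snapshot: SensorSnapshot, persona: str) -> str:
    """Flatten sensor-derived text and persona into one prompt string."""
    lines = [
        f"Persona: {persona}",
        f"Vision: {snapshot.video_caption}",
        f"Audio: {snapshot.transcript}",
    ]
    lines += [f"Notification: {n}" for n in snapshot.notifications]
    return "\n".join(lines)


def run_agent(snapshot: SensorSnapshot, persona: str,
              reasoner: Callable[[str], Decision],
              tools: Dict[str, Callable[..., str]]) -> List[str]:
    """Extract context, ask the reasoner, and run its tool calls in order."""
    decision = reasoner(extract_context(snapshot, persona))
    if not decision.proactive:
        return []  # the agent stays silent
    return [tools[c["name"]](**c["arguments"]) for c in decision.tool_calls]


def keyword_reasoner(context: str) -> Decision:
    """Stand-in for the fine-tuned LLM; a real system would call a model here."""
    if "bus stop" in context:
        return Decision(
            thought="User is waiting for a bus; arrival times would help.",
            proactive=True,
            tool_calls=[{"name": "bus_schedule", "arguments": {"stop": "Main St"}}],
        )
    return Decision(thought="Nothing actionable.", proactive=False)


TOOLS: Dict[str, Callable[..., str]] = {
    "bus_schedule": lambda stop: f"Next bus at {stop}: 5 min",  # fake tool
}

snapshot = SensorSnapshot("user standing at a bus stop", "I hope it comes soon")
print(run_agent(snapshot, "commutes by bus daily", keyword_reasoner, TOOLS))
```

Keeping the reasoner behind a plain callable makes it easy to swap the keyword rule for a real model call while leaving the context extraction and tool execution untouched.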

Problem Statement

Current proactive LLM agents either only see closed environments (e.g., desktop UIs) or use rule-based triggers. They lack open-world sensory perception and automatic tool-augmented actions, which limits real-world proactive assistance from wearable devices.

Main Contribution

Define the context-aware proactive agent task, which uses multi-modal wearable data and persona context to trigger tool-based services.

Propose ContextAgent, which combines proactive-oriented context extraction with a context-aware reasoner fine-tuned on distilled thought traces (think-before-act).

Introduce ContextAgentBench: 1,000 annotated samples across nine daily scenarios and 20 tools, plus a 300-sample 'Lite' set with raw sensor data.

Evaluate comprehensively against six baselines and 13 LLMs, with modality and ablation studies and out-of-domain tests.

Key Findings

ContextAgent raises proactive-decision accuracy and tool-calling correctness over baselines on the main benchmark.

Numbers: Acc-P +8.5%, F1 +7.0%, Acc-Args +6.0% (Llama3.1-8B base)

Vision and audio both matter; losing vision hurts most.

Numbers: without vision, Acc-P −17.9%, F1 −23.3% (max reported drops)

Persona context substantially improves proactive predictions and tool arguments.

Numbers: removing personas reduced Acc-P by up to 12.0% and Acc-Args by up to 14.3%

ContextAgent generalizes reasonably to out-of-domain (OOD) splits and can match large proprietary LLM baselines on many metrics.

Numbers: OOD Acc-P up to 90.9%, F1 68.9%, Acc-Args 51.6%; outperforms best baseline by Acc-P +8.3%, F1 +10.7%

Results

Accuracy

Value: ContextAgent 0.874 vs best baseline 0.813 (Llama3.1-8B)

Baseline: Vanilla SFT 0.813

Tool calling F1

Value: ContextAgent 0.660 vs best baseline 0.580 (Llama3.1-8B)

Baseline: Vanilla SFT 0.580

Acc-Args (correct structured tool arguments)

Value: ContextAgent 0.448 vs best baseline 0.405 (Llama3.1-8B)

Baseline: Vanilla SFT 0.405
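To make the size of each gap explicit, the short sketch below derives absolute and relative differences from the three Results entries. The values are copied from the entries above; the helper name `gaps` is just an illustration.

```python
# Values copied from the Results entries (ContextAgent, best baseline).
results = {
    "Accuracy": (0.874, 0.813),
    "Tool calling F1": (0.660, 0.580),
    "Acc-Args": (0.448, 0.405),
}


def gaps(ours: float, baseline: float):
    """Return (absolute gap, relative gap in percent), both rounded."""
    absolute = round(ours - baseline, 3)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (ours, base) in results.items():
    absolute, relative = gaps(ours, base)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

These gaps are derived only from the three entries in this section and need not coincide with the improvement figures listed under Key Findings, which may come from a different setting.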

Who Should Care

What To Try In 7 Days

Run a small prototype: collect short egocentric video and audio snippets, then extract contexts with a VLM and a speech recognizer.

Fine-tune a 7B instruction-tuned LLM on a few dozen distilled thought-trace examples (CoT SFT), then test thresholding on the proactive score.

Integrate one or two external APIs (e.g., weather, GPS, calendar) and test tool-chain correctness on a small scenario set.
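For the thresholding experiment, a minimal sketch might sweep candidate thresholds over a small labeled set and count the two error types at each setting. The 1 to 5 score scale, the sample data, and the function name `sweep` are assumptions made for illustration; the summary does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical validation data:
# (predicted proactive score on an assumed 1-5 scale, user actually needed help)
SAMPLES: List[Tuple[int, bool]] = [
    (5, True), (4, True), (3, True), (2, False),
    (1, False), (4, False), (2, True), (1, False),
]


def sweep(samples: List[Tuple[int, bool]],
          thresholds=(2, 3, 4, 5)) -> Dict[int, dict]:
    """For each threshold, count missed needs and false triggers.

    The agent acts when score >= threshold.
    """
    report = {}
    for t in thresholds:
        missed = sum(1 for score, needed in samples if needed and score < t)
        false_trigger = sum(
            1 for score, needed in samples if not needed and score >= t
        )
        correct = len(samples) - missed - false_trigger
        report[t] = {
            "missed": missed,
            "false_trigger": false_trigger,
            "accuracy": correct / len(samples),
        }
    return report


for threshold, stats in sweep(SAMPLES).items():
    print(threshold, stats)
```

A lower threshold trades missed help for more interruptions, so the sweep is mainly useful for picking a setting that matches how tolerant users are of unprompted actions.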

Agent Features

Memory

  • persona context (short-term / historical summaries)

Planning

  • think-before-act CoT reasoning
  • sequential tool-chain planning

Tool Use

  • function calling
  • multi-tool chains (20 tool types)
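As a rough illustration of function calling with multi-tool chains, the sketch below registers tools by name, checks each call's arguments against the function signature, and runs the calls in order. The tools `get_weather` and `add_calendar_event`, the call format, and the stop-on-first-error policy are hypothetical choices, not details taken from the paper.

```python
import inspect
from typing import Any, Callable, Dict, List

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call a weather API


@tool
def add_calendar_event(title: str, time: str) -> str:
    return f"Added '{title}' at {time}"  # stub


def run_chain(calls: List[dict]) -> List[dict]:
    """Execute calls in order; stop at the first unknown tool or bad arguments."""
    results = []
    for call in calls:
        fn = TOOLS.get(call["name"])
        if fn is None:
            results.append({"name": call["name"], "error": "unknown tool"})
            break
        try:
            # Signature.bind raises TypeError on missing or unexpected arguments.
            inspect.signature(fn).bind(**call["arguments"])
        except TypeError as exc:
            results.append({"name": call["name"], "error": f"bad arguments: {exc}"})
            break
        results.append({"name": call["name"], "output": fn(**call["arguments"])})
    return results


print(run_chain([
    {"name": "get_weather", "arguments": {"city": "Hong Kong"}},
    {"name": "add_calendar_event", "arguments": {"title": "Run", "time": "18:00"}},
]))
```

Validating arguments before execution means a malformed call is reported instead of triggering an external action.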

Frameworks

  • SFT
  • in-context learning (ICL) for data generation and baselines

Is Agentic

true

Architectures

  • LLM-based reasoner (fine-tuned LLM)

Optimization Features

Token Efficiency

  • Few-shot ICL baselines use 10-shot demos

Infra Optimization

  • Experiments run on 8 A6000 GPUs

Model Optimization

  • LoRA

Training Optimization

  • SFT
  • AdamW optimizer, cosine scheduler, 5 epochs
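To show what a cosine schedule over 5 epochs looks like, here is a minimal pure-Python sketch of the standard cosine decay formula. Only the epoch count comes from this summary; the peak learning rate, steps per epoch, and the absence of warmup are placeholder assumptions, and the LoRA and AdamW setup itself is not shown.

```python
import math

# "epochs" comes from the summary; every other value is a placeholder.
CONFIG = {
    "epochs": 5,
    "steps_per_epoch": 100,  # hypothetical
    "peak_lr": 2e-4,         # hypothetical
    "min_lr": 0.0,           # hypothetical
}


def cosine_lr(step: int, total_steps: int,
              peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = CONFIG["epochs"] * CONFIG["steps_per_epoch"]
for epoch in range(CONFIG["epochs"] + 1):
    step = epoch * CONFIG["steps_per_epoch"]
    lr = cosine_lr(step, total, CONFIG["peak_lr"], CONFIG["min_lr"])
    print(f"epoch boundary {epoch}: lr = {lr:.2e}")
```

In a real training run the same curve would normally come from the training framework's built-in scheduler rather than hand-written code.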

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Tool set is limited to 20 predefined APIs and may not cover every real-world need.
  • Benchmarks cover nine scenarios; more diversity is needed for broad deployment.
  • Performance depends on VLM and speech-recognition quality; zero-shot VLMs underperform the authors' in-context extraction module.
  • Privacy and consent issues arise from egocentric video/audio collection.

When Not To Use

  • When users cannot give informed consent to wearable sensor collection.
  • In safety-critical domains where automated external actions may cause harm.
  • Where real-time, high-assurance sensing is unavailable or unreliable.

Failure Modes

  • False-positive proactive triggers that annoy or interrupt users.
  • Missed detections that fail to offer timely help.
  • Incorrect tool arguments leading to wrong external actions (Acc-Args sensitivity).
  • Overreliance on noisy VLM outputs that omit proactive cues.

Core Entities

Models

  • Llama-3.1-8B-Instruct
  • Llama-3.1-70B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen2.5-72B-Instruct
  • DeepSeek-R1-7B
  • GPT-4o
  • Claude Sonnet 4

Metrics

  • Acc-P
  • MD
  • FD
  • RMSE
  • Precision
  • Recall
  • F1-score
  • Acc-Args
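Several of these metrics can be approximated in a few lines. The sketch below assumes plausible definitions that this summary does not spell out: RMSE over predicted versus annotated proactive scores, precision, recall, and F1 over the sets of tool names, and Acc-Args as a strict match of tool names plus arguments. All function names and the call format are hypothetical, and the paper's exact definitions may differ.

```python
import math
from typing import List, Tuple


def rmse(pred: List[float], gold: List[float]) -> float:
    """Root mean squared error between predicted and annotated scores.

    Assumes both lists are non-empty and of equal length.
    """
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def tool_prf(pred_tools: List[str], gold_tools: List[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over tool names for one sample."""
    pred, gold = set(pred_tools), set(gold_tools)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def args_match(pred_calls: List[dict], gold_calls: List[dict]) -> bool:
    """Strict check: same tools, same order, identical arguments."""
    return pred_calls == gold_calls


print(rmse([3, 5], [4, 3]))
print(tool_prf(["get_weather", "get_time"], ["get_weather", "add_calendar_event"]))
print(args_match(
    [{"name": "get_weather", "arguments": {"city": "Hong Kong"}}],
    [{"name": "get_weather", "arguments": {"city": "Hong Kong"}}],
))
```

The strict argument match is the harshest of the three checks, which is consistent with Acc-Args being the lowest of the reported scores.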

Datasets

  • ContextAgentBench
  • ContextAgentBench-Lite

Benchmarks

  • ContextAgentBench