Overview
Production Readiness
0.55
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Context-aware proactive assistants can act without prompts, reducing user friction and automating multi-step tasks by calling real tools; this lowers manual work and enables new hands-free services for wearables.
Summary TLDR
This paper introduces ContextAgent, a system that reads egocentric wearable sensor data (video, audio, notifications), extracts proactive-oriented context and a user persona, then uses an LLM fine-tuned with explicit thought traces to decide whether to act proactively and which external tools to call in sequence. The authors release ContextAgentBench (1,000 samples, plus a 300-sample Lite subset with raw sensor data) and show that ContextAgent improves proactive decision accuracy and tool-calling quality over multiple baselines, with ablations on modalities and persona and with out-of-domain splits.
Problem Statement
Current proactive LLM agents either observe only closed environments (e.g., desktop UIs) or rely on rule-based triggers. They lack open-world sensory perception and automatic tool-augmented actions, which limits real-world proactive assistance from wearable devices.
Main Contribution
Define a context-aware proactive agent task that uses multi-modal wearable data and persona context to trigger tool-based services.
Propose ContextAgent: proactive-oriented context extraction plus a context-aware reasoner fine-tuned with distilled thought traces (think-before-act).
Introduce ContextAgentBench: 1,000 annotated samples across nine daily scenarios and 20 tools, plus a 300-sample 'Lite' set with raw sensor data.
Evaluate comprehensively against six baselines and 13 LLMs, with modality studies, ablations, and out-of-domain tests.
Key Findings
ContextAgent raises proactive-decision accuracy and tool-calling correctness over baselines on the main benchmark.
Vision and audio both matter; losing vision hurts most.
Persona context substantially improves proactive predictions and tool arguments.
ContextAgent generalizes reasonably on OOD splits and can match large proprietary LLM baselines on many metrics.
Results
Accuracy
Tool calling F1
Acc-Args (correct structured tool arguments)
Who Should Care
What To Try In 7 Days
Run a small prototype: collect short egocentric video+audio snippets and extract contexts with a VLM + speech recognizer.
Fine-tune a 7B instruction LLM with a few dozen distilled thought-trace examples (CoT SFT) and test proactive score thresholding.
Integrate one or two external APIs (e.g., weather, GPS, calendar) and test tool-chain correctness on a small scenario set.
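The proactive score thresholding mentioned above can be prototyped in a few lines. This is a minimal sketch, not the authors' implementation: it assumes the reasoner emits an integer proactive score, and the `ReasonerOutput` container, the 1 to 5 score range, and the default threshold of 3 are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReasonerOutput:
    """Hypothetical container for one reasoner prediction (not from the paper)."""
    thought: str           # the think-before-act trace
    proactive_score: int   # assumed integer score, higher = more need to act
    tool_calls: list       # planned tool calls, empty if no action is planned


def should_act(output: ReasonerOutput, threshold: int = 3) -> bool:
    """Act proactively only when the score reaches the (assumed) threshold."""
    return output.proactive_score >= threshold


def sweep_thresholds(outputs, labels, thresholds=range(1, 6)):
    """Return {threshold: accuracy} against boolean ground-truth labels."""
    results = {}
    for t in thresholds:
        correct = sum(should_act(o, t) == y for o, y in zip(outputs, labels))
        results[t] = correct / len(labels)
    return results
```

Sweeping the threshold on a small labeled scenario set is one way to see the trade-off between false-positive triggers and missed detections before committing to a value.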
Agent Features
Memory
- persona context (short-term / historical summaries)
Planning
- think-before-act CoT reasoning
- sequential tool-chain planning
Tool Use
- function calling
- multi-tool chains (20 tool types)
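As a rough illustration of sequential tool-chain execution, the sketch below dispatches a list of planned calls through a registry. The two tools are hypothetical stubs, and the call format (`name` plus `arguments`) is an assumption for illustration, not the schema used in ContextAgentBench.

```python
from typing import Any, Callable


def get_weather(city: str) -> str:
    """Hypothetical stub; a real prototype would call a weather API here."""
    return f"weather for {city}: (stub)"


def check_calendar(date: str) -> str:
    """Hypothetical stub; a real prototype would query a calendar service."""
    return f"events on {date}: (stub)"


# Registry mapping tool names to callables (names are illustrative only).
TOOLS: dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
    "check_calendar": check_calendar,
}


def run_tool_chain(calls: list[dict]) -> list[dict]:
    """Execute planned calls in order, recording failures instead of raising."""
    results = []
    for call in calls:
        name = call["name"]
        args = call.get("arguments", {})
        if name not in TOOLS:
            results.append({"name": name, "ok": False, "error": "unknown tool"})
            continue
        try:
            results.append({"name": name, "ok": True, "result": TOOLS[name](**args)})
        except TypeError as exc:
            # Wrong or missing argument names surface here as a TypeError.
            results.append({"name": name, "ok": False, "error": str(exc)})
    return results
```

Recording argument errors per call, rather than aborting the chain, makes it easier to count how often the model produces a correct tool name with incorrect arguments.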
Frameworks
- SFT
- in-context learning (ICL) for data gen and baselines
Is Agentic
true
Architectures
- LLM-based reasoner (fine-tuned LLM)
Optimization Features
Token Efficiency
- Few-shot ICL baselines use 10-shot demos
Infra Optimization
- Experiments run on 8 A6000 GPUs
Model Optimization
- LoRA
Training Optimization
- SFT
- AdamW optimizer, cosine scheduler, 5 epochs
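For readers unfamiliar with the cosine scheduler listed above, the following sketch shows the shape of a cosine learning-rate decay in plain Python. The peak learning rate, the optional linear warmup, and the minimum rate are assumptions for illustration; the summary does not state the values used in the paper.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-4,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr, with optional linear warmup.

    All default values are hypothetical, not taken from the paper.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp up to peak_lr during warmup.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# With 5 epochs, total_steps would be 5 * steps_per_epoch (dataset-dependent).
```

The rate starts at the peak, passes through half the peak at the midpoint of training, and reaches the minimum at the final step.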
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Tool set is limited to 20 predefined APIs and may not cover every real-world need.
- Benchmarks cover nine scenarios; more diversity is needed for broad deployment.
- Performance depends on VLM and speech-recognition quality; zero-shot VLMs underperform the authors' in-context extraction module.
- Privacy and consent issues arise from egocentric video/audio collection.
When Not To Use
- When users cannot give informed consent to wearable sensor collection.
- In safety-critical domains where automated external actions may cause harm.
- Where real-time, high-assurance sensing is unavailable or unreliable.
Failure Modes
- False-positive proactive triggers that annoy or interrupt users.
- Missed detections that fail to offer timely help.
- Incorrect tool arguments leading to wrong external actions (Acc-Args sensitivity).
- Overreliance on noisy VLM outputs that omit proactive cues.
Core Entities
Models
- Llama-3.1-8B-Instruct
- Llama-3.1-70B-Instruct
- Qwen2.5-7B-Instruct
- Qwen2.5-72B-Instruct
- DeepSeek-R1-7B
- GPT-4o
- Claude Sonnet 4
Metrics
- Acc-P
- MD
- FD
- RMSE
- Precision
- Recall
- F1-score
- Acc-Args
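The proactive and tool-calling metrics above can be approximated as follows. This is a sketch under stated assumptions: MD and FD are read as missed-detection and false-detection rates normalized over all samples, decisions are derived by thresholding scores at an assumed value of 3, and tool F1 is computed over sets of tool names. The paper's exact definitions may differ, and Acc-Args is omitted because its scoring procedure is not described in this summary.

```python
import math


def proactive_metrics(pred_scores, true_scores, threshold=3):
    """Compute Acc-P, MD, FD, and RMSE from predicted and reference scores.

    The threshold and the normalization over all samples are assumptions.
    """
    n = len(true_scores)
    pred = [s >= threshold for s in pred_scores]
    true = [s >= threshold for s in true_scores]
    acc_p = sum(p == t for p, t in zip(pred, true)) / n
    md = sum((not p) and t for p, t in zip(pred, true)) / n   # missed detections
    fd = sum(p and (not t) for p, t in zip(pred, true)) / n   # false detections
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_scores, true_scores)) / n)
    return {"Acc-P": acc_p, "MD": md, "FD": fd, "RMSE": rmse}


def tool_f1(pred_tools, true_tools):
    """Set-based precision, recall, and F1 over tool names for one sample."""
    pred_set, true_set = set(pred_tools), set(true_tools)
    tp = len(pred_set & true_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Reporting MD and FD separately matters because the two error types map to different failure modes: missed help versus unwanted interruptions.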
Datasets
- ContextAgentBench
- ContextAgentBench-Lite
Benchmarks
- ContextAgentBench

