Overview
Production Readiness
0.5
Novelty Score
0.65
Cost Impact Score
0.4
Citation Count
7
Why It Matters For Business
Agentic LLMs that run verified code and fetch trusted web facts can unlock personalized insights from wearable data, improving product value for health apps while reducing numeric errors and buggy analyses.
Summary TLDR
The authors build PHIA, an LLM-driven agent that uses iterative planning, Python code execution, and web search to analyze wearable time-series and answer personal health queries. They release synthetic wearable users and 4,000+ query examples. In automatic tests, PHIA reaches 84% accuracy on objective numeric questions. In human ratings of open-ended queries, it scores higher than a strong code-generation baseline (68 vs. 52 on the 0–100 scaled score), produces fewer code errors, and can recover from mistakes. The system is promising for personalized, data-driven wellness, but it is not validated for clinical outcomes and should not be used for diagnosis.
Problem Statement
Wearable devices collect detailed time-series data, but current LLMs struggle to do correct numerical and contextual reasoning on raw wearable data. Users want personalized, actionable insights (e.g., does exercise improve my sleep), which require multi-step numeric analysis, time indexing, and domain knowledge that single-pass LLM responses often fail to provide.
Main Contribution
Introduce PHIA: an agentic framework that combines iterative LLM planning (ReAct), code generation (Python/Pandas), and web search to analyze wearable time-series
Release evaluation data: 4,000 objective queries, 172 human-evaluated open-ended queries, and 56 synthetic wearable users (4 used in eval) derived from 30k anonymized users
Run a large-scale evaluation: 650 human-hours of annotator and expert review plus automatic evaluation, showing that PHIA outperforms non-agent baselines on numeric and open-ended tasks
Key Findings
PHIA answers objective numeric wearable queries with high accuracy
PHIA gives better open-ended reasoning than a code-only baseline
PHIA generates fewer code errors and can recover from some failures
Web search improves domain knowledge ratings
Harmful outputs were rare under the evaluated system and review setup
Results
Accuracy
Open-ended overall reasoning score (PHIA)
Open-ended favorable responses
Code error rate
Agent recovery rate after fatal error
Avoidance of harmful outputs
Who Should Care
What To Try In 7 Days
Prototype a small agent pipeline: few-shot LLM + sandboxed Python (Pandas) to answer 50 objective wearable queries
Add a vetted web-search step for recommendations and compare ratings from 3 domain reviewers
Track code error rate and implement an observe-and-repair step to cut failed analyses
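A minimal sketch of the observe-and-repair idea in the steps above, under stated assumptions: `run_snippet`, `answer_with_repair`, and `stub_generate` are hypothetical names, the LLM is replaced by a stub that first references a non-existent column and then corrects itself, and plain `exec` stands in for a real sandbox (it is not one). It is meant to show the control flow and error counting, not the authors' implementation.

```python
def run_snippet(code, data):
    """Execute generated code; return (result, error_text).

    Assumption: the snippet stores its answer in a variable named `result`.
    NOTE: exec() is NOT a sandbox. Use a separate process or container
    for untrusted code in any real prototype.
    """
    namespace = {"data": data}
    try:
        exec(code, namespace)
        return namespace.get("result"), None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"


def answer_with_repair(question, data, generate, max_attempts=3):
    """Retry code generation, feeding the last error back to the generator."""
    errors = 0
    feedback = None
    for _ in range(max_attempts):
        code = generate(question, feedback)
        result, error = run_snippet(code, data)
        if error is None:
            return {"result": result, "errors": errors, "recovered": errors > 0}
        errors += 1
        feedback = error  # the "observe" step: show the error to the generator
    return {"result": None, "errors": errors, "recovered": False}


def stub_generate(question, feedback):
    """Hypothetical stand-in for an LLM call."""
    if feedback is None:
        # First attempt references a column that does not exist.
        return "result = sum(data['step_count']) / len(data['step_count'])"
    # Second attempt, after seeing the KeyError, uses the real column.
    return "result = sum(data['steps']) / len(data['steps'])"


outcome = answer_with_repair(
    "What is my average daily step count?",
    {"steps": [8000, 10000, 12000]},
    stub_generate,
)
print(outcome)
```

Logging `errors` and `recovered` per query gives the raw counts needed to track a code error rate over the 50-query prototype.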
Agent Features
Memory
- Short-term observations integrated per cycle (Observe stage)
- No long-term retrieval memory reported
Planning
- Iterative multistep planning (Thought stages)
- Error detection and recovery planning
Tool Use
- Python runtime (Pandas) for numeric analysis
- Google Search API for external knowledge
Frameworks
- ReAct
- CPAR (for synthetic data generation)
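To make the ReAct pattern concrete, here is a small sketch of a Thought, Act, Observe loop with two tools. Everything in it is illustrative: `Step`, `react_loop`, and `scripted_policy` are hypothetical names, the policy is a fixed script standing in for the LLM, the "python" tool evaluates a toy expression over hard-coded data, and the "search" tool returns a placeholder string rather than calling any search API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thought: str   # free-text reasoning for this cycle
    action: str    # tool name, or "finish"
    argument: str  # tool input, or the final answer when action == "finish"


def react_loop(question, policy, tools, max_steps=5):
    """Alternate Thought -> Act -> Observe until the policy finishes."""
    transcript = []  # list of (Step, observation) pairs seen so far
    for _ in range(max_steps):
        step = policy(question, transcript)               # Thought
        if step.action == "finish":
            return step.argument, transcript
        tool = tools.get(step.action)                     # Act
        observation = tool(step.argument) if tool else f"unknown tool: {step.action}"
        transcript.append((step, observation))            # Observe
    return None, transcript


# Toy tools (assumptions, not the real runtime or search backend).
TOY_DATA = {"sleep_hours": [7.0, 6.0, 8.0], "sum": sum, "len": len}
tools = {
    "python": lambda expr: str(eval(expr, {"__builtins__": {}}, TOY_DATA)),
    "search": lambda query: "(stubbed search snippet)",
}


def scripted_policy(question, transcript):
    """Hypothetical stand-in for the LLM: a fixed three-step script."""
    if len(transcript) == 0:
        return Step("I need the average sleep duration.", "python",
                    "sum(sleep_hours) / len(sleep_hours)")
    if len(transcript) == 1:
        return Step("I should look up general sleep guidance.", "search",
                    "recommended adult sleep duration")
    average, guidance = transcript[0][1], transcript[1][1]
    return Step("I have enough to answer.", "finish",
                f"Average sleep: {average} h. Context: {guidance}")


answer, transcript = react_loop("How is my sleep?", scripted_policy, tools)
print(answer)
```

Keeping the transcript as explicit (step, observation) pairs matches the short-term, per-cycle memory described above and makes each cycle easy to log and review.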
Is Agentic
true
Architectures
- ReAct agent loop (Thought->Act->Observe)
- Few-shot example selection using sentence-T5 + K-means
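The summary does not say how the embeddings and clusters are turned into few-shot examples. One plausible reading, sketched below as an assumption, is to cluster query embeddings with K-means and take the query nearest each centroid as an exemplar. The 2-D vectors are toy stand-ins for sentence-T5 embeddings, and `kmeans` and `pick_exemplars` are hypothetical helper names written in pure Python so the example runs without extra libraries.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means with a deterministic, evenly spaced initialization."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


def pick_exemplars(queries, embeddings, k):
    """Return one query per cluster: the one closest to the centroid."""
    centroids, assignment = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            best = min(members, key=lambda i: math.dist(embeddings[i], centroids[c]))
            chosen.append(queries[best])
    return chosen


queries = [
    "How many steps did I take last week?",
    "What was my average daily step count?",
    "Which day had my most steps?",
    "How long did I sleep on average?",
    "What was my shortest night of sleep?",
    "Did I sleep more on weekends?",
]
# Toy 2-D "embeddings": the first three are near each other, as are the last three.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]

exemplars = pick_exemplars(queries, embeddings, k=2)
print(exemplars)
```

Picking one exemplar per cluster is a simple way to get few-shot examples that cover distinct query types instead of several near-duplicates.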
Collaboration
- Human-in-the-loop evaluation and few-shot guidance
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No clinical trials or real-world outcome studies demonstrate effects on user behavior or health
- Evaluations rely largely on synthetic users (though appendix includes some real-user validation)
- Single base LLM (Gemini Ultra) used; cross-model generalization not proven
- Query and dataset curation used subjective thresholds (e.g., sampling 172 of ~3,000 open queries)
When Not To Use
- Do not use as a clinical diagnostic tool or sole basis for medical treatment
- Avoid for conditions that wearables cannot reliably measure (e.g., many internal diseases)
- Not a replacement for clinician judgment or validated medical devices
Failure Modes
- Hallucinated references to non-existent data columns or metrics
- Python code generation errors when indexing or joining tables
- Misinterpretation of user intent for ambiguous open-ended queries
- Limited personalization when user context beyond wearable data is needed
Core Entities
Models
- Gemini 1.0 Ultra
- Gemini 1.5 Pro (appendix validation)
- GPT-4 (chain-of-thought baseline)
- PH-LLM (comparative baseline)
Metrics
- Accuracy
- Scaled human reasoning score (mapped 1–5 to 0–100)
- Error rate (fraction of code responses that raise errors)
- Recovery rate (agent fixes after fatal error)
- Harm avoidance (annotator Yes/No)
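The metrics above can be computed along the following lines. This is a sketch under assumptions: the summary only says ratings are "mapped 1–5 to 0–100", so the linear rescale in `scale_rating` is a guess at that mapping, and the per-run record fields (`raised_error`, `recovered`) are hypothetical.

```python
def scale_rating(rating):
    """Map a 1-5 rating onto 0-100, assuming a simple linear rescale."""
    return (rating - 1) / 4 * 100


def error_rate(runs):
    """Fraction of code responses that raised an error."""
    return sum(r["raised_error"] for r in runs) / len(runs)


def recovery_rate(runs):
    """Among runs that raised an error, the fraction the agent later fixed."""
    failed = [r for r in runs if r["raised_error"]]
    if not failed:
        return 0.0
    return sum(r["recovered"] for r in failed) / len(failed)


# Hypothetical run log: four code responses, two of which raised an error.
runs = [
    {"raised_error": False, "recovered": False},
    {"raised_error": True, "recovered": True},
    {"raised_error": True, "recovered": False},
    {"raised_error": False, "recovered": False},
]

print(scale_rating(3))      # midpoint of the 1-5 scale under the linear assumption
print(error_rate(runs))
print(recovery_rate(runs))
```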
Datasets
- Objective personal health queries (4,000)
- Open-ended personal health queries (172 sampled, ~3,000 original)
- Synthetic wearable users (56 generated from 30k real users; 4 used in eval)

