PHIA: an agent that uses code + web search to turn wearable time-series into personalized health insights

June 10, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper demonstrates credible improvements with a large human and automatic evaluation, but lacks clinical trials and full external validation for deployment.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 65%

Authors

Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu

Links

Abstract / PDF

Why It Matters For Business

Agentic LLMs that run verified code and fetch trusted web facts can unlock personalized insights from wearable data—improving product value for health apps while reducing numeric errors and buggy analyses.

Who Should Care

Summary TLDR

The authors build PHIA, an LLM-driven agent that uses iterative planning, Python code execution, and web search to analyze wearable time-series and answer personal health queries. They release synthetic wearable users and 4,000+ query examples. In automatic tests, PHIA reaches 84% accuracy on objective numeric questions. In human ratings on open-ended queries, it outscores a strong code-generation baseline (68 vs 52 on a 0–100 scale), produces fewer code errors, and can recover from its own mistakes. The system is promising for personalized, data-driven wellness, but it has not been validated for clinical outcomes and should not be used for diagnosis.

Problem Statement

Wearable devices collect detailed time-series data, but current LLMs struggle to do correct numerical and contextual reasoning on raw wearable data. Users want personalized, actionable insights (e.g., does exercise improve my sleep), which require multi-step numeric analysis, time indexing, and domain knowledge that single-pass LLM responses often fail to provide.
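The "does exercise improve my sleep" question above can be made concrete with a small sketch. Everything here is illustrative: the daily records, the field meanings (active minutes, sleep minutes), and the values are invented, and the code uses only the Python standard library rather than the Pandas tooling the agent itself relies on. It shows the two steps the paragraph names: aligning two series by date, then running a numeric analysis on the aligned pairs.

```python
from datetime import date
from math import sqrt

# Hypothetical daily records; real wearable exports will look different.
activity = {  # active minutes per day
    date(2024, 1, 1): 20,
    date(2024, 1, 2): 45,
    date(2024, 1, 3): 60,
    date(2024, 1, 4): 10,
    date(2024, 1, 5): 30,
}
sleep = {  # sleep minutes per night; Jan 4 is missing on purpose
    date(2024, 1, 1): 400,
    date(2024, 1, 2): 430,
    date(2024, 1, 3): 455,
    date(2024, 1, 5): 420,
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def exercise_sleep_correlation(activity, sleep):
    # Step 1 (time indexing): keep only dates present in both series.
    days = sorted(set(activity) & set(sleep))
    # Step 2 (numeric analysis): correlate the aligned pairs.
    r = pearson([activity[d] for d in days], [sleep[d] for d in days])
    return len(days), r


n_days, r = exercise_sleep_correlation(activity, sleep)
print(f"{n_days} paired days, r = {r:.2f}")
```

A correlation over a handful of days says nothing about causation; the point is only that the answer depends on getting the date alignment and the arithmetic right, which is where single-pass text responses tend to slip.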

Main Contribution

Introduce PHIA: an agentic framework that combines iterative LLM planning (ReAct), code generation (Python/Pandas), and web search to analyze wearable time-series

Release evaluation data: 4,000 objective queries, ~172 human-evaluated open-ended queries, and 56 synthetic wearable users (4 used in eval) derived from 30k anonymized users

Key Findings

PHIA answers objective numeric wearable queries with high accuracy

Numbers: 84% exact-match accuracy on 4,000 objective queries

Practical Use: Use an agent that runs code when you need precise numeric answers from wearable tables instead of text-only LLM outputs

Evidence Ref: Section 4.3; Figure 3-A
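For teams that want to score their own system the same way, exact-match accuracy is straightforward to compute. The sketch below is an assumption about how such a scorer might look, not the paper's evaluation code: this summary does not state the matching rule, so the numeric tolerance (`tol`) and the case-insensitive string fallback are choices made for illustration.

```python
import math


def exact_match(predicted, expected, tol=0.0):
    """Compare numerically when both values parse as numbers, else as text.

    tol is an assumed absolute tolerance; 0.0 means strict equality.
    """
    try:
        return math.isclose(
            float(predicted), float(expected), rel_tol=0.0, abs_tol=tol
        )
    except (TypeError, ValueError):
        # Non-numeric answers (e.g. a weekday name): normalized string match.
        return str(predicted).strip().lower() == str(expected).strip().lower()


def exact_match_accuracy(predictions, references, tol=0.0):
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(exact_match(p, e, tol) for p, e in zip(predictions, references))
    return hits / len(references)


# Made-up answers: three of four match.
score = exact_match_accuracy(["7.5", 8, "Tuesday", 3], [7.5, 8.0, "tuesday", 4])
print(f"accuracy = {score:.0%}")
```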

PHIA gives better open-ended reasoning than a code-only baseline

Numbers: Overall reasoning score 68 vs 52 (scaled 0–100); 83% of responses rated 'acceptable' or better

Practical Use: For exploratory questions (recommendations, correlations), prefer agentic multi-step systems that can fetch external knowledge

Evidence Ref: Section 4.3; Figure 3-B

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 84% | - | - | 4,000 objective queries | Section 4.3; Figure 3-A | Figure 3-A
Accuracy | 74% | - | PHIA +10pp | 4,000 objective queries | Section 4.3; Figure 3-A | Figure 3-A

What To Try In 7 Days

Prototype a small agent pipeline: few-shot LLM + sandboxed Python (Pandas) to answer 50 objective wearable queries

Add a vetted web-search step for recommendations and compare ratings from 3 domain reviewers

Track code error rate and implement an observe-and-repair step to cut failed analyses
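The observe-and-repair step in the last item can be prototyped before any model is wired in. The following is a minimal sketch under stated assumptions: `fake_generate` is a hypothetical stand-in for the LLM call, the `result` variable convention is invented for this example, and plain `exec` is used only for brevity. It is not a sandbox, so a real prototype should run generated code in an isolated process or container.

```python
import traceback


def run_snippet(code, env):
    """Run generated code and return (ok, result_or_error_text).

    WARNING: exec is NOT a sandbox; this is for illustration only.
    """
    scope = dict(env)  # copy so a failed attempt cannot pollute the next one
    try:
        exec(code, scope)
        return True, scope.get("result")
    except Exception:
        return False, traceback.format_exc(limit=1)


def answer_with_repair(generate, question, env, max_attempts=3):
    """Call generate(question, feedback) until its code runs or attempts run out.

    feedback is None on the first attempt, then the previous error text.
    """
    feedback, errors = None, 0
    for attempt in range(1, max_attempts + 1):
        ok, out = run_snippet(generate(question, feedback), env)
        if ok:
            return {"answer": out, "attempts": attempt, "errors": errors}
        errors += 1
        feedback = out  # observe the failure and pass it back for repair
    return {"answer": None, "attempts": max_attempts, "errors": errors}


def fake_generate(question, feedback):
    """Hypothetical LLM stand-in: wrong key first, fixed after seeing the error."""
    if feedback is None:
        return "result = sum(data['step']) / len(data['step'])"
    return "result = sum(data['steps']) / len(data['steps'])"


outcome = answer_with_repair(
    fake_generate, "What is my average step count?", {"data": {"steps": [4000, 6000, 8000]}}
)
print(outcome)
```

Logging `errors` and `attempts` per query gives the code error rate and a rough recovery rate for free, which makes it easy to see whether the repair step is earning its extra model calls.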

Agent Features

Memory
Short-term observations integrated per cycle (Observe stage)
No long-term retrieval memory reported
Planning
Iterative multistep planning (Thought stages)
Error detection and recovery planning
Tool Use
Python runtime (Pandas) for numeric analysis
Google Search API for external knowledge
Frameworks
ReAct
CPAR (for synthetic data generation)
Is Agentic

Yes

Architectures
ReAct agent loop (Thought->Act->Observe)
Few-shot example selection using sentence-T5 + K-means
Collaboration
Human-in-the-loop evaluation and few-shot guidance
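
The Thought->Act->Observe loop listed above can be sketched with a scripted policy in place of the model. This is a toy illustration, not PHIA's implementation: `scripted_policy`, the tool names, the `rhr` field, and the stubbed search result are all hypothetical, and restricting `eval` builtins as done here is not a security boundary.

```python
def python_tool(expr, data):
    # Evaluate one expression against the user's data with a few helpers.
    # Restricting builtins like this is NOT a real sandbox.
    helpers = {"data": data, "sum": sum, "len": len, "min": min, "max": max}
    return eval(expr, {"__builtins__": {}}, helpers)


def search_tool(query, _data):
    # Placeholder: a real agent would call a web search API here.
    return f"[stub search result for: {query}]"


TOOLS = {"python": python_tool, "search": search_tool}


def react_loop(policy, question, data, max_steps=5):
    """Run Thought -> Act -> Observe cycles until the policy finishes."""
    history = []  # short-term memory: (thought, action, argument, observation)
    for _ in range(max_steps):
        thought, action, arg = policy(question, history)
        if action == "finish":
            history.append((thought, action, arg, None))
            return arg, history
        try:
            observation = TOOLS[action](arg, data)
        except Exception as exc:  # errors become observations the policy can see
            observation = f"error: {exc!r}"
        history.append((thought, action, arg, observation))
    return None, history


def scripted_policy(question, history):
    """Hypothetical stand-in for the LLM: a fixed three-step plan."""
    if not history:
        return ("I need the mean resting heart rate.", "python",
                "sum(data['rhr']) / len(data['rhr'])")
    if len(history) == 1:
        return ("I should check what is typical.", "search",
                "typical adult resting heart rate")
    mean_rhr = history[0][3]
    return ("I have enough to answer.", "finish",
            f"Your average resting heart rate was {mean_rhr:.1f} bpm.")


answer, trace = react_loop(scripted_policy, "How is my heart rate?", {"rhr": [60, 62, 64]})
print(answer)
```

Feeding tool errors back as observations, instead of raising them, is what lets a real policy attempt the error detection and recovery listed under Planning.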

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No clinical trials or real-world outcome studies to show behavior or health impact

Evaluations rely largely on synthetic users (though appendix includes some real-user validation)

When Not To Use

Do not use as a clinical diagnostic tool or sole basis for medical treatment

Avoid for conditions that wearables cannot reliably measure (e.g., many internal diseases)

Failure Modes

Hallucinated references to non-existent data columns or metrics

Python code generation errors when indexing or joining tables
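
One inexpensive guard against the first failure mode is to check any column names a plan refers to against the real schema before running code. The sketch below is a suggestion rather than something the paper describes, and the column names are invented for the example.

```python
import difflib


def check_columns(requested, available):
    """Return (missing, suggestions) for column names a plan wants to use."""
    available = list(available)
    missing = [name for name in requested if name not in available]
    # For each missing name, offer the closest real column, if any is close.
    suggestions = {
        name: difflib.get_close_matches(name, available, n=1) for name in missing
    }
    return missing, suggestions


# Hypothetical schema and a plan that misspells one column and invents another.
schema = ["date", "steps", "sleep_minutes", "resting_hr"]
missing, suggestions = check_columns(
    ["steps", "sleep_minuts", "stress_score"], schema
)
print(missing)
print(suggestions)
```

A misspelled name can be corrected automatically from the suggestion, while a name with no close match is better reported back to the agent as "this metric is not available" than silently guessed.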

Core Entities

Models

Gemini 1.0 Ultra
Gemini 1.5 Pro (appendix validation)
GPT-4 (chain-of-thought baseline)
PH-LLM (comparative baseline)

Metrics

Accuracy
Scaled human reasoning score (mapped 1–5 to 0–100)
Error rate (fraction of code responses that raise errors)
Recovery rate (agent fixes after fatal error)
Harm avoidance (annotator Yes/No)
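
Three of these metrics reduce to short formulas. The sketch below shows one plausible reading of each and should be treated as an assumption: this summary does not say whether the 1–5 to 0–100 mapping is linear, and it does not define the recovery-rate denominator, so a linear rescaling and "recovered cases over cases with a fatal error" are used here.

```python
def scale_rating(rating, lo=1, hi=5):
    """Linearly map a rating in [lo, hi] onto 0-100 (assumed linear mapping)."""
    if not lo <= rating <= hi:
        raise ValueError(f"rating must be between {lo} and {hi}")
    return (rating - lo) / (hi - lo) * 100


def error_rate(raised_error):
    """Fraction of code responses that raised an error (iterable of bools)."""
    flags = list(raised_error)
    if not flags:
        raise ValueError("no responses to score")
    return sum(flags) / len(flags)


def recovery_rate(episodes):
    """Among episodes with a fatal error, the fraction the agent later fixed.

    episodes: iterable of (had_fatal_error, recovered) boolean pairs.
    Returns None when no episode had a fatal error.
    """
    outcomes = [recovered for had_error, recovered in episodes if had_error]
    return sum(outcomes) / len(outcomes) if outcomes else None


# Made-up inputs, for illustration only.
print(scale_rating(3))
print(error_rate([True, False, False, False]))
print(recovery_rate([(True, True), (True, False), (False, False)]))
```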

Datasets

Objective personal health queries (4,000)
Open-ended personal health queries (172 sampled, ~3,000 original)
Synthetic wearable users (56 generated from 30k real users; 4 used in eval)