PHIA: an agent that uses code + web search to turn wearable time-series into personalized health insights

Overview

Decision Snapshot: Needs Validation

The paper demonstrates credible improvements with a large human and automatic evaluation, but lacks clinical trials and full external validation for deployment.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 65%

Authors

Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu

Links

Abstract / PDF

Why It Matters For Business

Agentic LLMs that run verified code and fetch trusted web facts can unlock personalized insights from wearable data—improving product value for health apps while reducing numeric errors and buggy analyses.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The authors build PHIA, an LLM-driven agent that uses iterative planning, Python code execution, and web search to analyze wearable time-series and answer personal health queries. They release synthetic wearable users and 4,000+ query examples. In automatic tests, PHIA reaches 84% accuracy on objective numeric questions. In human ratings on open-ended queries, it scores higher than a strong code-generation baseline (68 vs 52 on a 0–100 scaled score), makes fewer code errors, and can recover from mistakes. The system is promising for personalized, data-driven wellness, but it is not validated for clinical outcomes and should not be used for diagnosis.

Problem Statement

Wearable devices collect detailed time-series data, but current LLMs struggle with correct numerical and contextual reasoning over raw wearable data. Users want personalized, actionable insights (e.g., "Does exercise improve my sleep?"), which require multi-step numeric analysis, time indexing, and domain knowledge that single-pass LLM responses often fail to provide.
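
To make the multi-step nature of such a question concrete, the sketch below works through a tiny version of "does exercise improve my sleep". The records, field names, date window, and 30-minute threshold are all hypothetical, and plain Python stands in for the Pandas code an agent would typically generate. It illustrates the shape of the analysis, not the paper's implementation.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical daily summaries; none of these values come from the paper.
first_day = date(2024, 1, 1)
active_minutes = [45, 10, 60, 0, 35, 5, 50]
sleep_hours = [7.5, 6.0, 8.0, 6.5, 7.0, 6.0, 7.5]
records = [
    {"day": first_day + timedelta(days=i), "active_min": a, "sleep_h": s}
    for i, (a, s) in enumerate(zip(active_minutes, sleep_hours))
]

EXERCISE_THRESHOLD_MIN = 30  # assumed cutoff for calling a day an "exercise day"


def sleep_by_exercise(rows, since, threshold=EXERCISE_THRESHOLD_MIN):
    """Return (mean sleep on exercise days, mean sleep on rest days) from `since` on."""
    recent = [r for r in rows if r["day"] >= since]  # time-indexing step
    active = [r["sleep_h"] for r in recent if r["active_min"] >= threshold]
    rest = [r["sleep_h"] for r in recent if r["active_min"] < threshold]
    return mean(active), mean(rest)


active_mean, rest_mean = sleep_by_exercise(records, since=date(2024, 1, 2))
difference_h = active_mean - rest_mean
print(f"exercise days: {active_mean:.2f} h, rest days: {rest_mean:.2f} h, "
      f"difference: {difference_h:+.2f} h")
```

A difference in means over a handful of days is only descriptive; it does not establish that exercise causes better sleep.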

Main Contribution

Introduce PHIA: an agentic framework that combines iterative LLM planning (ReAct), code generation (Python/Pandas), and web search to analyze wearable time-series

Release evaluation data: 4,000 objective queries, ~172 human-evaluated open-ended queries, and 56 synthetic wearable users (4 used in eval) derived from 30k anonymized users
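
The Thought, Act, Observe cycle named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PHIA's code: `scripted_policy` is a hypothetical stand-in for the LLM, `run_python` and `web_search` are stub tools, and the step values are invented.

```python
from typing import Callable, Dict, List, Tuple


def run_python(expression: str) -> str:
    """Evaluate a small arithmetic expression.

    eval with empty builtins is NOT a real sandbox; it only keeps the sketch short.
    """
    return str(eval(expression, {"__builtins__": {}}, {}))


def web_search(query: str) -> str:
    """Stub for a search tool; returns a canned string instead of calling an API."""
    return f"(stub search result for: {query})"


TOOLS: Dict[str, Callable[[str], str]] = {"python": run_python, "search": web_search}
OBS_PREFIX = "Observation: "


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for the LLM: returns (thought, action, action_input)."""
    observations = [h for h in history if h.startswith(OBS_PREFIX)]
    if not observations:
        return ("Average the three daily step counts.", "python",
                "(8000 + 6000 + 10000) / 3")
    return ("The computation succeeded; report it.", "finish",
            observations[-1][len(OBS_PREFIX):])


def react_loop(question, policy, tools, max_steps=5):
    """Run Thought -> Act -> Observe until the policy finishes or steps run out."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg, history
        history.append(f"Action: {action}[{arg}]")
        history.append(OBS_PREFIX + tools[action](arg))
    return None, history


answer, trace = react_loop("What was my average step count?", scripted_policy, TOOLS)
print(answer)
```

Keeping the full trace in `history` is what lets each new thought condition on earlier observations.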

Key Findings

PHIA answers objective numeric wearable queries with high accuracy

Numbers: 84% exact-match accuracy on 4,000 objective queries

Practical Use: Use an agent that runs code when you need precise numeric answers from wearable tables instead of text-only LLM outputs

Evidence Ref: Section 4.3; Figure 3-A

PHIA gives better open-ended reasoning than a code-only baseline

Numbers: Overall reasoning score 68 vs 52 (scaled 0–100); 83% of responses rated 'acceptable' or better

Practical Use: For exploratory questions (recommendations, correlations), prefer agentic multi-step systems that can fetch external knowledge

Evidence Ref: Section 4.3; Figure 3-B

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (PHIA)	84%	—	—	4,000 objective queries	Section 4.3; Figure 3-A	Figure 3-A
Accuracy (comparison baseline)	74%	—	PHIA +10pp	4,000 objective queries	Section 4.3; Figure 3-A	Figure 3-A
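
For readers who want to reproduce this kind of number on their own queries, here is a minimal sketch of exact-match scoring and of the percentage-point delta. Rounding to two decimals before comparing is an assumption made for illustration; the summary does not state how the paper normalizes answers. The toy predictions are invented.

```python
def exact_match_accuracy(predictions, references, decimals=2):
    """Share of predictions that equal the reference after rounding."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one query")
    hits = sum(
        round(p, decimals) == round(r, decimals)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def delta_pp(system_accuracy, baseline_accuracy):
    """Difference between two accuracies, in percentage points."""
    return round((system_accuracy - baseline_accuracy) * 100, 1)


# Toy data, not from the paper: 4 of 5 answers match.
references = [8000.0, 7.25, 62.0, 431.0, 6.5]
predictions = [8000.0, 7.25, 61.0, 431.0, 6.5]
toy_accuracy = exact_match_accuracy(predictions, references)

# The two accuracies reported in the table above.
reported_delta = delta_pp(0.84, 0.74)
print(toy_accuracy, reported_delta)
```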

What To Try In 7 Days

Prototype a small agent pipeline: few-shot LLM + sandboxed Python (Pandas) to answer 50 objective wearable queries

Add a vetted web-search step for recommendations and compare ratings from 3 domain reviewers

Track code error rate and implement an observe-and-repair step to cut failed analyses
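
The last step can be prototyped before any LLM is wired in. The sketch below is one possible shape for an observe-and-repair loop with error-rate tracking. `hypothetical_repair` is a hard-coded stand-in for asking the model to fix its code, the `result` variable convention is an assumption, and bare `exec` is not a sandbox, so generated code should only run in an isolated runtime in practice.

```python
def run_snippet(code):
    """Execute code in a fresh namespace; return (result, error_message)."""
    namespace = {}
    try:
        exec(code, namespace)  # NOT sandboxed; for illustration only
        return namespace.get("result"), None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"


def hypothetical_repair(code, error):
    """Stand-in for an LLM repair call: only knows how to fix one misspelled key."""
    if "KeyError" in error:
        return code.replace("'step'", "'steps'")
    return code


def run_with_repair(code, max_repairs=2):
    """Run code, feeding any error back to the repair step up to max_repairs times."""
    result, error = run_snippet(code)
    first_failed = error is not None
    attempts = 0
    while error is not None and attempts < max_repairs:
        code = hypothetical_repair(code, error)
        attempts += 1
        result, error = run_snippet(code)
    return {
        "result": result,
        "first_failed": first_failed,
        "recovered": first_failed and error is None,
    }


snippets = [
    "row = {'steps': 9000}\nresult = row['steps']",  # runs cleanly
    "row = {'steps': 9000}\nresult = row['step']",   # KeyError, repairable here
    "result = 1 / 0",                                # error the stub cannot fix
]
outcomes = [run_with_repair(s) for s in snippets]

failed = sum(o["first_failed"] for o in outcomes)
error_rate = failed / len(outcomes)
recovery_rate = sum(o["recovered"] for o in outcomes) / failed
print(f"error rate: {error_rate:.2f}, recovery rate: {recovery_rate:.2f}")
```

Logging both rates per release makes it visible whether a prompt change reduces first-attempt errors or merely shifts work to the repair step.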

Agent Features

Memory

Short-term observations integrated per cycle (Observe stage)

No long-term retrieval memory reported

Planning

Iterative multistep planning (Thought stages)

Error detection and recovery planning

Tool Use

Python runtime (Pandas) for numeric analysis

Google Search API for external knowledge

Frameworks

ReAct

CPAR (for synthetic data generation)

Is Agentic

Yes

Architectures

ReAct agent loop (Thought->Act->Observe)

Few-shot example selection using sentence-T5 + K-means
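
One plausible reading of the few-shot selection step is: embed candidate queries, cluster them, and keep the query nearest each centroid so the prompt covers diverse question types. The sketch below follows that reading. It is an assumption, not the paper's procedure: the 2-D vectors are hand-made stand-ins for sentence-T5 embeddings, the queries are invented, and the K-means uses a simple deterministic initialization (the first k points) rather than a library implementation.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, initialized with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


def pick_representatives(queries, embeddings, k):
    """Return, for each non-empty cluster, the query closest to its centroid."""
    centroids, assignment = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            best = min(members, key=lambda i: math.dist(embeddings[i], centroids[c]))
            chosen.append(queries[best])
    return chosen


# Hypothetical queries with toy embeddings (step-like vs sleep-like directions).
queries = [
    "total steps last week", "total sleep last night",
    "average daily steps", "average bedtime",
    "steps trend this month", "sleep trend this month",
]
embeddings = [
    (1.0, 0.0), (0.0, 1.0),
    (0.8, 0.2), (0.2, 0.8),
    (0.9, 0.1), (0.1, 0.9),
]
few_shot = pick_representatives(queries, embeddings, k=2)
print(few_shot)
```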

Collaboration

Human-in-the-loop evaluation and few-shot guidance

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

No clinical trials or real-world outcome studies demonstrate effects on user behavior or health

Evaluations rely largely on synthetic users (though the appendix includes some real-user validation)

When Not To Use

Do not use as a clinical diagnostic tool or sole basis for medical treatment

Avoid for conditions that wearables cannot reliably measure (e.g., many internal diseases)

Failure Modes

Hallucinated references to non-existent data columns or metrics

Python code generation errors when indexing or joining tables
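
The first failure mode can be caught cheaply before generated code runs. The sketch below, which is an illustrative guard and not something the paper describes, parses a snippet with Python's `ast` module and flags string subscripts that are not in a known schema. The column names and the generated snippet are hypothetical, and the check only sees literal subscripts such as `df["steps"]`, not attribute access or dynamically built names.

```python
import ast

# Hypothetical schema for a daily-summary table.
KNOWN_COLUMNS = {"date", "steps", "sleep_minutes", "resting_heart_rate"}


def referenced_columns(code):
    """Collect string literals used inside subscripts, e.g. df["steps"]."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Subscript):
            for inner in ast.walk(node.slice):
                if isinstance(inner, ast.Constant) and isinstance(inner.value, str):
                    names.add(inner.value)
    return names


def unknown_columns(code, known=KNOWN_COLUMNS):
    """Return referenced column names that are missing from the schema."""
    return referenced_columns(code) - set(known)


generated = 'result = df["steps"].mean() / df["stress_score"].mean()'
missing = unknown_columns(generated)
print(sorted(missing))
```

Returning the missing names, rather than a boolean, gives the agent something specific to correct on its next step.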

Core Entities

Models

Gemini 1.0 Ultra

Gemini 1.5 Pro (appendix validation)

GPT-4 (chain-of-thought baseline)

PH-LLM (comparative baseline)

Metrics

Accuracy

Scaled human reasoning score (mapped 1–5 to 0–100)

Error rate (fraction of code responses that raise errors)

Recovery rate (agent fixes after fatal error)

Harm avoidance (annotator Yes/No)
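
The scaled reasoning score maps 1–5 annotator ratings onto 0–100. A linear mapping is the simplest way to do that and is assumed here; the summary does not give the exact formula. The ratings in the example are invented.

```python
from statistics import mean


def scale_rating(rating, low=1, high=5):
    """Linearly map a rating in [low, high] onto 0-100 (assumed mapping)."""
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} is outside [{low}, {high}]")
    return (rating - low) / (high - low) * 100


def mean_scaled_score(ratings):
    """Average the scaled scores over a set of annotator ratings."""
    return mean(scale_rating(r) for r in ratings)


toy_ratings = [4, 3, 5, 4]  # hypothetical annotator ratings
toy_score = mean_scaled_score(toy_ratings)
print(toy_score)
```

Under this mapping a rating of 3, the midpoint of the scale, becomes 50.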

Datasets

Objective personal health queries (4,000)

Open-ended personal health queries (172 sampled, ~3,000 original)

Synthetic wearable users (56 generated from 30k real users; 4 used in eval)
