Use tool-calling to distill dialog into search queries and boost medical evidence retrieval.

Overview

Decision Snapshot: Needs Validation

The proposal is pragmatic: teach an LLM to output search queries and index an entity-attribute DB; empirical gains are shown on a small, curated benchmark.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang

Why It Matters For Business

Converting long patient dialogs into crisp search queries with an LLM tool call improves retrieval of official drug information, letting products deliver more evidence-backed medication advice without trusting LLM memory alone.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The authors introduce MedicineQA, a 300-sample multi-round medication-consultation benchmark, and RagPULSE, a retrieval-augmented LLM pipeline that uses tool calling to turn dialogue history into search queries (Distill-Retrieve-Read). Fine-tuned on a synthetic distillation dataset, RagPULSE improves evidence retrieval (HR@1) and generation quality over several open models and is competitive with commercial products on MedicineQA. The work highlights that naive Retrieve-then-Read often fails on long, noisy medical dialogs and that query distillation improves retrieval accuracy in practice.

Problem Statement

Medical consultation dialogs are long, noisy, and use lay terms. Standard Retrieve-then-Read RAG pipelines often fail to construct effective search queries from multi-turn history, causing missed evidence and hallucinations in answers.

Main Contribution

MedicineQA: a benchmark of 300 multi-round medication-consultation dialogs with document- and attribute-level retrieval labels.

RagPULSE: a Distill-Retrieve-Read pipeline that uses tool-calling (LLM-generated search queries) to query an entity-oriented medicine database.
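The three stages can be sketched as plain function composition. This is a minimal illustration, not the authors' implementation: the `Evidence` type, the `distill_retrieve_read` function, and the toy stand-ins below are all hypothetical names, and the evidence text is a placeholder rather than real drug information.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    medicine: str
    attribute: str
    text: str


def distill_retrieve_read(
    dialog: List[str],
    distill: Callable[[List[str]], str],
    retrieve: Callable[[str], List[Evidence]],
    read: Callable[[List[str], List[Evidence]], str],
) -> str:
    """Run the three stages in order: dialog -> query -> evidence -> answer."""
    query = distill(dialog)          # Distill: compress the history into a search query
    evidence = retrieve(query)       # Retrieve: look the query up in the medicine DB
    return read(dialog, evidence)    # Read: answer with the evidence in context


# Toy stand-ins so the sketch runs end to end (assumptions, not the paper's components).
def toy_distill(dialog: List[str]) -> str:
    # A fine-tuned LLM would emit this query as a tool call.
    return "genericx contraindications"


def toy_retrieve(query: str) -> List[Evidence]:
    db = [Evidence("genericx", "contraindications", "placeholder label text")]
    return [e for e in db if e.medicine in query and e.attribute in query]


def toy_read(dialog: List[str], evidence: List[Evidence]) -> str:
    if not evidence:
        return "No evidence found."
    cited = "; ".join(f"{e.medicine}/{e.attribute}" for e in evidence)
    return f"Answer grounded in: {cited}"


answer = distill_retrieve_read(
    ["I have a stomach ulcer.", "Can I take BrandX for my headache?"],
    toy_distill, toy_retrieve, toy_read,
)
```

Keeping the stages as swappable callables makes it easy to compare a Retrieve-then-Read baseline (where `distill` simply returns the last user turn) against a distilled query on the same retriever.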

Key Findings

Tool-calling distillation improves coarse document retrieval (HR@1).

Numbers: RagPULSE (7B) document HR@1 = 63.67% vs PULSE (7B) = 53.00% (+10.67 pp).

Practical Use: Fine-tuning an LLM to output search queries (tool calls) raised top-1 document retrieval by about 10 percentage points on MedicineQA; gains on similar medical dialog tasks are plausible but need validation.

Evidence Ref: Table 3 (RagPULSE vs PULSE, HR@1)

Fine-grained attribute retrieval also improves with distillation.

Numbers: RagPULSE (7B) attribute HR@1 = 28.33% vs PULSE (7B) = 18.00% (+10.33 pp).

Practical Use: Distilling queries helps find the right medicine attribute (usage, contraindication) more often; the reported gain on MedicineQA is about 10 percentage points, though absolute attribute-level HR@1 remains low.

Evidence Ref: Table 3 (attribute HR@1)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Document retrieval HR@1 (coarse)	RagPULSE (7B) = 63.67%	PULSE (7B) = 53.00%	+10.67 pp	MedicineQA	Top-1 document hit rate improved after distillation fine-tune	Table 3
Attribute retrieval HR@1 (fine)	RagPULSE (7B) = 28.33%	PULSE (7B) = 18.00%	+10.33 pp	MedicineQA	Top-1 attribute hit rate improved with distillation	Table 3
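For readers who want to reproduce the metric on their own data, HR@k (hit rate at k) is the fraction of queries whose gold item appears among the top k retrieved results. The sketch below uses hypothetical function names and also recomputes the deltas from the reported percentages. As a side note, the reported values are arithmetically consistent with 191/300 and 159/300 (document) and 85/300 and 54/300 (attribute) hits, but those counts are an inference, not figures taken from the paper.

```python
from typing import Sequence


def hit_rate_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold item is in the top-k ranked results."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same number of queries")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold:
        return 0.0
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


def delta_pp(value: float, baseline: float) -> float:
    """Difference between two rates, in percentage points, rounded to 2 decimals."""
    return round((value - baseline) * 100, 2)


# Toy example: three queries, gold found at rank 1, rank 2, and not at all.
ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
gold = ["a", "d", "x"]
hr1 = hit_rate_at_k(ranked, gold, k=1)
hr2 = hit_rate_at_k(ranked, gold, k=2)

# Deltas recomputed from the reported percentages in the table above.
doc_delta = delta_pp(0.6367, 0.5300)
attr_delta = delta_pp(0.2833, 0.1800)
```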

What To Try In 7 Days

Fine-tune an LLM to output short search queries (tool calls) from multi-turn dialog using a small synthetic distillation set.

Index an entity-oriented medicine DB (brand/generic + attributes) and measure HR@1 on a held-out set.

Compare answers with and without retrieved evidence using pairwise human or GPT-4 judging to estimate practical quality lift.
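For the indexing step, one possible shape for an entity-oriented store is a mapping from medicine to attribute to text, with a brand-to-generic alias table in front. The sketch below is an assumption-laden toy: `rank_attributes`, the alias table, and the token-overlap scoring are illustrative choices (a real system would likely use a proper search engine or embeddings), and `genericx` / `brandx` are placeholder names.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical store: medicine name -> attribute name -> attribute text.
MedicineDB = Dict[str, Dict[str, str]]


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def rank_attributes(
    db: MedicineDB, aliases: Dict[str, str], query: str
) -> List[Tuple[str, str]]:
    """Return (medicine, attribute) pairs, best match first."""
    q = tokens(query)
    # Coarse step: resolve any brand or generic name mentioned in the query.
    medicines = sorted({aliases[t] for t in q if t in aliases})
    scored = []
    for med in medicines:
        for attr in db.get(med, {}):
            overlap = len(q & tokens(attr))
            scored.append((overlap, med, attr))
    # Fine step: highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(med, attr) for _, med, attr in scored]


db: MedicineDB = {
    "genericx": {
        "usage": "placeholder usage text",
        "contraindications": "placeholder contraindication text",
    }
}
aliases = {"genericx": "genericx", "brandx": "genericx"}

results = rank_attributes(db, aliases, "brandx contraindications")
```

The first element of `results` is what a top-1 attribute hit would be scored against; the medicine it names is what a top-1 document hit would be scored against.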

Agent Features

Memory

retrieval memory (external DB)

Planning

Distill then retrieve then read

Tool Use

search_engine tool calling

function calling (generate structured query)
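A structured search call might look like the following. The tool name `search_engine` comes from the summary above; the JSON layout (`name`, `arguments`, `query`) and the `parse_search_call` helper are assumptions modeled on common function-calling conventions, not the paper's actual format.

```python
import json
from typing import Optional

# Hypothetical tool description in a JSON-Schema style.
SEARCH_TOOL = {
    "name": "search_engine",
    "description": "Search the medicine database with a short keyword query.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def parse_search_call(raw: str) -> Optional[str]:
    """Return the query from a model-emitted tool call, or None if it is malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or call.get("name") != SEARCH_TOOL["name"]:
        return None
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        return None
    return query.strip()


good = parse_search_call(
    '{"name": "search_engine", "arguments": {"query": " genericx dosage "}}'
)
bad = parse_search_call("I think you should search for genericx")
```

Returning `None` on malformed output gives the caller an explicit signal to retry or fall back, rather than passing free text to the retriever.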

Frameworks

Distill-Retrieve-Read

Is Agentic

Yes

Architectures

PULSE (base LLM used)

Optimization Features

System Optimization

single-machine 8x A100 feasible for PULSE

Training Optimization

mixed-precision (BFloat16 forward/backward, Float32 optimizer)

tensor parallelism and ZeRO data parallelism

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

MedicineQA is small (300 cases) and focused on 200 common medicines, so results may not generalize to all drugs or rare cases.

Benchmark and evaluation rely on GPT-4 and Elo scoring, which can introduce judge bias.
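To make the Elo dependence concrete, here is the standard Elo update applied to one pairwise judgment. The K-factor of 32 and the starting rating of 1000 are conventional defaults used here as assumptions; the settings used in the paper are not given in this summary.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(
    r_a: float, r_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if the judge preferred A, 0.0 if B, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two models start equal; the judge prefers model A once.
rating_a, rating_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because every rating change flows from a judge's preference, any systematic bias in the judge (for example toward longer answers) shifts the final ranking directly.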

When Not To Use

In high-stakes clinical decisions without human oversight or a validated local database.

When no structured, authoritative medicine database is available.

Failure Modes

LLM produces incomplete or imprecise search keywords, causing missed evidence.

Retrieved evidence lacks the exact attribute needed (low attribute HR@1).

Core Entities

Models

RagPULSE, PULSE, ChatGPT-3.5, Baichuan2, QWen2, MING, ChatGLM3, DoctorGLM, BianQue2

Metrics

HR@1, HR@5, HR@10, Elo rating

Datasets

MedicineQA (300 multi-round queries)

Synthetic distillation dataset (for fine-tuning)

Benchmarks

MedicineQA
