Use tool-calling to distill dialog into search queries and boost medical evidence retrieval.

April 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The proposal is pragmatic: teach an LLM to output search queries and run them against an indexed entity-attribute DB. Empirical gains are shown only on a small, curated benchmark.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang

Links

Abstract / PDF

Why It Matters For Business

Converting long patient dialogs into crisp search queries with an LLM tool call improves retrieval of official drug information, letting products deliver more evidence-backed medication advice without trusting LLM memory alone.

Summary TLDR

The authors introduce MedicineQA, a 300-sample multi-round medication-consultation benchmark, and RagPULSE, a retrieval-augmented LLM pipeline that uses 'tool calling' to turn dialogue history into search queries (Distill-Retrieve-Read). Fine-tuned on a synthetic distillation dataset, RagPULSE improves evidence retrieval (HR@1) and generation quality versus several open models and rivals commercial products on their benchmark. The work highlights that naive Retrieve-then-Read often fails on long, noisy medical dialogs and that query distillation helps practical search accuracy.

Problem Statement

Medical consultation dialogs are long, noisy, and use lay terms. Standard Retrieve-then-Read RAG pipelines often fail to construct effective search queries from multi-turn history, causing missed evidence and hallucinations in answers.

Main Contribution

MedicineQA: a benchmark of 300 multi-round medication-consultation dialogs with document- and attribute-level retrieval labels.

RagPULSE: a Distill-Retrieve-Read pipeline that uses tool-calling (LLM-generated search queries) to query an entity-oriented medicine database.
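The three stages of Distill-Retrieve-Read can be sketched in a few functions. This is a minimal illustration, not the authors' implementation: `TOY_DB`, `Evidence`, and the keyword-matching `distill` stub are hypothetical stand-ins (a real system would call the fine-tuned LLM there), and the evidence strings are placeholders rather than real drug information.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Evidence:
    medicine: str
    attribute: str
    text: str


# Hypothetical entity-oriented store: medicine name -> attribute -> text.
TOY_DB: Dict[str, Dict[str, str]] = {
    "ibuprofen": {
        "usage": "Placeholder usage text for ibuprofen.",
        "contraindication": "Placeholder contraindication text for ibuprofen.",
    },
    "amoxicillin": {
        "usage": "Placeholder usage text for amoxicillin.",
    },
}


def distill(dialog: List[str]) -> dict:
    """Stand-in for the fine-tuned LLM: turn dialog history into a tool call.

    This stub only scans for known medicine names and two trigger words,
    so the example stays runnable without a model.
    """
    text = " ".join(dialog).lower()
    medicine = next((name for name in TOY_DB if name in text), "")
    wants_safety = "avoid" in text or "safe" in text
    attribute = "contraindication" if wants_safety else "usage"
    return {"tool": "search_engine", "medicine": medicine, "attribute": attribute}


def retrieve(call: dict) -> Optional[Evidence]:
    """Look up the requested attribute of the requested medicine."""
    text = TOY_DB.get(call["medicine"], {}).get(call["attribute"])
    if text is None:
        return None
    return Evidence(call["medicine"], call["attribute"], text)


def read(dialog: List[str], evidence: Optional[Evidence]) -> str:
    """Build the prompt an answering LLM would receive (no model is called)."""
    if evidence is None:
        return "No evidence found; the answer should say so explicitly."
    header = f"Evidence [{evidence.medicine}/{evidence.attribute}]: {evidence.text}"
    return f"{header}\nQuestion: {dialog[-1]}"
```

Chaining the calls as `read(dialog, retrieve(distill(dialog)))` mirrors the pipeline order. Keeping the distilled query as structured data makes it possible to score retrieval separately from answer generation.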

Key Findings

Tool-calling distillation improves coarse document retrieval (HR@1).

Numbers: RagPULSE (7B) document HR@1 = 63.67% vs PULSE (7B) = 53.00% (+10.67 pp).

Practical Use: Fine-tune an LLM to output search queries (tool calls); on similar medical dialog tasks this may raise top-1 document retrieval by roughly 10 percentage points.

Evidence Ref: Table 3 (RagPULSE vs PULSE, HR@1)

Fine-grained attribute retrieval also improves with distillation.

Numbers: RagPULSE (7B) attribute HR@1 = 28.33% vs PULSE (7B) = 18.00% (+10.33 pp).

Practical Use: Distilling queries helps find the right medicine attribute (for example usage or contraindication) more often. The benchmark shows a gain of about 10 percentage points in attribute-level retrieval, though absolute attribute HR@1 stays below 30%.

Evidence Ref: Table 3 (attribute HR@1)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Document retrieval HR@1 (coarse) | RagPULSE (7B) = 63.67% | PULSE (7B) = 53.00% | +10.67 pp | MedicineQA | Top-1 document hit rate improved after distillation fine-tune | Table 3
Attribute retrieval HR@1 (fine) | RagPULSE (7B) = 28.33% | PULSE (7B) = 18.00% | +10.33 pp | MedicineQA | Top-1 attribute hit rate improved with distillation | Table 3
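HR@k here is read as the fraction of queries whose labeled item appears among the top k retrieved results. The sketch below computes it under that assumption and rechecks the two reported deltas. The function name `hit_rate_at_k` and the toy ids are invented for illustration and are not taken from MedicineQA.

```python
from typing import Sequence


def hit_rate_at_k(
    ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str], k: int
) -> float:
    """Fraction of queries whose gold id appears in the top-k retrieved ids."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(gold in list(ranked)[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Toy retrieval output: one ranked list per query, plus the gold id per query.
ranked = [["d1", "d7"], ["d4", "d2"], ["d9", "d3"], ["d5", "d6"]]
gold = ["d1", "d2", "d8", "d5"]

hr1 = hit_rate_at_k(ranked, gold, k=1)  # only d1 and d5 are ranked first
hr2 = hit_rate_at_k(ranked, gold, k=2)  # d2 is additionally found at rank 2

# Deltas recomputed from the reported percentages, in percentage points.
doc_delta = round(63.67 - 53.00, 2)
attr_delta = round(28.33 - 18.00, 2)
print(hr1, hr2, doc_delta, attr_delta)
```

The deltas are absolute differences in percentage points, not relative improvements, which matters when comparing the document-level and attribute-level rows.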

What To Try In 7 Days

Fine-tune an LLM to output short search queries (tool calls) from multi-turn dialog using a small synthetic distillation set.

Index an entity-oriented medicine DB (brand/generic + attributes) and measure HR@1 on a held-out set.

Compare answers with and without retrieved evidence using pairwise human or GPT-4 judging to estimate practical quality lift.
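For the pairwise-judging step, one common way to aggregate win/loss verdicts is an Elo rating. The sketch below uses the standard Elo update; the K-factor of 32, the initial rating of 1000, and the system names are assumptions for illustration, since the summary does not state the settings the authors used.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def run_tournament(
    judgments: Iterable[Tuple[str, str, float]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> Dict[str, float]:
    """Apply judge verdicts (system_a, system_b, score_a) in order."""
    ratings: Dict[str, float] = {}
    for system_a, system_b, score_a in judgments:
        ra = ratings.setdefault(system_a, initial)
        rb = ratings.setdefault(system_b, initial)
        ratings[system_a], ratings[system_b] = update_elo(ra, rb, score_a, k)
    return ratings


# Hypothetical verdicts from a human or LLM judge.
verdicts = [
    ("with_evidence", "without_evidence", 1.0),
    ("with_evidence", "without_evidence", 0.5),
]
ratings = run_tournament(verdicts)
```

Elo results depend on the order of comparisons, so shuffling the verdicts and averaging over several runs gives a steadier estimate. Swapping the answer order shown to the judge helps reduce position bias.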

Agent Features

Memory
retrieval memory (external DB)
Planning
Distill then retrieve then read
Tool Use
search_engine tool calling, function calling (generate structured query)
Frameworks
Distill-Retrieve-Read
Is Agentic

Yes

Architectures
PULSE (base LLM used)
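The `search_engine` tool listed under Tool Use could be declared and validated roughly as follows. The tool name comes from the summary above; the JSON-Schema-style layout, the single `query` parameter, and the `parse_tool_call` helper are assumptions modeled on common function-calling conventions, not the paper's actual interface.

```python
import json

# Hypothetical tool declaration in a JSON-Schema-like shape.
SEARCH_TOOL = {
    "name": "search_engine",
    "description": "Search the medicine database with a distilled query.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Short keyword query."},
        },
        "required": ["query"],
    },
}


def parse_tool_call(raw: str) -> str:
    """Validate a model-emitted tool call (JSON text) and return the query."""
    call = json.loads(raw)
    if call.get("name") != SEARCH_TOOL["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    required = SEARCH_TOOL["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    query = args["query"]
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    return query.strip()


raw_call = '{"name": "search_engine", "arguments": {"query": " ibuprofen contraindication "}}'
query = parse_tool_call(raw_call)
```

Rejecting malformed calls before they reach the database keeps the "incomplete or imprecise search keywords" failure mode visible in logs instead of silently returning empty results.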

Optimization Features

System Optimization
single-machine 8x A100 feasible for PULSE
Training Optimization
mixed-precision (BFloat16 forward/backward, Float32 optimizer), tensor parallelism and ZeRO data parallelism

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

MedicineQA is small (300 cases) and focused on 200 common medicines, so results may not generalize to all drugs or rare cases.

Benchmark and evaluation rely on GPT-4 and Elo scoring, which can introduce judge bias.

When Not To Use

In high-stakes clinical decisions without human oversight or a validated local DB.

When no structured, authoritative medicine database is available.

Failure Modes

LLM produces incomplete or imprecise search keywords, causing missed evidence.

Retrieved evidence lacks the exact attribute needed (low attribute HR@1).

Core Entities

Models

RagPULSE, PULSE, ChatGPT-3.5, Baichuan2, QWen2, MING, ChatGLM3, DoctorGLM, BianQue2

Metrics

HR@1, HR@5, HR@10, Elo rating

Datasets

MedicineQA (300 multi-round queries), Synthetic distillation dataset (for fine-tuning)

Benchmarks

MedicineQA