Use tool-calling to distill dialog into search queries and boost medical evidence retrieval.

April 27, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang

Links

Abstract / PDF

Why It Matters For Business

Converting long patient dialogs into crisp search queries with an LLM tool call improves retrieval of official drug information, letting products deliver more evidence-backed medication advice without trusting LLM memory alone.

Summary TLDR

The authors introduce MedicineQA, a 300-sample multi-round medication-consultation benchmark, and RagPULSE, a retrieval-augmented LLM pipeline that uses 'tool calling' to turn dialogue history into search queries (Distill-Retrieve-Read). Fine-tuned on a synthetic distillation dataset, RagPULSE improves evidence retrieval (HR@1) and generation quality versus several open models and rivals commercial products on their benchmark. The work highlights that naive Retrieve-then-Read often fails on long, noisy medical dialogs and that query distillation helps practical search accuracy.

Problem Statement

Medical consultation dialogs are long, noisy, and use lay terms. Standard Retrieve-then-Read RAG pipelines often fail to construct effective search queries from multi-turn history, causing missed evidence and hallucinations in answers.

Main Contribution

MedicineQA: a benchmark of 300 multi-round medication-consultation dialogs with document-level and attribute-level retrieval labels.

RagPULSE: a Distill-Retrieve-Read pipeline that uses tool-calling (LLM-generated search queries) to query an entity-oriented medicine database.

A synthetic dataset and fine-tuning procedure that teaches the LLM to distill dialogue history into concise search queries, which improves retrieval and answer quality.
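The three stages named above (distill, retrieve, read) can be sketched end to end. This is a minimal illustration, not the authors' implementation: `distill` is a hypothetical stub standing in for the fine-tuned LLM tool call, the two-entry `DOCS` list is placeholder text rather than a real medicine database, and retrieval is plain token overlap.

```python
from dataclasses import dataclass


@dataclass
class Document:
    medicine: str
    text: str


# Toy stand-in for an entity-oriented medicine database.
# The texts are placeholders, not real prescribing information.
DOCS = [
    Document("ibuprofen", "ibuprofen: dosage, take instructions, stomach precautions (placeholder)"),
    Document("amoxicillin", "amoxicillin: dosage, course length, allergy warnings (placeholder)"),
]


def tokens(text: str) -> list[str]:
    """Lowercase whitespace tokens with trailing punctuation stripped."""
    return [w.strip(";:,.?!").lower() for w in text.split()]


def distill(history: list[str]) -> str:
    """Hypothetical stub for the LLM tool call that emits a search query.

    A real system would call the fine-tuned model here. This stub keeps
    only dialog words that also occur in the database vocabulary.
    """
    vocab = {t for d in DOCS for t in tokens(d.text)}
    kept: list[str] = []
    for turn in history:
        for t in tokens(turn):
            if t in vocab and t not in kept:
                kept.append(t)
    return " ".join(kept)


def retrieve(query: str, k: int = 1) -> list[Document]:
    """Rank documents by token overlap with the distilled query."""
    q = set(query.split())
    ranked = sorted(DOCS, key=lambda d: len(q & set(tokens(d.text))), reverse=True)
    return ranked[:k]


def read(history: list[str], evidence: list[Document]) -> str:
    """Build the prompt the answering LLM would receive."""
    context = "\n".join(f"[{d.medicine}] {d.text}" for d in evidence)
    dialog = "\n".join(history)
    return f"Evidence:\n{context}\n\nDialog:\n{dialog}\n\nAnswer using only the evidence."


history = [
    "I have had a headache for days and my stomach is sensitive.",
    "Can I take ibuprofen? What dosage?",
]
query = distill(history)
evidence = retrieve(query)
prompt = read(history, evidence)
```

The point of the separate `distill` step is that the retriever sees a short keyword query instead of the full multi-turn history, which is the failure case the paper attributes to plain Retrieve-then-Read.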

Key Findings

Tool-calling distillation improves coarse document retrieval (HR@1).

Numbers: RagPULSE (7B) document HR@1 = 63.67% vs PULSE (7B) = 53.00% (+10.67 pp).

Fine-grained attribute retrieval also improves with distillation.

Numbers: RagPULSE (7B) attribute HR@1 = 28.33% vs PULSE (7B) = 18.00% (+10.33 pp).

RagPULSE matches or exceeds some commercial baselines on the benchmark.

Numbers: RagPULSE (20B) doc HR@1 = 65.67% vs ChatGPT-3.5 doc HR@1 = 63.67%; RagPULSE (20B) Elo = 1074 (rank 1).
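HR@k (hit rate at k) is the fraction of queries whose labeled evidence appears among the top k retrieved items. The sketch below uses made-up rankings and IDs purely for illustration. The last two lines assume every one of the 300 MedicineQA samples is scored, in which case 63.67% and 53.00% would correspond to 191 and 159 hits.

```python
def hit_rate_at_k(ranked_lists: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold item is within the top-k results."""
    if len(ranked_lists) != len(gold_ids):
        raise ValueError("need one ranked list per gold label")
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Illustrative rankings for four queries (IDs are invented).
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["a", "c", "b"]]
gold = ["a", "a", "a", "b"]

hr1 = hit_rate_at_k(ranked, gold, 1)
hr2 = hit_rate_at_k(ranked, gold, 2)
hr3 = hit_rate_at_k(ranked, gold, 3)

# Assumption: all 300 benchmark samples are scored.
ragpulse_7b = round(100 * 191 / 300, 2)
pulse_7b = round(100 * 159 / 300, 2)
```

Under that assumption, the +10.67 pp gap between the two 7B models amounts to 32 additional queries out of 300 with the correct document ranked first.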

Results

Document retrieval HR@1 (coarse)

Value: RagPULSE (7B) = 63.67%

Baseline: PULSE (7B) = 53.00%

Attribute retrieval HR@1 (fine)

Value: RagPULSE (7B) = 28.33%

Baseline: PULSE (7B) = 18.00%

Elo rating (generation quality judged by GPT-4)

Value: RagPULSE (20B) = 1074 (rank 1)

Baseline: ChatGPT-3.5 = 1072 (rank 2)
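For readers unfamiliar with Elo, the sketch below shows the standard update rule applied to pairwise judgments. The K-factor of 32 and the example ratings of 1000 are assumptions for illustration; this summary does not state the settings the authors used.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if the judge prefers A, 0.0 if B, 0.5 for a tie.
    k=32 is an assumed K-factor, not taken from the paper.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta


# One judged win for A between two equally rated models.
new_a, new_b = update(1000.0, 1000.0, 1.0)

# Expected score implied by the reported ratings 1074 vs 1072.
p_top = expected_score(1074.0, 1072.0)
```

Under the standard formula, a 2-point rating gap implies an expected score only slightly above 0.5, so the rank 1 versus rank 2 ordering is best read as a near tie.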

Who Should Care

What To Try In 7 Days

Fine-tune an LLM to output short search queries (tool calls) from multi-turn dialog using a small synthetic distillation set.

Index an entity-oriented medicine DB (brand/generic + attributes) and measure HR@1 on a held-out set.

Compare answers with and without retrieved evidence using pairwise human or GPT-4 judging to estimate practical quality lift.
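For the indexing step above, one possible shape for an entity-oriented store is a mapping from generic name to attributes, plus a brand-to-generic alias table. Everything here is hypothetical: the schema, the `lookup` helper, and the placeholder attribute texts, which would come from official labels in a real system.

```python
# Generic name -> attribute -> text. Values are placeholders only.
MEDICINE_DB = {
    "ibuprofen": {
        "dosage": "<dosage text from the official label>",
        "contraindications": "<contraindication text from the official label>",
    },
    "acetaminophen": {
        "dosage": "<dosage text from the official label>",
    },
}

# Brand name -> generic name.
ALIASES = {"advil": "ibuprofen", "tylenol": "acetaminophen"}


def lookup(query: str):
    """Resolve a distilled query to (entity, attribute, evidence).

    Returns None if no known medicine is mentioned. If the medicine is
    found but no attribute matches, returns the whole document
    (document-level hit) with attribute set to None.
    """
    toks = query.lower().split()
    entity = next(
        (ALIASES.get(t, t) for t in toks if ALIASES.get(t, t) in MEDICINE_DB),
        None,
    )
    if entity is None:
        return None
    attribute = next((t for t in toks if t in MEDICINE_DB[entity]), None)
    if attribute is None:
        return entity, None, MEDICINE_DB[entity]
    return entity, attribute, MEDICINE_DB[entity][attribute]


attr_hit = lookup("advil dosage")
doc_hit = lookup("tylenol")
miss = lookup("unknownmed dosage")
```

Keeping document-level and attribute-level hits distinct mirrors the two retrieval labels in MedicineQA, so both HR@1 variants can be measured against the same store.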

Agent Features

Memory

  • retrieval memory (external DB)

Planning

  • Distill then retrieve then read

Tool Use

  • search_engine tool calling
  • function calling (generate structured query)
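A structured tool call is only useful if the caller validates it before searching. The sketch below assumes a JSON payload of the form `{"name": ..., "arguments": {"query": ...}}`; that shape and the `parse_tool_call` helper are illustrative assumptions, and only the tool name `search_engine` comes from the list above.

```python
import json

EXPECTED_TOOL = "search_engine"  # tool name as listed above


def parse_tool_call(raw: str) -> str:
    """Extract the search query from a model-emitted tool call.

    Raises ValueError if the payload is malformed, names another tool,
    or carries no usable query. The JSON shape is an assumed schema.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc
    if not isinstance(call, dict) or call.get("name") != EXPECTED_TOOL:
        raise ValueError("unexpected tool name")
    args = call.get("arguments")
    query = args.get("query") if isinstance(args, dict) else None
    if not isinstance(query, str) or not query.strip():
        raise ValueError("missing or empty 'query' argument")
    return query.strip()


good = '{"name": "search_engine", "arguments": {"query": " ibuprofen dosage adults "}}'
parsed = parse_tool_call(good)
```

Rejecting malformed calls early makes the "incomplete or imprecise search keywords" failure mode visible in logs instead of silently returning poor evidence.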

Frameworks

  • Distill-Retrieve-Read

Is Agentic

true

Architectures

  • PULSE (base LLM used)

Optimization Features

System Optimization

  • single-machine 8x A100 feasible for PULSE

Training Optimization

  • mixed-precision (BFloat16 forward/backward, Float32 optimizer)
  • tensor parallelism and ZeRO data parallelism

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • MedicineQA is small (300 cases) and focused on 200 common medicines, so results may not generalize to all drugs or rare cases.
  • Benchmark and evaluation rely on GPT-4 and Elo scoring, which can introduce judge bias.
  • Paper does not publish code or dataset URLs in the text, limiting reproducibility.

When Not To Use

  • In high-stakes clinical decisions without human oversight or a validated local database.
  • When no structured, authoritative medicine database is available.
  • For tasks that require long-term patient records beyond single-session retrieval.

Failure Modes

  • LLM produces incomplete or imprecise search keywords, causing missed evidence.
  • Retrieved evidence lacks the exact attribute needed (low attribute HR@1).
  • Model may still hallucinate or misinterpret retrieved text when evidence is ambiguous.

Core Entities

Models

  • RagPULSE
  • PULSE
  • ChatGPT-3.5
  • Baichuan2
  • QWen2
  • MING
  • ChatGLM3
  • DoctorGLM
  • BianQue2

Metrics

  • HR@1
  • HR@5
  • HR@10
  • Elo rating

Datasets

  • MedicineQA (300 multi-round queries)
  • Synthetic distillation dataset (for fine-tuning)

Benchmarks

  • MedicineQA