Retrieve similar QA examples on the fly so LLMs write correct SPARQL without fine-tuning

July 1, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jacopo D'Abramo, Andrea Zugarini, Paolo Torroni

Why It Matters For Business

DFSL gives near state-of-the-art KGQA without dataset fine-tuning, cutting training cost and enabling faster deployment across knowledge graphs.

Summary TLDR

The paper introduces DFSL, a way to improve SPARQL generation by retrieving the k most similar question→SPARQL examples and injecting them into an LLM prompt (in-context learning). DFSL and its multi-query extension, DFSL-MQ, raise F1 substantially across Wikidata/DBpedia benchmarks, matching or beating fine-tuned systems on 3 of 4 datasets. Key caveats: DFSL relies on gold entities and relations being supplied in the prompt, it works best when the example storage contains questions similar to the target, and triple-flip errors remain on some datasets.

Problem Statement

Generating SPARQL queries from natural language is brittle. Fine-tuning helps but is expensive and generalizes poorly. The paper asks: can in-context learning (ICL, i.e. asking a large model to follow examples given in the prompt) combined with semantic retrieval of similar examples reach SOTA or near-SOTA KGQA performance without fine-tuning?

Main Contribution

Dynamic Few-Shot Learning (DFSL): retrieve k most similar question→SPARQL examples and inject them into the LLM prompt at inference time.
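The retrieval step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence encoder, and the names `embed`, `cosine`, `retrieve`, and `STORAGE`, as well as the stored examples, are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, storage, k=5):
    """Return the k stored (question, sparql) pairs most similar to `question`."""
    q_vec = embed(question)
    ranked = sorted(storage, key=lambda ex: cosine(q_vec, embed(ex[0])), reverse=True)
    return ranked[:k]

# Illustrative storage of question -> SPARQL pairs (made up for this sketch).
STORAGE = [
    ("who is the mayor of Berlin", "SELECT ?x WHERE { wd:Q64 wdt:P6 ?x }"),
    ("who is the mayor of Paris", "SELECT ?x WHERE { wd:Q90 wdt:P6 ?x }"),
    ("how tall is Mount Everest", "SELECT ?h WHERE { wd:Q513 wdt:P2044 ?h }"),
]

demonstrations = retrieve("who is the mayor of Rome", STORAGE, k=2)
```

With a real encoder, only `embed` would change; the ranking and top-k selection stay the same.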

DFSL-MQ: keep multiple beam-search SPARQL hypotheses and select answers with heuristics (First Set and Largest Set) to reduce triple-flip errors.
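The two selection heuristics can be sketched as follows, assuming that First Set returns the first non-empty answer set in beam order and that Largest Set returns the answer set with the most elements. That reading is inferred from the heuristic names; the function names and the example results are hypothetical.

```python
def first_set(answer_sets):
    """Return the first non-empty answer set in beam order, else an empty set."""
    for answers in answer_sets:
        if answers:
            return answers
    return set()

def largest_set(answer_sets):
    """Return the answer set with the most elements (earliest wins ties)."""
    return max(answer_sets, key=len, default=set())

# Hypothetical execution results for three beam hypotheses, best-scored first.
# The top hypothesis returned nothing, e.g. because of a flipped triple.
beam_results = [set(), {"Q64"}, {"Q64", "Q90", "Q220"}]

chosen_fs = first_set(beam_results)    # the second hypothesis' answers
chosen_ls = largest_set(beam_results)  # the third hypothesis' answers
```

The example also hints at the risk of Largest Set: the biggest answer set may come from an under-constrained query rather than the right one.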

Extensive evaluation and ablations across four KGQA benchmarks (QALD-9 DB, QALD-9 Plus, QALD-10, LC-QuAD 2.0) and three LLM backbones.

Key Findings

DFSL can turn an unfinetuned LLM into a competitive KGQA system.

Numbers: LC-QuAD 2.0 F1: zero-shot 38.40 → DFSL 85.45 (+47.05)

Multi-query generation plus a First-Set selection further increases accuracy.

Numbers: QALD-9 Plus F1: DFSL 76.59 → DFSL-MQ (First Set) 84.45 (+7.86 to +8.60 reported)

Entities and relations in the prompt matter a lot.

Numbers: QALD-9 DB F1: DFSL 75.14 → DFSL w/o E_q, R_q 49.59 (-25.55)

DFSL matches or beats fine-tuned SOTA on 3 of the 4 benchmarks tested.

Numbers: QALD-10 F1: TSET-base 51.37 → DFSL-MQ 62.20 (+10.83)

Results

F1 (DFSL, CodeLlama backbone)

Value: QALD-9 Plus 76.59; QALD-10 57.69; LC-QuAD 2.0 85.45; QALD-9 DB 75.14

Baseline: few-shot / zero-shot baselines

F1 (DFSL-MQ with First-Set selection)

Value: QALD-9 Plus 84.45; QALD-10 62.20; LC-QuAD 2.0 89.10; QALD-9 DB 77.89

Baseline: DFSL single-query

Ablation: missing entities/relations

Value: QALD-9 DB DFSL 75.14 → DFSL w/o E_q, R_q 49.59

Baseline: DFSL with gold E_q and R_q
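The F1 figures above compare the answers returned by a generated query with the gold answers. The sketch below shows one common way to compute such a score: per-question F1 over answer sets, averaged across questions. This summary does not state which averaging the paper uses, so the macro average and the convention that two empty sets score 1.0 are assumptions.

```python
def answer_f1(predicted, gold):
    """F1 between a predicted and a gold answer set for one question."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # assumed convention: both empty counts as a match
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs):
    """Average per-question F1 over (predicted, gold) pairs."""
    scores = [answer_f1(p, g) for p, g in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# One exact match and one complete miss average to 0.5.
score = macro_f1([({"a"}, {"a"}), ({"b"}, {"a"})])
```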

Who Should Care

What To Try In 7 Days

Assemble a storage of question→SPARQL examples from your KG training set.

Encode inputs with all-mpnet-base-v2 using 'question + entities + relations' and index by cosine similarity.

At inference, retrieve the top k=5 similar examples and append them as demonstrations to the LLM prompt (k=5 is a good trade-off).
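The last two steps can be sketched as plain string handling. The prompt wording, the field labels, and the helper names `retrieval_text` and `build_prompt` are assumptions made for illustration; the paper's actual template is not reproduced in this summary.

```python
def retrieval_text(question, entities, relations):
    """Concatenate question, entities and relations into the string to embed."""
    return " ".join([question, *entities, *relations])

def build_prompt(question, entities, relations, demonstrations):
    """Assemble a few-shot prompt: retrieved demonstrations, then the target."""
    lines = ["Translate the question into a SPARQL query.", ""]
    for demo in demonstrations:
        lines += [
            f"Question: {demo['question']}",
            f"Entities: {', '.join(demo['entities'])}",
            f"Relations: {', '.join(demo['relations'])}",
            f"SPARQL: {demo['sparql']}",
            "",
        ]
    lines += [
        f"Question: {question}",
        f"Entities: {', '.join(entities)}",
        f"Relations: {', '.join(relations)}",
        "SPARQL:",  # left open for the LLM to complete
    ]
    return "\n".join(lines)

# Illustrative demonstration, as it might come back from the retriever.
demos = [{
    "question": "who is the mayor of Berlin",
    "entities": ["wd:Q64"],
    "relations": ["wdt:P6"],
    "sparql": "SELECT ?x WHERE { wd:Q64 wdt:P6 ?x }",
}]

prompt = build_prompt("who is the mayor of Paris", ["wd:Q90"], ["wdt:P6"], demos)
```

The string returned by `retrieval_text` is what would be passed to the encoder, both when indexing the storage and when embedding a new question.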

Reproducibility

Data URLs

  • QALD-9 DB
  • QALD-9 Plus
  • QALD-10
  • LC-QuAD 2.0

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • All experiments use English datasets; multilingual behavior is untested.
  • Possible data contamination from LLM pretraining is acknowledged but not measured.
  • Only large LLMs were evaluated; small-model performance is untested.
  • The embedding strategy was limited to one sentence-transformer; other encoders and similarity metrics were not explored.

When Not To Use

  • When you cannot extract reliable entities or relations from the question, since performance drops sharply.
  • When your training storage lacks similar examples to the target questions (out-of-distribution test sets).
  • When you must run on small LLMs without prior validation; behavior is unknown.

Failure Modes

  • Triple-flip: model swaps subject/object in triples; only partially mitigated by multi-query strategies.
  • Largest-Set heuristic can pick under-constrained queries that return many irrelevant results.
  • If storage contains unrelated or noisy examples, retrieval may hurt or not help (no gain on QALD-10 when train/test distributions differ).
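To make the triple-flip failure concrete, the toy example below runs a correct and a flipped one-triple pattern against a tiny in-memory graph. The graph contents, the matcher, and all names are hypothetical. The point that a flipped triple often returns nothing is an illustration of why selecting among several hypotheses can help, not a result taken from the paper.

```python
# Toy knowledge graph as (subject, predicate, object) triples; all IDs are made up.
GRAPH = {
    ("City_A", "head_of_government", "Mayor_A"),
    ("City_B", "head_of_government", "Mayor_B"),
}

def match(pattern, graph=GRAPH):
    """Return bindings of the variable '?x' for a single triple pattern."""
    results = set()
    for triple in graph:
        binding = None
        matched = True
        for want, have in zip(pattern, triple):
            if want == "?x":
                binding = have
            elif want != have:
                matched = False
                break
        if matched and binding is not None:
            results.add(binding)
    return results

correct = ("City_A", "head_of_government", "?x")
flipped = ("?x", "head_of_government", "City_A")  # subject and object swapped

correct_answers = match(correct)  # finds the mayor
flipped_answers = match(flipped)  # nothing has City_A as its head of government
```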

Core Entities

Models

  • Mixtral 8x7B
  • Llama-3 70B
  • CodeLlama 70B

Metrics

  • F1

Datasets

  • QALD-9 DB
  • QALD-9 Plus
  • QALD-10
  • LC-QuAD 2.0

Benchmarks

  • QALD series
  • LC-QuAD 2.0