Grounding LMs with OpenMath improves math reasoning when retrieval is good

February 19, 2026 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Marcelo Labre

Links

Abstract / PDF

Why It Matters For Business

Formal ontologies can make smaller models more dependable in specialist tasks, but only when retrieval reliably finds relevant definitions; otherwise, augmentation can reduce trust and accuracy.

Summary TLDR

This paper builds a pipeline that injects formal mathematical definitions from the OpenMath ontology into small-to-medium language models (≤9B params) via hybrid retrieval and a cross-encoder reranker. Evaluation on 500 MATH problems shows gains when retrieved definitions are relevant, but irrelevant definitions actively hurt accuracy. Best-of-n sampling often recovers useful context for smaller models. The main bottleneck is retrieval quality and coverage gaps in OpenMath.

Problem Statement

Can formal domain ontologies (OpenMath) be used as reliable external knowledge for language models to reduce hallucination and improve mathematical reasoning? The work asks whether ontology-guided retrieval helps or harms models of varying sizes on a standard math benchmark.

Main Contribution

A full neuro-symbolic pipeline that maps natural-language math problems to OpenMath symbols, using concept extraction, hybrid retrieval (BM25 + dense embeddings), Reciprocal Rank Fusion (RRF), and a cross-encoder reranker.
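
The fusion step in this pipeline can be illustrated with a minimal sketch of Reciprocal Rank Fusion. It is not the paper's code: the constant `k=60` is the conventional default rather than a value reported here, and the symbol identifiers in the two example rankings are illustrative stand-ins for BM25 and dense retrieval output.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); ids are returned by descending fused score.
    k=60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Tie-break on the id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retriever outputs, best match first.
bm25_ranking = ["arith1.gcd", "arith1.lcm", "integer1.factorof"]
dense_ranking = ["integer1.factorof", "arith1.gcd", "arith1.power"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Symbols ranked well by both retrievers rise to the top, and the fused list would then be passed to the cross-encoder reranker for final scoring.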

Empirical evaluation on the MATH 500 subset using three open models (Gemma2-2B, Gemma2-9B, Qwen2.5-Math-7B) comparing baseline prompts to OpenMath-augmented prompts under threshold ablations and greedy vs best-of-n sampling.

Coverage analysis showing OpenMath alignment with MATH 500 by problem type and difficulty, and practical guidance on threshold range and inference mode for maximizing benefit.

Key Findings

OpenMath coverage is limited: only a minority of problems have high-quality matches.

Numbers: 24.2% of problems have a max reranker score ≥ 0.5; the mean max score is 0.2715

Retrieval quality determines whether ontology context helps or hurts accuracy.

Numbers: High-quality matches (score > 0.8) found for 18% of problems; low-confidence matches (max score < 0.2) for 52%

Model capacity shapes utility: the smallest model degrades in greedy mode, larger/specialized models benefit.

Numbers: Gemma2-2B: consistent negative ∆Accuracy in greedy mode; Qwen2.5-Math-7B: consistently positive ∆Accuracy

Best-of-n sampling recovers context and often flips degradation into improvement for smaller models.

Numbers: All models show positive ∆Accuracy at threshold 0.0 in best-of-n mode; Gemma2-2B reverses greedy degradation at T=0.7

Ontology augmentation can speed up answer generation, but it does not always improve correctness.

Numbers: Qwen2.5-Math-7B: attempt reductions co-occur with accuracy degradation at Level 5

Results

High-quality OpenMath coverage

Value: 24.2% of problems (max reranker score ≥ 0.5)

Mean max relevance score

Value: 0.2715 (average max reranker score)

Accuracy

Value: +3.7% ∆Accuracy at threshold 0.0 (greedy)

Best-of-n recovery

Value: All models show positive ∆Accuracy at threshold 0.0 in best-of-n mode

Baseline: Greedy mode

Who Should Care

What To Try In 7 Days

Measure your domain coverage: compute reranker relevance between your corpus and a candidate ontology to find coverage gaps.

Implement hybrid retrieval + cross-encoder reranker on a small validation set and compare baseline vs augmented prompts.

Use best-of-n decoding for small models and tune reranker threshold in the 0.3–0.5 range to balance noise and coverage.
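
The coverage measurement in the first step above could be summarized along the following lines. This is a rough sketch, assuming you have already recorded the maximum reranker score per problem; the function name and the sample scores are hypothetical, while the 0.5 and 0.2 cut-offs mirror the ones reported in the findings.

```python
from statistics import mean


def coverage_summary(max_scores, high=0.5, low=0.2):
    """Summarize ontology coverage from per-problem max reranker scores.

    high / low mirror the cut-offs quoted in the findings
    (>= 0.5 as a good match, < 0.2 as low confidence).
    """
    if not max_scores:
        raise ValueError("max_scores must not be empty")
    n = len(max_scores)
    return {
        "mean_max_score": mean(max_scores),
        "share_high": sum(s >= high for s in max_scores) / n,
        "share_low": sum(s < low for s in max_scores) / n,
    }


# Hypothetical max reranker scores for eight validation problems.
scores = [0.91, 0.05, 0.12, 0.55, 0.30, 0.08, 0.02, 0.74]
summary = coverage_summary(scores)
print(summary)
```

A large low-confidence share on your own corpus would suggest the same coverage gap the paper reports for OpenMath, and is worth knowing before investing in augmentation.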

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • OpenMath coverage is uneven; only ~24% of problems have high-quality matches.
  • Retrieval remains the main bottleneck; semantic gap between natural language and formal definitions is large.
  • Small models (<~7B) struggle to use injected context in greedy decoding.
  • Geometry and many word problems are poorly represented in OpenMath, causing noise when injected.
  • The normalization pipeline required manual fixes for ~18% of entries, which adds to the effort needed to reproduce the results.

When Not To Use

  • Do not inject ontology context for small models in greedy mode without best-of-n sampling.
  • Avoid augmentation when reranker max score is low (<0.2) or coverage is known to be poor.
  • Do not rely on faster convergence alone as evidence of correctness (false confidence risk).
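
One way to act on the low-score guidance above is to gate augmentation on the reranker score and fall back to the baseline prompt when nothing clears the threshold. The sketch below is an assumption-laden illustration: the prompt template, the function name `build_prompt`, and the default threshold of 0.4 (picked from the suggested 0.3–0.5 range) are hypothetical and not taken from the paper.

```python
def build_prompt(problem, scored_definitions, threshold=0.4):
    """Return an augmented prompt only when relevant definitions exist.

    scored_definitions: list of (definition_text, reranker_score) pairs.
    Definitions scoring below the threshold are dropped; if none remain,
    the plain baseline prompt is returned so no noise is injected.
    The prompt wording is a hypothetical template.
    """
    kept = [(text, score) for text, score in scored_definitions
            if score >= threshold]
    baseline = f"Problem: {problem}"
    if not kept:
        return baseline
    # Most relevant definitions first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    context = "\n".join(f"- {text}" for text, _ in kept)
    return f"Relevant definitions:\n{context}\n\n{baseline}"


# Hypothetical reranker output for one problem.
definitions = [
    ("gcd: the greatest common divisor of its arguments", 0.82),
    ("sin: the sine function", 0.07),
]
print(build_prompt("Find gcd(12, 18).", definitions))
```

Falling back to the baseline rather than injecting the best available low-score match is the design choice that matters here, since the findings indicate irrelevant context actively lowers accuracy.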

Failure Modes

  • Irrelevant context degrades accuracy by confusing the model ('noise injection').
  • False confidence: models converge faster but remain wrong.
  • Parametric-contextual conflict: specialized models' internal knowledge can clash with external definitions.
  • Threshold selection bias: very high thresholds can bias towards problems where baseline already performs well, reducing marginal gains.

Core Entities

Models

  • Gemma2-2B
  • Gemma2-9B
  • Qwen2.5-Math-7B
  • Qwen3-Reranker-0.6B
  • qwen3-embedding:4b

Metrics

  • Accuracy
  • Attempts
  • AttemptsRatio

Datasets

  • MATH 500 (subset of MATH)

Benchmarks

  • MATH

Context Entities

Models

  • qwen2-math:7b (used for concept extraction)
  • qwen3-embedding:4b (embeddings)

Datasets

  • OpenMath CDs (Content Dictionaries)