Grounding LMs with OpenMath improves math reasoning when retrieval is good

Overview

Decision Snapshot: Needs Validation

The paper shows a clear mechanism: when retrieval yields relevant OpenMath symbols, injecting their definitions helps reasoning; when retrieval fails, the injected definitions hurt accuracy. The evidence is empirical: results on MATH 500, with ablations over the reranker threshold and the sampling mode.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Marcelo Labre

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Formal ontologies can make smaller models more dependable in specialist tasks, but only when retrieval reliably finds relevant definitions; otherwise, augmentation can reduce trust and accuracy.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

This paper builds a pipeline that injects formal mathematical definitions from the OpenMath ontology into small-to-medium language models (≤9B params) via hybrid retrieval and a cross-encoder reranker. Evaluation on 500 MATH problems shows gains when retrieved definitions are relevant, but irrelevant definitions actively hurt accuracy. Best-of-n sampling often recovers useful context for smaller models. The main bottleneck is retrieval quality and coverage gaps in OpenMath.

Problem Statement

Can formal domain ontologies (OpenMath) be used as reliable external knowledge for language models to reduce hallucination and improve mathematical reasoning? The work asks whether ontology-guided retrieval helps or harms models of varying sizes on a standard math benchmark.

Main Contribution

A full neuro-symbolic pipeline that maps natural-language math problems to OpenMath symbols, using concept extraction, hybrid retrieval (BM25 + dense embeddings), Reciprocal Rank Fusion (RRF), and a cross-encoder reranker.

Empirical evaluation on the MATH 500 subset using three open models (Gemma2-2B, Gemma2-9B, Qwen2.5-Math-7B) comparing baseline prompts to OpenMath-augmented prompts under threshold ablations and greedy vs best-of-n sampling.
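The fusion step of the pipeline above can be sketched as follows. This is a minimal illustration of standard Reciprocal Rank Fusion, not the paper's implementation: the function name, the constant k=60 (a common default, not a value taken from the paper), and the example rankings of OpenMath symbols are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a BM25 retriever and a dense retriever.
bm25_ranking = ["arith1.plus", "relation1.eq", "arith1.times"]
dense_ranking = ["arith1.times", "arith1.plus", "calculus1.diff"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Symbols ranked well by both retrievers rise to the top of the fused list; in the pipeline described here, that fused list would then go to the cross-encoder reranker.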

Key Findings

OpenMath coverage is limited: only a minority of problems have high-quality matches.

Numbers: 24.2% of problems have a max reranker score ≥ 0.5; the mean max score is 0.2715

Practical Use: Measure ontology coverage before augmenting prompts; expect benefit only where the max reranker score is ≥ ~0.5.

Evidence Ref: Appendix A.4 (Coverage tables)
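The two coverage statistics reported above (share of problems at or above a threshold, and mean max score) could be computed roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the input is assumed to be one max reranker score per problem, and the example scores are made up rather than taken from the paper.

```python
def coverage_stats(max_scores, threshold=0.5):
    """Summarize ontology coverage from per-problem max reranker scores.

    max_scores: one float per problem, the best reranker score among
    the candidates retrieved for that problem.
    """
    if not max_scores:
        raise ValueError("max_scores must not be empty")
    covered = sum(1 for s in max_scores if s >= threshold)
    return {
        "coverage": covered / len(max_scores),
        "mean_max_score": sum(max_scores) / len(max_scores),
    }


# Made-up scores for four problems, for illustration only.
example_scores = [0.91, 0.62, 0.10, 0.05]
stats = coverage_stats(example_scores, threshold=0.5)
print(stats)
```

Running the same function at several thresholds gives a quick picture of how much of a dataset an ontology can support before any model is evaluated.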

Retrieval quality determines whether ontology context helps or hurts accuracy.

Numbers: High-quality matches (score > 0.8) found for 18% of problems; low-confidence (max < 0.2) for 52%

Practical Use: Invest in reranking and semantic retrieval first; poor retrieval can actively degrade performance.

Evidence Ref: B.6 Cross-Encoder Reranking stats

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
High-quality OpenMath coverage	24.2% of problems (max reranker score ≥ 0.5)	—	—	MATH 500	Appendix A.4 Table A.2	Appendix A.4
Mean max relevance score	0.2715 (average max reranker score)	—	—	MATH 500	Appendix A.4 Overall coverage stats	Appendix A.4

What To Try In 7 Days

Measure your domain coverage: compute reranker relevance between your corpus and a candidate ontology to find coverage gaps.

Implement hybrid retrieval + cross-encoder reranker on a small validation set and compare baseline vs augmented prompts.

Use best-of-n decoding for small models and tune reranker threshold in the 0.3–0.5 range to balance noise and coverage.
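The threshold tuning in the last step amounts to gating augmentation on reranker score. The sketch below is one way to do that, assuming candidates arrive as (symbol, definition, score) tuples; the function name, prompt wording, definitions, and scores are illustrative placeholders, not the paper's prompts or data.

```python
def build_prompt(problem, candidates, threshold=0.4, max_defs=3):
    """Return an augmented prompt only when retrieval looks relevant.

    candidates: list of (symbol, definition, reranker_score) tuples.
    Definitions scoring below `threshold` are dropped; if none remain,
    the plain baseline prompt is returned.
    """
    relevant = sorted(
        (c for c in candidates if c[2] >= threshold),
        key=lambda c: c[2],
        reverse=True,
    )[:max_defs]
    if not relevant:
        return f"Problem: {problem}\nAnswer:"
    context = "\n".join(f"- {sym}: {text}" for sym, text, _ in relevant)
    return f"Relevant definitions:\n{context}\n\nProblem: {problem}\nAnswer:"


# Placeholder candidates with made-up definitions and scores.
candidates = [
    ("arith1.plus", "Addition of numbers.", 0.82),
    ("calculus1.diff", "Derivative of a function.", 0.12),
]
augmented = build_prompt("What is 2 + 3?", candidates, threshold=0.4)
baseline = build_prompt("What is 2 + 3?", candidates, threshold=0.9)
print(augmented)
```

Falling back to the baseline prompt when nothing clears the threshold avoids injecting low-scoring definitions, which the findings above identify as the case where augmentation hurts.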

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/labrem/neus2026-labre

https://doi.org/10.5281/zenodo.18665030

Data URLs

https://github.com/OpenMath/CDs

MATH benchmark (Hendrycks et al., 2021), public dataset

Risks & Boundaries

Limitations

OpenMath coverage is uneven; only ~24% of problems have a match with max reranker score ≥ 0.5.

Retrieval remains the main bottleneck; semantic gap between natural language and formal definitions is large.

When Not To Use

For small models, do not inject ontology context under greedy decoding; pair augmentation with best-of-n sampling.

Avoid augmentation when reranker max score is low (<0.2) or coverage is known to be poor.

Failure Modes

Irrelevant context degrades accuracy by confusing the model ('noise injection').

False confidence: models converge faster but remain wrong.

Core Entities

Models

Gemma2-2B, Gemma2-9B, Qwen2.5-Math-7B, Qwen3-Reranker-0.6B, qwen3-embedding:4b

Metrics

Accuracy, Attempts, AttemptsRatio

Datasets

MATH 500 (subset of MATH)

Benchmarks

MATH

Context Entities

Models

qwen2-math:7b (used for concept extraction), qwen3-embedding:4b (embeddings)

Datasets

OpenMath CDs (Content Dictionaries)
