Overview
The paper shows a clear mechanism: when retrieval returns relevant OpenMath symbols, injecting their definitions helps reasoning; when retrieval fails, the injected definitions hurt accuracy. The evidence is empirical: results on the MATH 500 subset, with ablations over the reranker threshold and the sampling mode (greedy vs best-of-n).
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Formal ontologies can make smaller models more dependable in specialist tasks, but only when retrieval reliably finds relevant definitions; otherwise, augmentation can reduce trust and accuracy.
Summary TLDR
This paper builds a pipeline that injects formal mathematical definitions from the OpenMath ontology into small-to-medium language models (≤9B params) via hybrid retrieval and a cross-encoder reranker. Evaluation on 500 MATH problems shows gains when retrieved definitions are relevant, but irrelevant definitions actively hurt accuracy. Best-of-n sampling often recovers useful context for smaller models. The main bottleneck is retrieval quality and coverage gaps in OpenMath.
Problem Statement
Can formal domain ontologies (OpenMath) be used as reliable external knowledge for language models to reduce hallucination and improve mathematical reasoning? The work asks whether ontology-guided retrieval helps or harms models of varying sizes on a standard math benchmark.
Main Contribution
A full neuro-symbolic pipeline that maps natural-language math problems to OpenMath symbols, using concept extraction, hybrid retrieval (BM25 + dense embeddings), Reciprocal Rank Fusion (RRF), and a cross-encoder reranker.
Empirical evaluation on the MATH 500 subset using three open models (Gemma2-2B, Gemma2-9B, Qwen2.5-Math-7B) comparing baseline prompts to OpenMath-augmented prompts under threshold ablations and greedy vs best-of-n sampling.
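The fusion step in the pipeline above can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the authors' code: the function name, the smoothing constant `k=60` (a commonly used default, not stated in this summary), and the example rankings are all assumptions. The symbol ids are only illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one ranked list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the sparse (BM25) and dense retrievers.
bm25_ranking = ["arith1.plus", "arith1.times", "calculus1.diff"]
dense_ranking = ["calculus1.diff", "arith1.plus", "transc1.sin"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Symbols ranked well by both retrievers rise to the top, and the fused list would then be passed to the cross-encoder reranker for final scoring.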
Key Findings
OpenMath coverage is limited: only 24.2% of MATH 500 problems have a high-quality match (max reranker score ≥ 0.5).
Retrieval quality determines whether ontology context helps or hurts accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| High-quality OpenMath coverage | 24.2% of problems (max reranker score ≥ 0.5) | — | — | MATH 500 | Appendix A.4 Table A.2 | Appendix A.4 |
| Mean max relevance score | 0.2715 (average max reranker score) | — | — | MATH 500 | Appendix A.4 Overall coverage stats | Appendix A.4 |
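
Both rows in the table are simple aggregates over one number per problem: the highest reranker score among that problem's retrieved definitions. A minimal sketch of how such statistics could be computed is shown here; the function name and the example scores are made up for illustration and are not the paper's data.

```python
from statistics import mean


def coverage_stats(max_scores, threshold=0.5):
    """Summarize per-problem max reranker scores.

    coverage: fraction of problems whose best match scores >= threshold.
    mean_max_score: average of the per-problem best scores.
    """
    if not max_scores:
        raise ValueError("max_scores must not be empty")
    covered = sum(1 for s in max_scores if s >= threshold)
    return {
        "coverage": covered / len(max_scores),
        "mean_max_score": mean(max_scores),
    }


# Made-up scores for five problems (illustration only).
example_scores = [0.9, 0.1, 0.6, 0.2, 0.05]
stats = coverage_stats(example_scores)
print(stats)
```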
What To Try In 7 Days
Measure your domain coverage: compute reranker relevance between your corpus and a candidate ontology to find coverage gaps.
Implement hybrid retrieval + cross-encoder reranker on a small validation set and compare baseline vs augmented prompts.
Use best-of-n decoding for small models and tune reranker threshold in the 0.3–0.5 range to balance noise and coverage.
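This summary does not say how the best of the n samples is selected. As a stand-in, the sketch below assumes majority voting over the final answers (a self-consistency style rule), which is not necessarily the paper's procedure. The function name and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer among n samples.

    Returns None if every sample is empty. Ties go to the answer
    that appeared first.
    """
    cleaned = [a.strip() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


# Hypothetical final answers extracted from n=5 sampled generations.
samples = ["12", "15", "12 ", "", "12"]
chosen = majority_vote(samples)
print(chosen)
```

In practice `samples` would come from running the same augmented prompt n times with sampling enabled and extracting the final answer from each generation.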
Risks & Boundaries
Limitations
OpenMath coverage is uneven; only ~24% of problems have high-quality matches.
Retrieval remains the main bottleneck; the semantic gap between natural-language problems and formal definitions is large.
When Not To Use
Do not inject ontology context for small models under greedy decoding; pair augmentation with best-of-n sampling instead.
Avoid augmentation when reranker max score is low (<0.2) or coverage is known to be poor.
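The low-score rule above amounts to a gate in front of prompt construction. The following is a minimal sketch under stated assumptions: the function name, the prompt wording, and the choice to also drop individual definitions below the cutoff are illustrative and not taken from the paper. Only the 0.2 cutoff comes from the text.

```python
def build_prompt(problem, candidates, min_score=0.2):
    """Build a prompt, adding definitions only when retrieval looks usable.

    candidates: list of (definition_text, reranker_score) pairs.
    If the best score is below min_score, fall back to the plain prompt.
    """
    baseline = f"Problem: {problem}\nAnswer:"
    if not candidates or max(score for _, score in candidates) < min_score:
        return baseline
    # Keep only definitions at or above the cutoff, best first.
    kept = sorted(
        (pair for pair in candidates if pair[1] >= min_score),
        key=lambda pair: pair[1],
        reverse=True,
    )
    context = "\n".join(f"- {text}" for text, _ in kept)
    return f"Relevant definitions:\n{context}\n\n{baseline}"


# Hypothetical retrieved definitions with reranker scores.
weak = build_prompt("What is 2 + 2?", [("set1.union: set union", 0.1)])
strong = build_prompt(
    "What is 2 + 2?",
    [("arith1.plus: n-ary addition", 0.7), ("set1.union: set union", 0.1)],
)
print(strong)
```

Raising `min_score` trades coverage for less noise, which is the same trade-off as tuning the reranker threshold.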
Failure Modes
Irrelevant context degrades accuracy by confusing the model ('noise injection').
False confidence: models converge faster but remain wrong.

