Data Filtering Papers — Parsed & Scored for Practitioners

A simple, compute-efficient loop that generates model outputs, filters them by a learned reward, and fine-tunes the model offline to align L

0.60

0.70

18

ReST boosts model alignment with human preferences using offline data reuse, cutting the compute cost of repeated online RL sampling while improving translation quality and human-rated outputs.

Key finding

Each additional Improve step raises the model's average reward on validation.

Numbers: Figure 3: steady increases across IWSLT, WMT, Web Domain (rewards normalized 0–100)
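
The generate, filter, fine-tune loop summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate`, `reward`, and `fine_tune` are hypothetical callables standing in for the policy sampler, the learned reward model, and the offline fine-tuning step, and the list of thresholds is an assumed filtering schedule.

```python
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (prompt, model output)


def filter_by_reward(
    samples: Sequence[Sample],
    reward: Callable[[str, str], float],
    threshold: float,
) -> List[Sample]:
    """Keep only samples whose learned-reward score reaches the threshold."""
    return [s for s in samples if reward(*s) >= threshold]


def rest_loop(
    model: Any,
    prompts: Sequence[str],
    generate: Callable[[Any, str], str],      # hypothetical sampler
    reward: Callable[[str, str], float],      # hypothetical reward model
    fine_tune: Callable[[Any, List[Sample]], Any],  # hypothetical offline trainer
    thresholds: Sequence[float],              # assumed schedule, one per Improve step
    samples_per_prompt: int = 4,
) -> Any:
    # Sample once from the current model; the data is then reused offline.
    dataset = [
        (p, generate(model, p)) for p in prompts for _ in range(samples_per_prompt)
    ]
    # Each Improve step filters the same dataset and fine-tunes on what survives.
    for threshold in thresholds:
        kept = filter_by_reward(dataset, reward, threshold)
        if not kept:
            break  # nothing passes the filter; stop rather than train on nothing
        model = fine_tune(model, kept)
    return model
```

Because the sampled dataset is reused across Improve steps, the expensive generation pass happens once per outer iteration rather than once per gradient update, which is where the compute saving over online RL comes from.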

High-quality, LLM-distilled training data + Qwen2-0.5B yields top multilingual embeddings under 0.5B params

0.70

0.40

0.80

3

You can get strong multilingual retrieval and RAG performance with a compact 0.5B model by improving training data quality and using LLM-distilled synthetic examples, lowering cost vs larger models while keeping competitive accuracy.

Key finding

KaLM-embedding-mini-instruct is state-of-the-art for multilingual embeddings under 1B parameters on MTEB.

Numbers: MTEB avg 62.3; zh 64.13; en 64.94; fr 63.08; pl 57.05

Adaptive system that detects and masks personal data to meet GDPR and CCPA rules

0.70

0.45

0.60

1

Automated, policy-aware PII detection reduces legal risk and audit effort while preserving data utility for ML pipelines.

Key finding

Passport number detection outperforms other tools on evaluated benchmarks.

Numbers: OneShield F1=0.95 (Bench1); Presidio 0.33; Comprehend 0.54

Automatically find and remove hallucinations in machine-generated visual instructions to make multi-modal LLMs more accurate.

0.60

0.50

1

Cleaning synthetic visual instruction data cuts hallucinations and raises real-world reliability of multimodal models, reducing downstream errors and the need for runtime correction.

Key finding

Machine-generated LLaVA data cause frequent hallucinations in tuned MLLMs.

Numbers: 32.6% sentence-level CHAIR_obj when fine-tuned on LLaVA (Table 2).
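
For readers unfamiliar with the metric, a sentence-level CHAIR-style score is the fraction of generated captions that mention at least one object not present in the image. The sketch below assumes object mentions have already been extracted and normalized per caption; the function name and input format are illustrative, and the exact variant reported in Table 2 may differ.

```python
from typing import Iterable, Sequence, Set


def chair_sentence_level(
    mentioned: Sequence[Iterable[str]],
    ground_truth: Sequence[Iterable[str]],
) -> float:
    """Fraction of captions containing at least one hallucinated object.

    mentioned[i]    : objects named in caption i (assumed pre-extracted)
    ground_truth[i] : objects annotated as present in image i
    """
    if len(mentioned) != len(ground_truth):
        raise ValueError("one ground-truth object set is needed per caption")
    if not mentioned:
        return 0.0
    hallucinated = 0
    for objs, truth in zip(mentioned, ground_truth):
        extra: Set[str] = set(objs) - set(truth)
        if extra:  # any object not in the image counts the whole caption
            hallucinated += 1
    return hallucinated / len(mentioned)
```

For example, with three captions where only the first names an object missing from its image, the score is 1/3.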

Pick fine‑tuning data by clustering loss curves of a small proxy model

0.70

0.60

0.80

1

S2L can cut fine‑tuning data by up to ~89% on the evaluated math tasks and halve data/train time in clinical summarization, lowering compute, storage, and labeling costs while keeping or improving accuracy.

Key finding

S2L matches full MathInstruct performance using only ~11% of the data.

Numbers: 11% of MathInstruct (~30K of 262K)
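
The selection idea, clustering per-example loss curves from a small proxy model and then sampling across clusters, can be sketched as below. This is a simplified illustration, not the authors' code: the tiny k-means, the farthest-point initialization, and the round-robin sampling are assumptions chosen to keep the example dependency-free.

```python
import random
from typing import List, Sequence

Curve = Sequence[float]  # loss of one training example at successive checkpoints


def sqdist(a: Curve, b: Curve) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(curves: Sequence[Curve], k: int, iters: int = 20) -> List[int]:
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [list(curves[0])]
    while len(centers) < k:
        far = max(curves, key=lambda c: min(sqdist(c, m) for m in centers))
        centers.append(list(far))
    assign = [0] * len(curves)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sqdist(c, centers[j])) for c in curves]
        for j in range(k):
            members = [curves[i] for i, a in enumerate(assign) if a == j]
            if members:  # leave an empty cluster's center where it was
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_balanced(curves: Sequence[Curve], k: int, budget: int, seed: int = 0) -> List[int]:
    """Pick `budget` example indices, drawing round-robin from each cluster."""
    rng = random.Random(seed)
    assign = kmeans(curves, k)
    clusters = [[i for i, a in enumerate(assign) if a == j] for j in range(k)]
    for members in clusters:
        rng.shuffle(members)
    chosen: List[int] = []
    # Round-robin keeps small clusters represented in the final subset.
    while len(chosen) < budget and any(clusters):
        for members in clusters:
            if members and len(chosen) < budget:
                chosen.append(members.pop())
    return sorted(chosen)
```

The round-robin draw is the design choice that matters here: sampling proportionally to cluster size would mostly reproduce the dominant learning pattern, while balanced sampling keeps rarer loss behaviors in the subset.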

Use a few verified examples plus public LoRA models and instructions to cheaply build task experts via a diversity-aware mixture-of-experts

0.70

0.60

0.70

0

You can build task-specialist LLMs cheaply by reusing public LoRA adapters and a handful of verified examples, cutting data collection and compute vs full finetuning while gaining measurable accuracy improvements.

Key finding

The proposed pipeline yields higher average accuracy than strong MoE baselines on the tested tasks.

Numbers: LLaMA2-7B avg 52.50% vs Arrow 50.68% (+1.82); Mistral-7B avg 72.77% vs Arrow 71.53% (+1.24)

CLEANER: replace failed in-rollout code with model self-corrections to purify trajectories and speed agentic RL

0.60

0.65

0.70

0

CLEANER reduces rollout noise so small, cheaper LLMs learn tool use faster. That lowers compute cost and shortens training cycles while keeping competitive performance.

Key finding

Purified trajectories raise AIME accuracy for a 4B model.

Numbers: AIME24 Pass@1: 66.7 -> 72.7 (+6.0)

Use a multi-agent LLM pipeline to synthesize 30–90K high‑quality math QA pairs that let 3–8B models match or beat models trained on 400K–2.3M examples

0.60

0.70

0

You can cut synthetic-data volume by an order of magnitude and keep or improve model math performance. That lowers labeling costs and GPU training time while enabling smaller models to reach stronger production math competence.

Key finding

AgenticMath produces competitive performance with far less data.

Numbers: 30K–90K AgenticMath vs 400K–2.3M baselines (Table 2)

RedWhale: adapt an English LLM to Korean with small-data continual pretraining and tokenizer tweaks

0.60

0.40

0.70

0

You can adapt an English LLM to Korean with far less compute by filtering data, using a Korean-aware tokenizer, initializing new tokens smartly, and training in stages. This lowers cost and enables deployments for teams without massive GPU clusters.

Key finding

RedWhale's fine-tuned model (SFT) achieves KoBEST average 80.83%, slightly above EEVE's 79.42% on evaluated tasks.

Numbers: KoBEST AVG: RedWhale-SFT 0.8083 vs EEVE-SFT 0.7942

Models can memorize benchmarks in other languages and still cheat English leaderboards

0.45

0.70

0.50

0

Benchmarks can be silently leaked across languages, inflating model claims. Audit multilingual training data and use generalization checks before productizing a model.

Key finding

Cross-lingual contamination raises benchmark scores substantially.

Numbers: LLaMA3-8B MMLU: 63.82% → 80.62% (Spanish)

Filter noisy ASR correction pairs by two likelihood tests and train the model to be conservative

0.70

0.50

0.60

0

Train error-correction (EC) models to be conservative on noisy auto-paired data so they avoid risky, domain-blind edits that degrade real-world ASR. This reduces the error rate in out-of-domain (OOD) scenarios without collecting new labeled data.

Key finding

Unfiltered EC training worsens OOD CER due to overcorrection.

Numbers: Avg CER 11.84 → 12.51; %EC 43.0% (Swallow-Mistral, Table 2)
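
To make the overcorrection effect concrete, character error rate (CER) is the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a standard Levenshtein-based computation, not the paper's evaluation script, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

If the ASR output already matches the reference, any edit the correction model makes can only raise CER, which is why a conservative model that leaves uncertain inputs alone helps on out-of-domain data.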

Use GPT-3.5 to clean MTNT targets and build C-MTNT, a stronger noise benchmark

0.60

0.50

0.60

0

Cleaning noisy reference translations with an LLM yields low-noise evaluation sets that better reveal whether models truly handle noisy input; this helps teams avoid optimistic robustness claims and focus training effort where it actually helps.

Key finding

Bilingual and translation cleaning reduce target-side noise much more than the rule-based correction tool for EN and FR.

Numbers: EN spell/gram per 100 toks: MTNT 1.712 → Bilingual 0.687; FR: MTNT 7.125 → Bilingual 0.552 (Table 2)

ProbDPP: pick diverse data that’s also likely to arrive, and learn reliabilities online

0.60

0.50

0

When some data sources are unreliable, selecting only diverse items can backfire. ProbDPP improves downstream QA and prompt quality by preferring items that are both diverse and likely to be available, reducing wasted context budget under noisy links or flaky tools.

Key finding

Naive expected log-det collapses under independent Bernoulli dropouts.

Send tasks as tiny label payloads: train clients from a shared image pool using <1 MB

0.70

0.60

0.80

0

If clients can store a shared unlabeled image pool, servers can deliver new classification tasks with tiny label-only payloads (<1 MB). This cuts recurring transfer costs drastically and enables operation over very low-bandwidth links.

Key finding

Task transfer with payloads well below 1 MB is practical.

Numbers: Zstd-compressed payload at 1% keep: 85–206 KB (Table 4)
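
A label-only payload of this kind can be sketched as a list of (pool index, class label) pairs that is packed and compressed before sending. The wire format below is an assumption for illustration, not the paper's: indices are 32-bit, labels fit in one byte, and `zlib` from the standard library stands in for Zstd.

```python
import struct
import zlib
from typing import Dict

# Assumed record layout: little-endian uint32 pool index + uint8 class label.
RECORD = struct.Struct("<IB")
HEADER = struct.Struct("<I")  # number of records


def encode_task(labels: Dict[int, int]) -> bytes:
    """Pack {pool_index: label} for the kept images and compress it."""
    items = sorted(labels.items())  # sorted indices tend to compress better
    raw = HEADER.pack(len(items)) + b"".join(RECORD.pack(i, y) for i, y in items)
    return zlib.compress(raw, 9)  # stand-in for Zstd


def decode_task(payload: bytes) -> Dict[int, int]:
    """Recover the {pool_index: label} mapping on the client."""
    raw = zlib.decompress(payload)
    (count,) = HEADER.unpack_from(raw, 0)
    body = raw[HEADER.size:HEADER.size + RECORD.size * count]
    return {i: y for i, y in RECORD.iter_unpack(body)}
```

The client looks up each index in its locally stored image pool and trains on the resulting labeled subset, so no image bytes cross the link.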

WONDA: turn noisy verifier invariants into compact, verified training data that makes small models match big LLMs for loop-invariant tasks

0.70

0.60

0.70

0

Curating verifier outputs yields compact, verified training data so small models can match large-model verification performance, lowering inference cost and enabling faster, parallel verification pipelines.

Key finding

WONDA-curated dataset size and makeup

Numbers: 7,283 samples; Grade2=4,516; Grade3=2,767