Dataset Construction Papers — Parsed & Scored for Practitioners

A 50B-parameter LLM trained on a ~700B-token corpus, specialized for financial NLP

0.60

0.45

0.80

299

A mid-size LLM trained on a large curated finance corpus improves accuracy on finance tasks while staying useful on general tasks, so firms can get domain accuracy without running much larger models.

Key finding

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 709B tokens; trained on 569B tokens
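
The token figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded numbers from this entry; the function name and dictionary keys are illustrative, not from the paper. The rounded inputs sum to 708B, so the ≈709B total presumably reflects unrounded counts (an assumption).

```python
# Token counts in billions, copied from the "Numbers" line of this entry.
FINANCIAL_TOKENS_B = 363
PUBLIC_TOKENS_B = 345
TRAINED_TOKENS_B = 569


def corpus_summary(financial_b: float, public_b: float, trained_b: float) -> dict:
    """Return corpus size, finance share, and the fraction of the corpus consumed."""
    total_b = financial_b + public_b
    return {
        "total_b": total_b,
        "finance_share": financial_b / total_b,
        "corpus_fraction_seen": trained_b / total_b,
    }


summary = corpus_summary(FINANCIAL_TOKENS_B, PUBLIC_TOKENS_B, TRAINED_TOKENS_B)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

On these rounded inputs, roughly half the corpus is financial text and training consumed about four fifths of the available tokens, i.e. less than one full pass.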

ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

0.60

0.70

40

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Key finding

ChemData provides 7M instruction Q&A pairs for chemistry tuning.

Numbers: 7M instruction Q&A (authors' dataset summary)

A practical review of where LLM bias comes from, how to test it, and common fixes

0.50

0.30

0.60

13

Biased LLM outputs can cause legal risk, reputational harm, and unfair customer outcomes; fixing bias early reduces downstream remediation cost and regulatory exposure.

Key finding

Toxicity can emerge quickly from benign prompts in generative LLMs.

Numbers: toxicity score > 0.5 within fewer than 100 generations

TOWER: open LLaMA-2 based multilingual models tuned for translation workflows and competitive with closed LLMs

0.70

0.30

0.60

6

You can run an open 13B model that matches or beats other open models for translation and outperforms closed models on NER and post-editing in some settings, reducing vendor lock-in and inference cost while enabling customization.

Key finding

TOWERINSTRUCT-13B is the best open model for translation and is close to GPT-4 on standard benchmarks.

Numbers: FLORES-200 COMET-22: TOWERINSTRUCT-13B 88.88 vs GPT-4 89.13

Fine-tuned Chinese LLM that answers mental-health Q&A using a CBT (therapeutic) response structure

0.30

0.50

0.20

6

Fine-tuning LLMs with therapy‑structured prompts creates more structured, CBT‑aligned replies for Chinese mental‑health Q&A; useful for building triage assistants and clinician support tools but not a replacement for professionals.

Key finding

Created a CBT QA dataset with 22,327 entries.

Numbers: 22,327 entries (Table 1)

MARBLE: a unified benchmark for music audio representations across 18 tasks

0.60

0.50

6

MARBLE gives a single, reproducible way to measure how well audio features transfer to many music tasks, helping teams pick pretrained models or prioritize fine-tuning where it's most needed.

Key finding

MARBLE unifies 18 tasks across 12 datasets to evaluate music representations.

Numbers: 18 tasks; 12 datasets (Table 1).

Use ChatGPT to generate paraphrases and improve open-intent detection on compositionally different test sets

0.40

0.50

6

Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.

Key finding

Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on compositionally-challenging Banking_CG.

Numbers: F1-All: 54.87 → 58.90 (+4.03)

Autonomously collect a single rollout to train a NeRF for rendering, mapping and navigation

0.60

0.50

0.40

5

AutoNeRF can automate 3D scene capture for robot deployment, cutting manual data collection and enabling safe simulation-based finetuning of navigation policies from a single short rollout.

Key finding

Modular exploration trained for obstacle/viewpoint coverage yields better RGB rendering than Frontier or E2E RL.

Numbers: PSNR 25.56 (Ours obs.) vs 19.75 (Frontier) on uniform scene poses

CodeS: open-source 1B–15B models that match or beat much larger LLMs on text-to-SQL benchmarks

0.80

0.60

0.75

5

CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.

Key finding

Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.

Numbers: CodeS-15B 5-shot Spider TS 73.4% vs StarCoder-15B 70.0% (Table 4)

Use LLM token embeddings plus optional summarization to map job text to standardized occupation codes

0.55

0.65

0.60

5

LLM4Jobs gives a practical unsupervised route to map job text to standard codes with better accuracy than off-the-shelf rule tools, lowering annotation cost and enabling downstream analytics and recommendation systems.

Key finding

LLM4Jobs outperforms unsupervised baselines on evaluated datasets.

Numbers: GenEasy (Level 3) HR@1: LLM4Jobs 0.724 vs CASCOT 0.380 vs GPT-4 0.476
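
HR@1 is usually read as "hit rate at 1": the share of job texts whose top-ranked code equals the gold code. The sketch below shows that standard definition under the assumption that the paper uses it; the function name and the occupation codes in the example are hypothetical placeholders.

```python
from typing import Sequence


def hit_rate_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 1) -> float:
    """Fraction of items whose gold label appears in the top-k ranked predictions."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for preds, label in zip(ranked, gold) if label in preds[:k])
    return hits / len(gold)


# Hypothetical ranked predictions for three job postings (codes are placeholders).
predictions = [["2512", "2511"], ["3322", "5223"], ["2411", "2412"]]
gold_codes = ["2512", "5223", "2412"]

hr_at_1 = hit_rate_at_k(predictions, gold_codes, k=1)
hr_at_2 = hit_rate_at_k(predictions, gold_codes, k=2)
print(hr_at_1, hr_at_2)
```

Only the first posting has its gold code ranked first, while all three have it within the top two, which is why HR@k can only stay flat or rise as k grows.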

Iteratively generate and verify domain instructions (MatSci-Instruct) to finetune LLaMA into HoneyBee, a materials-science LLM

0.60

0.50

5

You can cheaply create domain-ready LLMs by synthesizing and verifying instruction data, avoiding costly domain pretraining while getting strong task performance.

Key finding

Automatically verified instruction scores correlate well with human experts.

Numbers: Spearman/Pearson correlations 0.6–0.8 vs humans (Fig. 4)

Lingshu: a medical multimodal foundation model trained on curated medical+general data with MedEvalKit evaluation

0.55

0.50

4

A carefully curated multimodal medical dataset plus staged tuning produces practical, near-proprietary medical QA and reporting performance while enabling smaller, cheaper models for deployment.

Key finding

Training data scale and mix: 3.75M open-source medical samples + 1.30M synthetic medical samples.

Numbers: 3.75M open + 1.30M synthetic (§2.3)

600k-chart instruction data + a human benchmark to improve multimodal chart QA

0.60

0.45

4

Automate chart reading and QA by fine-tuning multimodal LLMs with domain-specific chart instructions; expect better classification and reasoning but not perfect numeric table extraction.

Key finding

Large instruction corpus improves open-source LMMs on chart tasks.

Numbers: MMCA overall free-form 0.26 vs prior open-source best 0.24 (Table 4)

ECInstruct dataset + eCeLLM models: instruction-tuned LLMs that beat GPT‑4 on many e‑commerce tasks

0.70

0.60

4

A single instruction‑tuned LLM trained on ECInstruct can replace many task‑specific models, improve handling of new products, and reduce engineering cost by centralizing e‑commerce functionality in one adaptable model.

Key finding

Instruction‑tuned eCeLLM models beat the best baselines on in‑domain tests by an average of 10.7%.

Numbers: IND average improvement = 10.7% (Table 3)

Survey: how data choices shape multimodal LLMs — pipelines, filters, and open gaps

0.60

0.40

0.70

4

Better data curation reduces compute and improves multimodal model reliability; selective filtering and high-quality instruction data can cut costs while keeping most performance.

Key finding

Mixing image-caption, interleaved image-text, and text-only data at a 5:5:1 ratio gave best overall vision-language pretraining in a referenced study.

Numbers: ratio 5:5:1 reported by MM1
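
A 5:5:1 ratio translates into sampling probabilities of 5/11, 5/11, and 1/11. The sketch below shows one simple way to draw training examples by source at that ratio; it is an illustration only, not MM1's data loader, and the source names and function names are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List

# Ratio from the entry above; the source names are illustrative labels.
MIX_RATIO = {"image_caption": 5, "interleaved_image_text": 5, "text_only": 1}


def mixture_probabilities(ratio: Dict[str, int]) -> Dict[str, float]:
    """Normalize integer ratio parts into sampling probabilities."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio: Dict[str, int], num_samples: int, seed: int = 0) -> List[str]:
    """Draw a source name for each training example, weighted by the ratio."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


probs = mixture_probabilities(MIX_RATIO)
counts = Counter(sample_sources(MIX_RATIO, num_samples=11_000))
print(probs)
print(counts)
```

With 11,000 draws, each of the two image sources should land near 5,000 examples and text-only near 1,000, with small random deviations.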

Panda LLM: small, diverse Chinese instruction data (4.2%) sharply boosts LLaMA-based model reasoning

0.60

0.40

0.70

4

A small, curated instruction dataset can cheaply improve Chinese LLM reasoning; you can boost model utility without retraining on massive new corpora.

Key finding

Instruction-tuning on COIG raised reasoning scores across benchmarks.

Numbers: LogiQA: 27.41 → 31.93 (+4.52); C3-d: 43.02 → 47.30 (+4.28); C3-m: 43.66 → 57.04 (+13.38)

Youku-mPLUG: 10M filtered Chinese video-text pairs plus human benchmarks and models

0.60

0.50

0.70

4

Youku-mPLUG provides a large, safety-filtered Chinese video-text corpus and benchmarks so teams can train or fine-tune Chinese multimodal models faster and compare results fairly.

Key finding

Pretraining on Youku-mPLUG substantially improves category classification.

Numbers: Top-1: 63.51% → 78.15% (+14.64 points; +23.1% relative)
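
The relative figure is easy to confuse with a gain in percentage points. A short sketch of both calculations, using only the two Top-1 values from this entry (the function names are illustrative):

```python
def absolute_gain(before: float, after: float) -> float:
    """Gain in the metric's own units (percentage points here)."""
    return after - before


def relative_gain_pct(before: float, after: float) -> float:
    """Gain expressed as a percentage of the starting value."""
    return (after - before) / before * 100


top1_before, top1_after = 63.51, 78.15
print(f"absolute: +{absolute_gain(top1_before, top1_after):.2f} points")
print(f"relative: +{relative_gain_pct(top1_before, top1_after):.1f}%")
```

The same accuracy change is 14.64 points in absolute terms and about 23.1% relative to the baseline, so comparisons across entries should use one convention consistently.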

High-quality, LLM-distilled training data + Qwen2-0.5B yields top multilingual embeddings under 0.5B params

0.70

0.40

0.80

3

You can get strong multilingual retrieval and RAG performance with a compact 0.5B model by improving training data quality and using LLM-distilled synthetic examples, lowering cost vs larger models while keeping competitive accuracy.

Key finding

KaLM-embedding-mini-instruct is state-of-the-art for multilingual embeddings under 1B parameters on MTEB.

Numbers: MTEB avg 62.3; zh 64.13; en 64.94; fr 63.08; pl 57.05

Open audit and pipeline show 1%–46% test-set leakage and uneven score inflation across six popular benchmarks

0.60

0.30

0.40

3

Contaminated benchmarks can inflate model metrics and lead teams to pick models that simply memorised test examples rather than truly generalising.

Key finding

Contamination varies strongly by benchmark.

Numbers: C-Eval 45.8%; MMLU 29.1%; HellaSwag 12.4%; ARC 28.7%; CommonsenseQA 1.6%; Winogrande 1.1%
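
Per-benchmark contamination rates like these are typically estimated by checking test items for overlap with a training corpus. The sketch below uses a simple word n-gram overlap test; this is a common heuristic and an assumption here, not the audit pipeline from the paper. The function names, the n-gram length, and the toy texts are all illustrative.

```python
import re
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams of a text (empty set if the text is shorter than n)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: Iterable[str], corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    items = list(test_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / len(items)


# Toy data; n=3 is used only so the short example strings can overlap.
corpus = ["the quick brown fox jumps over the lazy dog"]
test_set = ["A quick brown fox appears", "Completely unrelated question here"]
rate = contamination_rate(test_set, corpus, n=3)
print(f"{rate:.1%}")
```

Exact n-gram matching misses paraphrased or translated leakage, so a rate computed this way is best treated as a lower bound.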

FinTral: a 7B multimodal financial LLM + FinSet dataset that rivals GPT-4 on many finance tasks

0.60

3

A 7B finance-specialized LLM, paired with retrieval and tool use, can deliver near-GPT-4 accuracy on many finance tasks at lower model cost and with controllable hallucination rates, enabling in-house deployment where data control and latency matter.

Key finding

FinTral-DPO-T&R reaches an average score of 0.70 on evaluated text tasks with tools and retrieval.

Numbers: Avg 0.70 (Table 6)

FedLLM-Bench: first realistic, user-split benchmark for federated fine-tuning of LLMs

0.65

0.50

0.60

3

FedLLM-Bench gives engineering teams ready, realistic user-split data and baselines so they can test federated fine-tuning, compare FL optimizers, and measure privacy/utility trade-offs without building custom datasets.

Key finding

Federated training improves average instruction-following compared to local-only training.

Numbers: Fed-ChatbotIT average score: Local 5.00 → FedAvg 5.51 (Δ +0.51) on open metrics
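
FedAvg, the method named in the numbers above, aggregates client models by averaging their parameters, weighted by each client's number of training examples. The sketch below shows only that aggregation step on plain lists of floats; the benchmark's actual training setup is not described in this entry, and the function name and toy values are illustrative.

```python
from typing import List


def fedavg(client_weights: List[List[float]], client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two toy clients: the second holds three times as much data, so it dominates.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
print(global_weights)
```

In a full federated round, the server would broadcast the averaged parameters back to clients, each client would train locally, and the aggregation would repeat.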

Graphusion: zero-shot LLM pipeline that builds and fuses scientific concept graphs for NLP tutoring

0.60

3

Graphusion cuts expert labeling by using LLMs plus a fusion step to build domain concept graphs, which can immediately improve tutoring and QA services without large supervised datasets.

Key finding

LLM zero-shot link prediction with retrieval outperforms supervised baselines on LectureBankCD (NLP).

Numbers: GPT-4o (RAG) Accuracy 0.8117 vs BERT 0.7088 (+0.1029)

Use vision-language models to auto-generate and iteratively correct multimodal instruction data

0.70

0.60

0.50

3

VIGC can cheaply scale multimodal instruction data and improve model performance on perception and knowledge VQA tasks, reducing the need for costly human annotation while trimming hallucinations through an automated correction loop.

Key finding

Fine-tuning with VIGC COCO data improved LLaVA-7B overall score.

Numbers: Overall 81.0 → 85.8 (↑4.8)

GlassLLaVA: a vision-language model that interprets SEM images of glass using paper text and GPT-4–generated Q&A

0.40

0.60

0.40

3

Pairing image encoders with LLMs can automate interpretation of lab SEM images and speed defect triage, but the model needs context and domain-specific data to reach reliable accuracy.

Key finding

Context strongly improves answer quality.

Numbers: General: 68.84 (no context) → 92.56 (high context)