A short review plus a simple scoring formula to judge LLM output quality

January 23, 2024 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.35

Cost Impact Score

0.5

Citation Count

4

Authors

Rick Rejeleene, Xiaowei Xu, John Talburt

Why It Matters For Business

Low information quality in LLM outputs can cause bad decisions, legal risk, and user distrust; measuring and filtering quality reduces downstream risk and saves money on remediation.

Summary TLDR

This paper reviews how data quality, tokenization, and training scale drive trust problems in large language models (LLMs). It proposes a simple, domain‑agnostic information quality score that combines accuracy, consistency and relevance as weighted factors. The authors survey tokenizers, datasets, scaling laws (Chinchilla, Broken Neural Scaling Laws), and practical mitigations such as filtering, de-duplication, human feedback, and retrieval-based checks. No new experiments or code are released.
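
The weighted score described above can be sketched in a few lines of Python. This is an illustration only, since the paper releases no code. The class name `IQWeights`, the function name `iq_score`, the default weights, the requirement that weights sum to 1, and the assumption that each dimension is already scored in [0, 1] are all hypothetical choices made here, not details from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IQWeights:
    """Hypothetical weights for the three dimensions (assumed to sum to 1)."""
    accuracy: float = 0.5
    consistency: float = 0.3
    relevance: float = 0.2


def iq_score(accuracy: float, consistency: float, relevance: float,
             weights: IQWeights = IQWeights()) -> float:
    """Return the weighted sum of three scores, each assumed to lie in [0, 1]."""
    scores = {"accuracy": accuracy, "consistency": consistency, "relevance": relevance}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = weights.accuracy + weights.consistency + weights.relevance
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (weights.accuracy * accuracy
            + weights.consistency * consistency
            + weights.relevance * relevance)
```

How each of the three inputs is measured is left open here; the sketch only shows the combination step. Tuning the weights per domain (for example, weighting accuracy more heavily for legal or medical text) is the knob the metric exposes.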

Problem Statement

LLMs produce useful text but also unreliable, biased or fabricated outputs. The paper argues information quality failures trace to training data issues (noise, bias, tokenization, duplication), scaling choices, and gaps in verification. It proposes a compact, tunable metric to quantify information quality of generated text.

Main Contribution

Survey of data-quality drivers of LLM trust: tokenization, bias, duplication, dataset mix, and scaling effects.

Proposal of a simple, linear information-quality metric combining accuracy, consistency and relevance with tunable weights.

Discussion of practical data-preprocessing steps (filtering, deduplication, privacy removal) and existing defenses (RLHF, retrieval, hallucination detectors).

Summary of scaling laws (Chinchilla, Broken Neural Scaling Laws) and their implications for dataset vs model sizing.

Key Findings

LLM information quality can be expressed as a weighted sum of three dimensions: accuracy, consistency, relevance.

Large-scale training and dataset choice matter: GPT-3 is cited as having 175B parameters and being trained on ~570 GB of text.

Numbers: 175B params; 570 GB training data

Commonsense QA performance of LLMs is substantially below humans on cited datasets.

Numbers: LLM ~55.9% vs human ~89% on Commonsense QA

Noisy or duplicated training data degrades model quality; removing duplicates improves performance (cited work).

Numbers: SBNATION example: validation perplexity 33.34, BLEU 1.78 on noisy data

Results

Accuracy

Value: 55.9% (LLMs) vs 89% (humans)

Validation perplexity / BLEU on noisy SBNATION data

Value: Perplexity 33.34; BLEU 1.78

Model scale example

Value: GPT-3: 175B parameters; 570 GB data

Who Should Care

What To Try In 7 Days

Implement the paper's simple IQ score (weights for accuracy, consistency, relevance) to triage outputs.

Run a quick audit of tokenization settings and de-duplicate your training/ingestion corpora.

Add a retrieval or browser-check step (search or WebGPT-style) for high-impact queries.
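
For the de-duplication step in the list above, a minimal starting point is exact-match removal after light normalization. The sketch below is an assumption about how one might begin, not the method used in the work the paper cites; the function names `normalize` and `deduplicate` are hypothetical, and near-duplicate detection is a separate technique that is not shown.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing the normalized text keeps memory use proportional to the number of distinct documents rather than their total size, which matters once a corpus no longer fits comfortably in memory.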

Optimization Features

Token Efficiency

  • Highlights that the choice of tokenizer (BPE, WordPiece, unigram, character) affects sequence length and how meaning is segmented
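
To make the length effect concrete, the toy comparison below counts tokens for the same string under two extreme schemes. It is only an illustration: the helper names are hypothetical, whitespace splitting is a crude stand-in for a coarse vocabulary, and real subword tokenizers such as BPE or WordPiece (not implemented here) typically land somewhere between the two counts.

```python
def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: one token per character."""
    return list(text)


def word_tokens(text: str) -> list[str]:
    """Whitespace tokenization, a crude stand-in for a coarse vocabulary."""
    return text.split()


sample = "Tokenization changes sequence length."
print(len(char_tokens(sample)), "character tokens")
print(len(word_tokens(sample)), "word tokens")
```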

Infra Optimization

  • Notes high compute and energy costs; suggests sparsely activated experts to reduce cost

Model Optimization

  • Sparsely activated Mixture of Experts (MoE)

Training Optimization

  • Emphasizes compute-data scaling trade-offs (Chinchilla compute-optimal regime)
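
The compute-data trade-off can be made tangible with back-of-the-envelope arithmetic. The sketch below relies on two commonly quoted approximations that do not appear in this summary and are assumptions here: roughly 20 training tokens per parameter in the compute-optimal regime, and training compute of about 6 FLOPs per parameter per token. Function and constant names are hypothetical.

```python
TOKENS_PER_PARAM = 20.0       # assumed rule of thumb, not a figure from this summary
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed approximation: C ~ 6 * N * D


def compute_optimal_tokens(n_params: float) -> float:
    """Estimate the training-token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training FLOPs for n_params parameters and n_tokens tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens
```

Under these assumptions, a 175B-parameter model would call for about 3.5 trillion training tokens (`compute_optimal_tokens(175e9)`), which illustrates why dataset size, and not only parameter count, drives the budget.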

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No experiments or released code; proposals are conceptual and need validation.
  • Proposed IQ metric is linear and simple; may miss other quality dimensions (safety, privacy, timeliness).
  • Paper depends largely on cited work; limited original empirical evidence.

When Not To Use

  • Do not use the paper's IQ metric as a proven filter in safety-critical systems without validation.
  • Do not treat the survey as a benchmark of model performance; it summarizes literature rather than measuring models.

Failure Modes

  • IQ weights may be chosen poorly, letting biased or fluent-but-false text pass.
  • Tokenization mismatches can introduce unseen subword artifacts and reduce factual accuracy.
  • Relying on RLHF alone leaves gaps: costly to maintain and sensitive to label quality.

Core Entities

Models

  • GPT-3
  • GPT-4
  • ChatGPT
  • BERT
  • LLaMA / Llama2
  • Gopher
  • PaLM
  • BLOOM
  • BART
  • T5
  • GLaM

Metrics

  • perplexity
  • BLEU
  • Accuracy
  • negative log-likelihood (loss)
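
Two of the metrics above are directly related: perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship with a hypothetical helper, `perplexity`, that takes the probabilities a model assigned to each observed token; it assumes natural logarithms and is not tied to any particular model or library.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(not 0.0 < p <= 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

A model that assigns probability 0.25 to every observed token has a perplexity of 4, which matches the intuition of choosing uniformly among four options at each step.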

Datasets

  • CommonCrawl
  • WebText / WebText2
  • Books1/Books2
  • Wikipedia
  • The Pile
  • SQuAD
  • GLUE
  • Reddit corpus
  • GitHub code
  • arXiv / scientific corpus
  • WuDaoCorpora

Benchmarks

  • TruthfulQA
  • HaluEval
  • CrowS-Pairs
  • GLUE
  • SQuAD
  • Commonsense QA