DUAL: pick samples that are both representative and uncertain to label fewer summaries more effectively

March 2, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

DUAL is a practical hybrid sampling recipe with reproducible code and multi-model experiments; its gains are consistent but modest and depend on embedding quality and model choice.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 60%

Authors

Petros Stylianos Giouroukis, Alexios Gidiotis, Grigorios Tsoumakas

Why It Matters For Business

DUAL cuts labeling waste by choosing documents that are both representative and informative for the model. This improves robustness and lowers selection compute compared with full uncertainty-based selection.

Summary TLDR

DUAL is a simple active-learning method for abstractive summarization that first picks a small diverse set of candidate documents (via embeddings), then ranks those by model uncertainty (BLEU variance with MC dropout), discards extreme-noise candidates, and mixes in random samples. Across 3 summarization models and 4 datasets, DUAL usually matches or improves over pure uncertainty, pure diversity, and random sampling while selecting fewer outliers and lowering sample-selection time versus full uncertainty-based selection. Code and datasets are public.
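The four steps in the summary above (diverse candidates, uncertainty ranking, noise cap, random mix-in) can be sketched as a single selection step. This is a minimal illustration, not the authors' implementation: the function name `dual_select`, the callables `idds_score` and `bleuvar`, and the way the step size is split between guided and random picks are all assumptions made for this example.

```python
import random


def dual_select(unlabeled, idds_score, bleuvar, k, s, p, tau, excluded=None, seed=0):
    """Pick s document ids for labeling in one step (hypothetical sketch).

    idds_score and bleuvar are caller-supplied scoring functions; k is the
    candidate-set size, p the random fraction, tau the uncertainty cap.
    """
    rng = random.Random(seed)
    excluded = set(excluded or ())
    n_random = round(p * s)   # assumed split: p*s random picks
    n_guided = s - n_random   # the rest come from the guided branch

    # 1) Diversity: keep the k highest-scoring candidates not yet excluded.
    eligible = [d for d in unlabeled if d not in excluded]
    candidates = sorted(eligible, key=idds_score, reverse=True)[:k]

    # 2) Uncertainty: score only the candidates, then drop those above tau.
    scored = [(bleuvar(d), d) for d in candidates]
    kept = [(u, d) for u, d in scored if u <= tau]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    guided = [d for _, d in kept[:n_guided]]

    # 3) Exclusion set: examined candidates are skipped in later steps.
    excluded.update(candidates)

    # 4) Exploration: fill the remaining slots with random documents.
    leftovers = [d for d in unlabeled if d not in guided]
    n_fill = min(s - len(guided), len(leftovers))
    randoms = rng.sample(leftovers, n_fill)
    return guided + randoms, excluded
```

Calling `dual_select(list(range(20)), lambda d: d, lambda d: d / 20, k=5, s=4, p=0.25, tau=0.92)` with these toy scorers returns three guided picks followed by one random pick, and the most uncertain candidate is dropped by the cap.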

Problem Statement

Modern summarization models can reach strong performance with small labeled sets, so choosing which documents to label matters. Existing active-learning methods focus on either uncertainty (risking noisy samples) or diversity (risking limited exploration). For summarization, these approaches give inconsistent results and are often beaten by random sampling.

Main Contribution

DUAL algorithm: combine in-domain diversity (IDDS) with uncertainty (BLEUVar via MC dropout), plus random sampling and an exclusion set to avoid oversampling regions.

Large empirical study: 3 models (BART, PEGASUS, FLAN-T5) on 4 datasets (AESLC, Reddit TIFU, WikiHow, BillSum) with repeated runs (6 seeds) and ROUGE evaluation.
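To make the diversity component concrete, the sketch below scores each unlabeled document by how similar it is to the rest of the unlabeled pool minus how similar it is to the already labeled set, which is the general idea behind IDDS. The exact formula, the cosine similarity, and the weight `lam=0.5` are assumptions for illustration and may differ from what the paper uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def idds_scores(unlabeled, labeled, lam=0.5):
    """Score unlabeled embeddings: reward similarity to the unlabeled pool,
    penalize similarity to documents that are already labeled.

    lam is an illustrative weight, not a value taken from the paper.
    """
    scores = []
    for i, x in enumerate(unlabeled):
        others = [u for j, u in enumerate(unlabeled) if j != i]
        sim_u = sum(cosine(x, u) for u in others) / len(others) if others else 0.0
        sim_l = sum(cosine(x, v) for v in labeled) / len(labeled) if labeled else 0.0
        scores.append(lam * sim_u - (1 - lam) * sim_l)
    return scores
```

With this scoring, a document that duplicates something already labeled gets a low score, so the top-k candidates drift toward regions of the pool that are dense but not yet covered.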

Key Findings

DUAL frequently matches or yields the best ROUGE-1 among compared strategies on evaluated benchmarks.

Numbers: FLAN-T5, AESLC, Iter 15: DUAL R1 = 35.57 vs. Random 35.51 (Table B2)

Practical Use: Use DUAL as a safe default AL strategy for summarization when you want consistent gains across models and datasets; gains may be small per step but stable.

Evidence Ref: Table B2 (ROUGE-1 scores, Iter 15), Fig. 2

DUAL reduces selection of outliers while keeping diversity compared to random or uncertainty-only selection.

Numbers: Diversity vs. outlier plots show DUAL with mid-to-high diversity and a lower outlier score across all datasets (Fig. 4)

Practical Use: If your labeled set must avoid noisy or atypical examples, DUAL produces cleaner, more representative labeled pools than BAS (Bayesian Active Summarization) or pure random sampling.

Evidence Ref: Figure 4 (Diversity vs. Outlier Scores)
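The two quantities compared in this finding can be approximated as follows. This is a sketch under stated assumptions: diversity is taken as the mean pairwise Euclidean distance among selected embeddings, and the outlier score as the mean distance from each selected embedding to its k nearest neighbours in the pool. The paper's exact definitions may differ, and the function names are hypothetical.

```python
import math
from itertools import combinations


def diversity_score(selected):
    """Mean pairwise Euclidean distance among the selected embeddings."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def outlier_score(selected, pool, k=3):
    """Mean distance from each selected embedding to its k nearest pool
    neighbours; larger values mean the selection sits in sparse regions.

    Assumes selected items are the same objects as in pool (identity is
    used to skip self-distance) and that pool has at least two vectors.
    """
    total = 0.0
    for x in selected:
        dists = sorted(math.dist(x, y) for y in pool if y is not x)
        nearest = dists[:k]
        total += sum(nearest) / len(nearest)
    return total / len(selected)
```

A selection strategy that looks good on this view has a high `diversity_score` together with a low `outlier_score`.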

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence Ref |
|---|---|---|---|---|---|
| ROUGE-1 (example) | FLAN-T5 AESLC Iter15: 35.57 | Random Iter15: 35.51 | +0.06 | AESLC, FLAN-T5 (Iter 15) | Table B2 |
| ROUGE-1 (example) | BART AESLC Iter15: 27.21 | Random Iter15: 26.92 | +0.29 | AESLC, BART (Iter 15) | Table B2 |

What To Try In 7 Days

Reproduce DUAL on your summarization task with B=150 and s=10 to check whether labeling fewer, better examples helps.

Compute domain-adapted embeddings (TSDAE) once and use IDDS top-k to limit expensive uncertainty passes.

Tune the uncertainty cap τ to filter noisy candidates and add p≈0.1–0.3 random samples per iteration for exploration.
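A quick way to sanity-check these settings before a pilot is to write them down as a small config and derive the per-iteration numbers. The sketch assumes B is the total labeling budget and s the number of documents labeled per iteration; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualBudget:
    budget: int = 150          # B: total documents to label (assumed meaning)
    step: int = 10             # s: documents labeled per iteration (assumed)
    random_frac: float = 0.2   # p: share of each step drawn at random

    @property
    def iterations(self) -> int:
        """Number of active-learning iterations the budget allows."""
        return self.budget // self.step

    @property
    def random_per_step(self) -> int:
        """Documents per iteration picked at random for exploration."""
        return round(self.random_frac * self.step)

    @property
    def guided_per_step(self) -> int:
        """Documents per iteration picked by diversity plus uncertainty."""
        return self.step - self.random_per_step
```

Under these assumptions, B=150 and s=10 give 15 iterations, and p between 0.1 and 0.3 corresponds to 1 to 3 random documents per iteration.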

Optimization Features

Training Optimization: Data-efficient selection via active learning

Inference Optimization: Limit MC-dropout to IDDS top-k to cut selection cost
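The saving from restricting MC dropout to the IDDS top-k is easy to estimate by counting stochastic generation passes. The pool size, k, and number of MC samples below are made-up illustrative values, not figures from the paper.

```python
def generation_passes(num_docs: int, mc_samples: int) -> int:
    """Stochastic summaries generated when every document gets mc_samples passes."""
    return num_docs * mc_samples


# Hypothetical numbers, chosen only to show the arithmetic.
pool_size = 10_000   # unlabeled documents
top_k = 100          # candidates kept after the diversity step
mc_samples = 10      # MC-dropout samples per document

full_cost = generation_passes(pool_size, mc_samples)   # uncertainty over the whole pool
dual_cost = generation_passes(top_k, mc_samples)       # uncertainty over top-k only
speedup = full_cost / dual_cost
```

The ratio depends only on pool size over k, so the saving grows with the size of the unlabeled pool. The diversity step still needs embedding similarities, which are cheap compared with repeated generation.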

Risks & Boundaries

Limitations

Performance depends on the quality of embeddings and TSDAE domain adaptation.

BLEUVar is task-agnostic and may not capture factual or content-preservation uncertainty.
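The second limitation is easier to see with a small BLEU-variance sketch. It assumes BLEUVar is the mean of (1 - BLEU)^2 over all ordered pairs of summaries sampled with dropout enabled, and it uses a simplified, add-one-smoothed sentence BLEU written for this example rather than a reference implementation.

```python
import math
from collections import Counter
from itertools import permutations


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hyp, ref, max_n=4):
    """Simplified smoothed sentence BLEU between two token lists."""
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        overlap = sum((h & r).values())
        total = sum(h.values())
        # Add-one smoothing so a missing n-gram order does not zero the score.
        log_precision += math.log((overlap + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_precision / max_n)


def bleu_variance(summaries):
    """Mean squared BLEU disagreement over ordered pairs of sampled summaries."""
    pairs = list(permutations(summaries, 2))
    if not pairs:
        return 0.0
    return sum((1 - simple_bleu(a.split(), b.split())) ** 2 for a, b in pairs) / len(pairs)
```

Two samples that differ in a single word, such as "profit rose" versus "profit fell", share most n-grams and therefore score as low uncertainty even though they state opposite facts. That is the kind of content-level disagreement a purely lexical measure can miss.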

When Not To Use

When you cannot compute domain embeddings or lack compute for any MC-dropout passes.

When labeling budgets are extremely large and random sampling is already adequate.

Failure Modes

If IDDS embeddings are poor, DUAL may still focus on unrepresentative regions despite random sampling.

If τ (the uncertainty cap) is set too high, the algorithm may include noisy samples; if it is set too low, it may discard all candidates.
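One way to catch a badly set τ early is to check what fraction of candidates survives the cap for a few trial values before running a full selection step. The helper names and the trial values are hypothetical, and this check is a suggestion rather than part of the published method.

```python
def survival_rate(uncertainties, tau):
    """Fraction of candidates whose uncertainty is at or below tau."""
    if not uncertainties:
        return 0.0
    kept = sum(1 for u in uncertainties if u <= tau)
    return kept / len(uncertainties)


def sweep_tau(uncertainties, taus):
    """Map each trial tau to the share of candidates it would keep."""
    return {tau: survival_rate(uncertainties, tau) for tau in taus}


def check_tau(uncertainties, tau):
    """Return a short diagnosis for a given tau on observed uncertainties."""
    rate = survival_rate(uncertainties, tau)
    if rate == 0.0:
        return "too strict: every candidate is discarded"
    if rate == 1.0:
        return "no effect: no candidate is filtered"
    return "ok"
```

Running `sweep_tau` on the uncertainty scores from one candidate batch shows where the cap starts to bite; values that keep everything or nothing correspond to the two failure cases described above.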

Core Entities

Models

BART, PEGASUS, FLAN-T5, BERT (for embeddings), MPNet (evaluated but not used), TSDAE (embedding domain adaptation)

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, BLEU variance (BLEUVar), Diversity score (avg Euclidean dist.), Outlier score (KNN density), Sample selection time (s)

Datasets

AESLC, Reddit TIFU (long), WikiHow, BillSum

Benchmarks

ROUGE, BLEUVar