DUAL: pick samples that are both representative and uncertain to label fewer summaries more effectively

March 2, 2025, 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.65

Citation Count

0

Authors

Petros Stylianos Giouroukis, Alexios Gidiotis, Grigorios Tsoumakas

Why It Matters For Business

DUAL cuts labeling waste by choosing documents that are both representative of the domain and informative for the model. This improves robustness and lowers selection compute compared to running uncertainty estimation over the full unlabeled pool.

Summary TLDR

DUAL is a simple active-learning method for abstractive summarization that first picks a small diverse set of candidate documents (via embeddings), then ranks those by model uncertainty (BLEU variance with MC dropout), discards extreme-noise candidates, and mixes in random samples. Across 3 summarization models and 4 datasets, DUAL usually matches or improves over pure uncertainty, pure diversity, and random sampling while selecting fewer outliers and lowering sample-selection time versus full uncertainty-based selection. Code and datasets are public.
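The uncertainty signal mentioned above can be illustrated with a small sketch. It assumes BLEUVar is the average squared BLEU dissimilarity over all ordered pairs of N stochastic generations for the same document (for example, N decoding passes with dropout left on). The `simple_bleu` function is a hypothetical, simplified stand-in for a real BLEU implementation, and the generations are passed in as plain strings rather than produced by a model.

```python
import math
from collections import Counter
from itertools import permutations


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=2):
    """Simplified BLEU stand-in (assumption, not a full BLEU):
    geometric mean of clipped n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


def bleu_variance(generations):
    """Average of (1 - BLEU)^2 over all ordered pairs of generations."""
    n = len(generations)
    if n < 2:
        return 0.0
    total = sum((1.0 - simple_bleu(a, b)) ** 2
                for a, b in permutations(generations, 2))
    return total / (n * (n - 1))
```

Under this sketch, identical generations give a score of 0 (the model is consistent) and generations with no shared n-grams give a score of 1 (the model is highly uncertain or the input is noisy).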

Problem Statement

Modern summarization models can reach strong performance with small labeled sets, so choosing which documents to label matters. Existing active-learning methods focus on either uncertainty (risking noisy samples) or diversity (risking limited exploration). For summarization these approaches are inconsistent and often beaten by random sampling.

Main Contribution

DUAL algorithm: combine in-domain diversity (IDDS) with uncertainty (BLEUVar via MC dropout), plus random sampling and an exclusion set to avoid oversampling regions.

Large empirical study: 3 models (BART, PEGASUS, FLAN-T5) on 4 datasets (AESLC, Reddit TIFU, WikiHow, BillSum) with repeated runs (6 seeds) and ROUGE evaluation.

Analysis and visualizations: show why IDDS can get stuck, why uncertainty alone picks outliers, and how DUAL balances diversity and robustness.

Public code and reproducible setup (embeddings, TSDAE domain adaptation, hyperparameters) shared on GitHub.
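The diversity component can be sketched as follows. This is a minimal illustration, assuming IDDS scores a candidate by its average similarity to the rest of the unlabeled pool minus its average similarity to the already-labeled set, weighted by a factor λ. Cosine similarity, the default `lam=0.67`, and the function names are assumptions for illustration; the paper's code may define the score differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def idds_scores(unlabeled, labeled, lam=0.67):
    """Score each unlabeled embedding: high when it resembles the unlabeled
    pool and differs from what has already been labeled (assumed form)."""
    scores = []
    for i, x in enumerate(unlabeled):
        others = [u for j, u in enumerate(unlabeled) if j != i]
        sim_pool = sum(cosine(x, u) for u in others) / len(others) if others else 0.0
        sim_labeled = sum(cosine(x, l) for l in labeled) / len(labeled) if labeled else 0.0
        scores.append(lam * sim_pool - (1 - lam) * sim_labeled)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring documents."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

A score of this shape also hints at why pure diversity sampling can get stuck: the highest-scoring candidates tend to come from the densest embedding region until enough of that region has been labeled.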

Key Findings

DUAL frequently matches or yields the best ROUGE-1 among compared strategies on evaluated benchmarks.

Numbers: FLAN-T5 AESLC Iter15: DUAL R1=35.57 vs Random 35.51 (Table B2)

DUAL reduces selection of outliers while keeping diversity compared to random or uncertainty-only selection.

Numbers: Diversity vs. outlier plots show DUAL with mid/high diversity and lower outlier score across all datasets (Fig. 4)

DUAL reduces sample-selection time compared to full uncertainty (BAS) by applying MC dropout only on IDDS top-k candidates.

Numbers: Selection time examples: AESLC BART 24.86s→20.71s (~17% lower); Reddit TIFU FLAN-T5 177.64s→110.50s (~38% lower)

Pure diversity (IDDS) sometimes gets stuck in one embedding region and can hurt learning later.

Numbers: BART on WikiHow: IDDS starts strong but declines after ~40-50 samples and falls below random (learning curves, Fig. 2)

Results

ROUGE-1 (example)

Value: DUAL, FLAN-T5, AESLC, Iter15: 35.57

Baseline: Random, Iter15: 35.51

ROUGE-1 (example)

Value: DUAL, BART, AESLC, Iter15: 27.21

Baseline: Random, Iter15: 26.92

ROUGE-1 (counterexample)

Value: DUAL, PEGASUS, AESLC, Iter15: 24.34

Baseline: IDDS, Iter15: 24.74

Selection time

Value: DUAL, BART, AESLC: 20.71 s

Baseline: BAS: 24.86 s

Selection time

Value: DUAL, FLAN-T5, Reddit TIFU: 110.50 s

Baseline: BAS: 177.64 s
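The approximate percentages quoted for selection time follow from the relative reduction (baseline minus new, divided by baseline). A short check, using only the timings reported above; the helper name is illustrative:

```python
def relative_reduction(baseline_s, new_s):
    """Fraction of the baseline time that was saved."""
    return (baseline_s - new_s) / baseline_s


# AESLC, BART: BAS 24.86 s vs DUAL 20.71 s
print(f"{relative_reduction(24.86, 20.71):.1%}")    # 16.7%

# Reddit TIFU, FLAN-T5: BAS 177.64 s vs DUAL 110.50 s
print(f"{relative_reduction(177.64, 110.50):.1%}")  # 37.8%
```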

Who Should Care

What To Try In 7 Days

Reproduce DUAL on your summarization task with B=150 and s=10 to check whether labeling fewer, better examples helps.

Compute domain-adapted embeddings (TSDAE) once and use IDDS top-k to limit expensive uncertainty passes.

Tune the uncertainty cap τ to filter noisy candidates and add p≈0.1–0.3 random samples per iteration for exploration.
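The three steps above can be combined into one selection iteration. The following is a minimal sketch, not the authors' implementation: it assumes IDDS scores are precomputed, that `uncertainty_fn` returns a BLEUVar-style score for a pool index, and that random samples are drawn from outside the candidate set. The function name, the defaults for `k` and `tau`, and the exact ordering of steps are illustrative assumptions.

```python
import random


def dual_select(idds_scores, uncertainty_fn, excluded,
                s=10, k=30, tau=0.9, p=0.2, rng=None):
    """Pick up to s pool indices for labeling in one iteration (illustrative)."""
    rng = rng or random.Random(0)
    pool = [i for i in range(len(idds_scores)) if i not in excluded]

    # Step 1: keep only the k most representative documents (highest IDDS score).
    candidates = sorted(pool, key=lambda i: idds_scores[i], reverse=True)[:k]
    candidate_set = set(candidates)

    # Step 2: run the expensive uncertainty estimate on those candidates only.
    scored = [(uncertainty_fn(i), i) for i in candidates]

    # Step 3: drop candidates above the cap tau, then take the most uncertain.
    n_random = round(p * s)
    kept = sorted(((u, i) for u, i in scored if u <= tau), reverse=True)
    chosen = [i for _, i in kept[: s - n_random]]

    # Step 4: add random samples from outside the candidate set for exploration.
    outside = [i for i in pool if i not in candidate_set]
    chosen += rng.sample(outside, min(n_random, len(outside)))

    # Assumed exclusion rule: evaluated candidates are not revisited later.
    new_excluded = set(excluded) | candidate_set | set(chosen)
    return chosen, new_excluded
```

Because `uncertainty_fn` is called only for the top-k candidates, the number of MC-dropout passes per iteration is bounded by `k` rather than by the pool size. If `tau` is too low, `kept` can be empty and the iteration returns only the random samples, which is one way the τ failure mode listed later can show up.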

Optimization Features

Training Optimization

  • Data-efficient selection via active learning

Inference Optimization

  • Limit MC-dropout to IDDS top-k to cut selection cost

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Performance depends on the quality of embeddings and TSDAE domain adaptation.
  • BLEUVar is task-agnostic and may not capture factual or content-preservation uncertainty.
  • Experiments use budgets up to B=150; behavior at much larger scales is untested.
  • DUAL still needs upfront embedding computation and MC-dropout passes on candidates.

When Not To Use

  • When you cannot compute domain embeddings or lack compute for any MC-dropout passes.
  • When labeling budgets are extremely large and random sampling is already adequate.
  • When uncertainty must be measured with specialized or human-judged criteria (e.g., factuality) rather than BLEUVar.

Failure Modes

  • If IDDS embeddings are poor, DUAL may still focus on unrepresentative regions despite random sampling.
  • If τ (uncertainty cap) is set too high, the algorithm may include noisy samples; if set too low, it may discard all candidates.
  • Overspecialization: excluding top-k neighbors permanently (E set) can remove useful nearby samples in some domains.

Core Entities

Models

  • BART
  • PEGASUS
  • FLAN-T5
  • BERT (for embeddings)
  • MPNet (evaluated but not used)
  • TSDAE (embedding domain adaptation)

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BLEU variance (BLEUVar)
  • Diversity score (avg Euclidean dist.)
  • Outlier score (KNN density)
  • Sample selection time (s)
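The two embedding-space metrics in this list can be approximated as follows. This is a hedged sketch, assuming the diversity score is the mean pairwise Euclidean distance among selected samples and the outlier score is the mean distance from a sample to its k nearest neighbors in the pool (a simple KNN-density proxy). The function names and `k=3` are illustrative; the paper's exact definitions may differ.

```python
import math
from itertools import combinations


def diversity_score(selected):
    """Mean pairwise Euclidean distance among selected embeddings."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def knn_outlier_score(index, pool, k=3):
    """Mean distance from pool[index] to its k nearest neighbors.
    Larger values mean the point sits in a sparser region."""
    dists = sorted(math.dist(pool[index], q)
                   for j, q in enumerate(pool) if j != index)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0


pool = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(diversity_score([(0, 0), (3, 4)]))  # 5.0
print(knn_outlier_score(4, pool) > knn_outlier_score(0, pool))  # True
```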

Datasets

  • AESLC
  • Reddit TIFU (long)
  • WikiHow
  • BillSum

Benchmarks

  • ROUGE
  • BLEUVar