DUAL: pick samples that are both representative and uncertain to label fewer summaries more effectively

Overview

Decision Snapshot: Ready For Pilot

DUAL is a practical hybrid sampling recipe with reproducible code and multi-model experiments; its gains are consistent but modest and depend on embedding quality and model choice.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 60%

Authors

Petros Stylianos Giouroukis, Alexios Gidiotis, Grigorios Tsoumakas

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DUAL cuts labeling waste by choosing representative but model-informative documents, improving robustness and lowering selection compute compared to full uncertainty methods.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

DUAL is a simple active-learning method for abstractive summarization that first picks a small diverse set of candidate documents (via embeddings), then ranks those by model uncertainty (BLEU variance with MC dropout), discards extreme-noise candidates, and mixes in random samples. Across 3 summarization models and 4 datasets, DUAL usually matches or improves over pure uncertainty, pure diversity, and random sampling while selecting fewer outliers and lowering sample-selection time versus full uncertainty-based selection. Code and datasets are public.
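The four steps in the summary above (diverse candidates, uncertainty ranking, noise filter, random mix-in) can be sketched as one selection round. This is a minimal illustration and not the authors' implementation: the function name `dual_select`, its parameters, the rounding of the random share, and the order in which the exclusion set is updated are all assumptions. Diversity scores and the uncertainty function are taken as given inputs.

```python
import random
from typing import Callable, Dict, List, Set


def dual_select(
    pool: List[str],
    diversity_score: Dict[str, float],
    uncertainty: Callable[[str], float],
    excluded: Set[str],
    k: int,
    s: int,
    tau: float,
    p: float,
    rng: random.Random,
) -> List[str]:
    """Pick up to `s` document ids in one round (hypothetical sketch)."""
    # 1) Diversity stage: keep the k best-scoring documents not yet excluded.
    eligible = [d for d in pool if d not in excluded]
    candidates = sorted(eligible, key=lambda d: diversity_score[d], reverse=True)[:k]

    # 2) Uncertainty stage: score only the k candidates (the expensive step).
    scored = {d: uncertainty(d) for d in candidates}

    # 3) Noise filter: drop candidates whose uncertainty exceeds the cap tau,
    #    then rank the survivors from most to least uncertain.
    kept = [d for d in candidates if scored[d] <= tau]
    ranked = sorted(kept, key=lambda d: scored[d], reverse=True)

    # 4) Mix: roughly a share p of the s slots is reserved for random picks.
    n_random = min(s, max(0, round(p * s)))
    chosen = ranked[: s - n_random]

    # Assumption: the whole candidate set is excluded from later diversity
    # stages so that subsequent rounds look at other regions of the pool.
    excluded.update(candidates)

    # Random picks fill every slot that is still empty, including slots the
    # ranked list could not fill after filtering.
    remaining = [d for d in pool if d not in chosen]
    chosen += rng.sample(remaining, min(s - len(chosen), len(remaining)))
    return chosen
```

In this sketch only `k` documents ever reach the uncertainty function, which is where the reduction in selection time relative to scoring the full pool would come from.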

Problem Statement

Modern summarization models can reach strong performance with small labeled sets, so choosing which documents to label matters. Existing active-learning methods focus on either uncertainty (risking noisy samples) or diversity (risking limited exploration). For summarization these approaches are inconsistent and often beaten by random sampling.

Main Contribution

DUAL algorithm: combine in-domain diversity (IDDS) with uncertainty (BLEUVar via MC dropout), plus random sampling and an exclusion set to avoid oversampling regions.

Large empirical study: 3 models (BART, PEGASUS, FLAN-T5) on 4 datasets (AESLC, Reddit TIFU, WikiHow, BillSum) with repeated runs (6 seeds) and ROUGE evaluation.
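The uncertainty signal named in the first contribution, BLEUVar, measures how much several stochastic generations for the same document disagree with each other. The sketch below assumes the commonly used normalized form, the mean of (1 - BLEU(y_i, y_j))^2 over all ordered pairs of N sampled summaries; whether the paper uses exactly this normalization is an assumption. To stay self-contained, the hypothetical `unigram_overlap` function stands in for a real sentence-level BLEU, and the sampled summaries are plain strings rather than outputs of MC-dropout forward passes.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List


def unigram_overlap(hyp: str, ref: str) -> float:
    """Toy stand-in for sentence BLEU: clipped unigram precision in [0, 1]."""
    hyp_tokens = hyp.split()
    ref_counts = Counter(ref.split())
    if not hyp_tokens:
        return 0.0
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp_tokens).items())
    return matched / len(hyp_tokens)


def bleu_var(
    samples: List[str],
    sim: Callable[[str, str], float] = unigram_overlap,
) -> float:
    """Mean squared dissimilarity over all ordered pairs of sampled summaries."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two stochastic samples")
    # permutations() pairs samples by position, so identical strings produced
    # by different stochastic passes are still compared with each other.
    total = sum((1.0 - sim(a, b)) ** 2 for a, b in permutations(samples, 2))
    return total / (n * (n - 1))
```

Identical samples give a score of 0 and samples with no shared tokens give 1, so higher values indicate more disagreement between passes. In a real setup, `sim` would be replaced by a proper BLEU implementation scaled to [0, 1].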

Key Findings

DUAL frequently matches or yields the best ROUGE-1 among compared strategies on evaluated benchmarks.

Numbers: FLAN-T5 AESLC Iter15: DUAL R1=35.57 vs Random 35.51 (Table B2)

Practical Use: Use DUAL as a safe default AL strategy for summarization when you want consistent gains across models and datasets; gains may be small per-step but stable.

Evidence Ref: Table B2 (ROUGE-1 scores, Iter 15), Fig.2

DUAL reduces selection of outliers while keeping diversity compared to random or uncertainty-only selection.

Numbers: Diversity vs outlier plots show DUAL with mid/high diversity and lower outlier score across all datasets (Fig.4)

Practical Use: If your labeled set must avoid noisy or atypical examples, DUAL produces cleaner, more representative labeled pools than BAS or pure random sampling.

Evidence Ref: Figure 4 (Diversity vs Outlier Scores)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-1 (example)	FLAN-T5 AESLC Iter15: 35.57	Random Iter15: 35.51	+0.06	AESLC, FLAN-T5 (Iter 15)	Table B2	Table B2
ROUGE-1 (example)	BART AESLC Iter15: 27.21	Random Iter15: 26.92	+0.29	AESLC, BART (Iter 15)	Table B2	Table B2
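The deltas in the table can be re-derived from the two score columns, and expressing them relative to the random baseline shows how small the gains are. The snippet below uses only the four values from the table; the helper name `summarize` is hypothetical.

```python
from typing import List, Tuple

# (model, DUAL ROUGE-1, Random ROUGE-1) on AESLC at iteration 15, from the table.
ROWS: List[Tuple[str, float, float]] = [
    ("FLAN-T5", 35.57, 35.51),
    ("BART", 27.21, 26.92),
]


def summarize(rows: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Return (model, absolute delta, relative delta in percent) per row."""
    out = []
    for model, dual, baseline in rows:
        delta = dual - baseline
        out.append((model, round(delta, 2), round(100.0 * delta / baseline, 2)))
    return out


for model, absolute, relative in summarize(ROWS):
    print(f"{model}: +{absolute:.2f} ROUGE-1 ({relative:.2f}% over random)")
```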

What To Try In 7 Days

Reproduce DUAL on your summarization task with B=150 and s=10 to check whether labeling fewer, better examples helps.

Compute domain-adapted embeddings (TSDAE) once and use IDDS top-k to limit expensive uncertainty passes.

Tune the uncertainty cap τ to filter noisy candidates and add p≈0.1–0.3 random samples per iteration for exploration.
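For the second step above, the diversity stage only needs a score per unlabeled document and a top-k cut. The sketch below assumes an IDDS-style score of the form lam * (mean similarity to the unlabeled pool) - (1 - lam) * (mean similarity to the already labeled set), with cosine similarity over precomputed embeddings. The exact formula, the default `lam = 0.67`, and the function names are assumptions for illustration, not details confirmed by this summary.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def idds_style_scores(
    unlabeled: Dict[str, Vector],
    labeled: List[Vector],
    lam: float = 0.67,
) -> Dict[str, float]:
    """Reward similarity to the pool, penalize similarity to labeled data."""
    scores = {}
    for doc_id, vec in unlabeled.items():
        # Mean similarity to the pool (includes the document itself).
        to_pool = sum(cosine(vec, other) for other in unlabeled.values()) / len(unlabeled)
        # Mean similarity to what is already labeled (0.0 in the first round).
        to_labeled = (
            sum(cosine(vec, lab) for lab in labeled) / len(labeled) if labeled else 0.0
        )
        scores[doc_id] = lam * to_pool - (1.0 - lam) * to_labeled
    return scores


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Ids of the k highest-scoring documents, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because the embeddings are computed once, these scores are cheap to refresh each iteration; only the `top_k` output would then go through the MC-dropout uncertainty passes.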

Optimization Features

Training Optimization

Data-efficient selection via active learning

Inference Optimization

Limit MC-dropout to IDDS top-k to cut selection cost

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/pgiouroukis/dual

Data URLs

https://huggingface.co/datasets/Yale-LILY/aeslc

https://huggingface.co/datasets/ctr4si/reddit_tifu

https://huggingface.co/datasets/wangwilliamyang/wikihow

https://huggingface.co/datasets/FiscalNote/billsum

Risks & Boundaries

Limitations

Performance depends on the quality of embeddings and TSDAE domain adaptation.

BLEUVar is task-agnostic and may not capture factual or content-preservation uncertainty.

When Not To Use

When you cannot compute domain embeddings or lack compute for any MC-dropout passes.

When labeling budgets are extremely large and random sampling is already adequate.

Failure Modes

If IDDS embeddings are poor, DUAL may still focus on unrepresentative regions despite random sampling.

If τ (the uncertainty cap) is set too high, the algorithm may include noisy samples; if it is set too low, it may discard all candidates.

Core Entities

Models

BART, PEGASUS, FLAN-T5, BERT (for embeddings), MPNet (evaluated but not used), TSDAE (embedding domain adaptation)

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, BLEU variance (BLEUVar), Diversity score (avg Euclidean dist.), Outlier score (KNN density), Sample selection time (s)
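The two selection-quality metrics listed above can be approximated as follows. This sketch assumes that the diversity score is the average pairwise Euclidean distance among selected embeddings, and that the outlier score is the mean distance from a point to its k nearest neighbors in the pool (larger means more isolated). The second reading of "KNN density" and both function names are assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def avg_pairwise_distance(selected: List[Vector]) -> float:
    """Average Euclidean distance over all unordered pairs of selected points."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def knn_outlier_score(point: Vector, pool: List[Vector], k: int = 3) -> float:
    """Mean distance from `point` to its k nearest neighbors in `pool`.

    `pool` should not contain `point` itself, otherwise a zero distance
    is counted as one of the neighbors.
    """
    nearest = sorted(math.dist(point, other) for other in pool)[:k]
    if not nearest:
        return 0.0
    return sum(nearest) / len(nearest)
```

Under these definitions a desirable labeled set combines a high average pairwise distance with low outlier scores for its members.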

Datasets

AESLC, Reddit TIFU (long), WikiHow, BillSum

Benchmarks

ROUGE, BLEUVar

