Overview
Production Readiness: 0.6
Novelty Score: 0.55
Cost Impact Score: 0.35
Citation Count: 2
Why It Matters For Business
You can get reliable extractive summaries from LLMs with modest adapter tuning (LoRA) and memory-efficient attention, but expect much higher compute costs for long contexts.
Summary TLDR
EYEGLAXS shows that decoder-only LLMs (LLAMA2-7B, ChatGLM2-6B) can be fine-tuned for long-document extractive summarization using LoRA (low-rank adapters), rotary positional interpolation, and FlashAttention2. On PubMed and arXiv, LoRA-tuned LLMs match or slightly beat strong extractive baselines (ROUGE-1 ≈ 49–50, ROUGE-2 ≈ 21–25). These gains come with large compute costs: a single training epoch at 12K context takes ~32–53 hours. Results are promising but were tested only on scientific papers and come from single-run experiments.
Problem Statement
Extractive summarization of long documents is reliable but usually relies on encoder-based models; decoder-only LLMs are underused because long contexts and full fine-tuning are costly. The paper asks: can LLMs be adapted efficiently to extractive summarization of long texts using parameter-efficient fine-tuning plus changes to attention and positional encoding?
Main Contribution
EYEGLAXS: a practical pipeline that fine-tunes decoder-only LLMs for extractive summarization using LoRA, Rotary Positional Embeddings (RoPE) interpolation, and FlashAttention2.
Apply LoRA adapters to Q/K/V and output projections so only a small fraction of parameters are trained while backbone weights stay frozen.
Enable longer contexts via RoPE index interpolation and FlashAttention2, allowing evaluation up to 12K tokens (and RoPE scaling toward 32K).
Show competitive or state-of-the-art ROUGE scores on PubMed and arXiv among extractive methods while profiling large compute costs and data-size effects.
Analyze positional selection bias (models favor start/end sentences) and show LoRA fine-tuning materially improves extractive performance vs frozen LLMs.
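The LoRA idea in the list above (frozen backbone weight plus a small trainable low-rank update on a projection) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the class name `LoRALinear`, the rank, the `alpha` scaling value, and the initialization constants are all assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical helper for illustration only; rank and alpha are assumed values.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen backbone weight, never updated
        self.scale = alpha / rank     # usual LoRA scaling factor
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """Compute W x + scale * B (A x) for an input vector x of length d_in."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        update = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * u[0] for b, u in zip(base, update)]

    def trainable_fraction(self):
        """Share of this layer's parameters that LoRA actually trains."""
        d_out, d_in = len(self.weight), len(self.weight[0])
        lora_params = len(self.A) * (d_in + d_out)
        return lora_params / (d_out * d_in + lora_params)


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=2, alpha=4)
print(layer.forward([1, 2, 3, 4]))   # equals the frozen projection before any training
print(layer.trainable_fraction())
```

In this toy 4x4 case the adapter is half of the layer, but the fraction shrinks quickly as the projection grows, which is why only a small share of a 6B–7B model is trained.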
Key Findings
LoRA fine-tuning substantially improves extractive performance vs frozen LLMs.
LoRA yields similar gains for a different LLM backbone.
EYEGLAXS matches or slightly exceeds prior extractive baselines on PubMed and arXiv.
Longer training/evaluation context improves scores but greatly increases runtime.
LLMs with LoRA can be more data-efficient than encoder baselines on small training sets.
Models show position bias: selections cluster at beginning and end of documents.
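One way to check the position-bias finding on your own model is to bin the selected sentence indices by relative position in each document. The sketch below is an assumed diagnostic, not the paper's analysis code; `position_histogram` and the five-bin default are hypothetical choices.

```python
def position_histogram(selections, n_bins=5):
    """Fraction of selected sentences falling in each relative-position bin.

    selections: list of (selected_indices, n_sentences) pairs, one per document.
    Returns n_bins fractions summing to 1, or all zeros if nothing was selected.
    """
    counts = [0] * n_bins
    for indices, n_sentences in selections:
        for i in indices:
            # Map sentence index i in [0, n_sentences) to a bin in [0, n_bins).
            counts[min(i * n_bins // n_sentences, n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Two toy documents whose selections sit at the start and end.
print(position_histogram([([0, 1, 9], 10), ([0, 19], 20)]))
```

A U-shaped histogram (mass in the first and last bins) would match the bias described above; a flat one would suggest the model is reading mid-document content as well.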
Results
PubMed ROUGE-1 (LLAMA2-7B, 12K)
PubMed ROUGE-1 (ChatGLM2-6B, 12K)
ArXiv ROUGE-1 (ChatGLM2-6B, 12K)
LoRA
Training time per epoch (ChatGLM2-6B)
Who Should Care
What To Try In 7 Days
Run a pilot: fine-tune a public LLM (LLAMA2-7B or ChatGLM2-6B) with LoRA on a small domain dataset and measure ROUGE.
Replace standard attention with FlashAttention2 to test longer context support on available GPUs.
Profile epoch runtime at 4K and 12K to estimate compute budget before scaling production training.
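For the profiling step, a rough budget can be derived from a short timing run before committing to full training. The sketch below assumes you have measured seconds per training step yourself; the function names, the epoch count, and the GPU count are hypothetical inputs, and only the ~32–53 hour per-epoch range at 12K comes from the summary above.

```python
def epoch_hours(seconds_per_step, steps_per_epoch):
    """Extrapolate one epoch's wall-clock hours from a measured step time."""
    return seconds_per_step * steps_per_epoch / 3600.0


def gpu_hour_range(hours_low, hours_high, epochs, gpu_count=1):
    """Return (low, high) GPU-hours for a run, given a per-epoch hour range."""
    if epochs < 1 or gpu_count < 1:
        raise ValueError("epochs and gpu_count must be at least 1")
    return (hours_low * epochs * gpu_count, hours_high * epochs * gpu_count)


# Assumed pilot measurement: 9 seconds per step, 1000 steps per epoch.
print(epoch_hours(9.0, 1000))
# Reported 12K range of ~32-53 hours per epoch, with an assumed 3 epochs.
print(gpu_hour_range(32, 53, epochs=3))
```

Running the same estimate at 4K and 12K gives a direct view of how much the longer context costs before any production training starts.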
Optimization Features
Token Efficiency
- Mean-pooling sentence representation to limit extra tokens
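Mean-pooling builds each sentence representation by averaging the hidden states of the tokens already in that sentence, so no extra per-sentence marker tokens are needed. A minimal sketch, assuming token vectors are plain lists and sentence boundaries are given as half-open `(start, end)` spans; the function name and data layout are illustrative, not taken from the paper.

```python
def mean_pool_sentences(token_vectors, sentence_spans):
    """Average the token vectors inside each [start, end) span.

    token_vectors: list of equal-length float lists, one per token.
    sentence_spans: list of (start, end) token offsets, one per sentence.
    """
    pooled = []
    for start, end in sentence_spans:
        if end <= start:
            raise ValueError("empty sentence span")
        span = token_vectors[start:end]
        dim = len(span[0])
        pooled.append([sum(tok[d] for tok in span) / len(span) for d in range(dim)])
    return pooled


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool_sentences(tokens, [(0, 2), (2, 3)]))
```

Each pooled vector would then be fed to a classifier that decides whether the sentence belongs in the summary.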
Infra Optimization
- Single A10 GPU processing up to 12K tokens with FlashAttention2
Model Optimization
- LoRA
- RoPE interpolation for longer positions
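RoPE interpolation rescales position indices so that a longer sequence maps back into the position range the model saw during pretraining. The sketch below shows the general technique only; the sequence lengths, the `base` of 10000, and the function names are assumptions for illustration and are not taken from the paper's configuration.

```python
import math


def interpolation_scale(trained_len, target_len):
    """Factor that compresses positions so target_len fits within trained_len."""
    return min(1.0, trained_len / target_len)


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a (scaled) position."""
    pos = position * scale
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, angles):
    """Rotate consecutive (even, odd) pairs of vec by the matching angle."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


# Assumed lengths: a model trained at 4096 positions, stretched to 8192.
scale = interpolation_scale(4096, 8192)
# Position 8000 in the long sequence gets the same angles as position 4000 unscaled.
print(rope_angles(8000, 8, scale=scale) == rope_angles(4000, 8))
```

Because the rotation only changes direction and not length, query and key norms are unchanged; interpolation affects only how relative distance is encoded.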
System Optimization
- DeepSpeed ZeRO stage 1 for training scaling
Training Optimization
- LoRA
- Gradient checkpointing
- bf16 mixed precision
- 8-bit Adam optimizer (adam8bit)
Inference Optimization
- FlashAttention2 to reduce memory for long contexts
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments run once per setting (single-run), so variance and robustness are unmeasured.
- High training cost for long contexts (each training epoch at 12K context takes tens of hours).
- Full fine-tuning is not explored; LoRA limits what can be optimized.
- Evaluated only on scientific domains (PubMed, arXiv); generalization to other domains untested.
- ChatGLM2 appears sensitive to training data size and may need more labels to match LLAMA2 behavior.
When Not To Use
- You have tight GPU/time budgets and cannot afford long-context training.
- You need guaranteed behavior in domains not similar to PubMed/arXiv without further validation.
- You require end-to-end full fine-tuning or custom backbone changes that LoRA cannot address.
- High-stakes applications (legal/medical) where single-run results and limited evaluation are insufficient.
Failure Modes
- Position bias: model over-selects sentences near document start and end.
- Performance variability unknown due to single-run reporting.
- High compute cost may force shorter contexts, reducing recall of mid-document content.
- Possible underperformance when domain labeled data is very small (ChatGLM2 sensitivity).
Core Entities
Models
- LLAMA2-7B-32K-Instruct
- ChatGLM2-6B-32K
- LoRA
Metrics
- ROUGE-1 F1
- ROUGE-2 F1
- ROUGE-L F1
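For a quick pilot, ROUGE-1 and ROUGE-2 F1 can be approximated with a few lines of n-gram overlap. This is a simplified sketch that assumes lowercased whitespace tokenization; standard ROUGE tooling typically adds stemming and other preprocessing, so scores from this function will not exactly match published numbers. The function name is hypothetical.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between a candidate and a reference summary."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge_n_f1("the cat sat", "the cat sat on the mat", n=1))
print(rouge_n_f1("the cat sat", "the cat sat on the mat", n=2))
```

The function returns values in [0, 1]; the ROUGE figures quoted in this summary (for example ROUGE-1 ≈ 49–50) are on a 0–100 scale.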
Datasets
- PubMed (Cohan et al. 2018)
- arXiv (Cohan et al. 2018)
Benchmarks
- ROUGE-1
- ROUGE-2
- ROUGE-L

