EYEGLAXS: fine-tune LLMs with LoRA and FlashAttention to extract summaries from long scientific papers

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.35

Citation Count

2

Authors

Léo Hemamou, Mehdi Debiane

Why It Matters For Business

You can get reliable extractive summaries from LLMs with modest adapter tuning (LoRA) and memory-efficient attention (FlashAttention2), but expect much higher compute costs for long contexts.

Summary TLDR

EYEGLAXS shows that decoder-only LLMs (LLAMA2-7B, ChatGLM2-6B) can be fine-tuned for long-document extractive summarization using LoRA (low-rank adapters), rotary positional interpolation, and FlashAttention2. On PubMed and arXiv, LoRA-tuned LLMs match or slightly beat strong extractive baselines (ROUGE-1 ≈ 49–50, ROUGE-2 ≈ 21–25). These gains come at a large compute cost: a single training epoch at 12K context takes roughly 32–53 hours. Results are promising, but they were tested only on scientific papers and come from single-run experiments.

Problem Statement

Extractive summarization of long documents is reliable but usually uses encoder-based models; decoder-only LLMs are underused because long contexts and full fine-tuning are costly. The paper asks: can we efficiently adapt LLMs to extractive summarization of long texts using parameter-efficient fine-tuning and attention / positional tweaks?

Main Contribution

EYEGLAXS: a practical pipeline that fine-tunes decoder-only LLMs for extractive summarization using LoRA, Rotary Positional Embeddings (RoPE) interpolation, and FlashAttention2.

Apply LoRA adapters to Q/K/V and output projections so only a small fraction of parameters are trained while backbone weights stay frozen.

Enable longer contexts via RoPE index interpolation and FlashAttention2, allowing evaluation up to 12K tokens (and RoPE scaling toward 32K).

Show competitive or state-of-the-art ROUGE scores on PubMed and arXiv among extractive methods while profiling large compute costs and data-size effects.

Analyze positional selection bias (models favor start/end sentences) and show LoRA fine-tuning materially improves extractive performance vs frozen LLMs.
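
The LoRA idea in the contributions above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the class `LoRALinear`, the helper `lora_param_ratio`, the rank `r`, and the scaling factor `alpha` are all hypothetical names and values chosen for illustration. The sketch only shows the general mechanism, a frozen weight plus a trainable low-rank update.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_param_ratio(d_in, d_out, r):
    """Adapter parameter count divided by the frozen weight's parameter count."""
    return r * (d_in + d_out) / (d_in * d_out)


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen backbone weight, never updated during training
        # A starts small and random, B starts at zero, so the adapter
        # contributes nothing until training updates B.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r  # assumed scaling convention

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]
```

For an illustrative 4096 x 4096 projection and a hypothetical rank of 8, `lora_param_ratio(4096, 4096, 8)` gives 1/256, so the adapter adds roughly 0.4% as many parameters as the frozen weight it modifies.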

Key Findings

LoRA fine-tuning substantially improves extractive performance vs frozen LLMs.

Numbers: PubMed 4K ChatGLM2: R1 42.79 -> 49.96 (+7.17)

LoRA yields similar gains for a different LLM backbone.

Numbers: PubMed 4K LLAMA2-7B: R1 42.38 -> 49.48 (+7.10)

EYEGLAXS matches or slightly exceeds prior extractive baselines on PubMed and arXiv.

Numbers: PubMed R1 ≈ 50.3 (LLAMA2-12K) vs GoSum 49.83; arXiv R1 ≈ 49.02 (ChatGLM2-12K) vs GoSum 48.61

Longer training/evaluation context improves scores but greatly increases runtime.

Numbers: ChatGLM2 epoch time on arXiv: 4K -> 8h08m; 12K -> 52h54m

LLMs with LoRA can be more data-efficient than encoder baselines on small training sets.

Models show position bias: selections cluster at beginning and end of documents.
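
One simple way to check for the position bias noted above is to bin the selected sentence indices by their relative position in the document. The sketch below is an assumption about how such a check could be done, not the paper's analysis code; the function name `position_histogram` and the default of five bins are hypothetical.

```python
def position_histogram(selected_indices, num_sentences, bins=5):
    """Count selected sentences per relative-position bin.

    Bin 0 covers the start of the document and bin `bins - 1` covers the end.
    """
    counts = [0] * bins
    for idx in selected_indices:
        rel = idx / num_sentences  # relative position in [0, 1)
        counts[min(int(rel * bins), bins - 1)] += 1
    return counts
```

Calling `position_histogram([0, 1, 2, 50, 97, 98, 99], 100)` returns `[3, 0, 1, 0, 3]`, the kind of U-shaped profile that would indicate selections clustering at the beginning and end.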

Results

PubMed ROUGE-1 (LLAMA2-7B, 12K)

Value: 50.34

Baseline: GoSum 49.83

PubMed ROUGE-1 (ChatGLM2-6B, 12K)

Value: 50.17

Baseline: GoSum 49.83

ArXiv ROUGE-1 (ChatGLM2-6B, 12K)

Value: 49.02

Baseline: GoSum 48.61

LoRA vs frozen backbone (PubMed, ChatGLM2, 4K)

Value: R1 frozen = 42.79 -> LoRA = 49.96

Training time per epoch (ChatGLM2-6B)

Value: 4K: 8h08m, 12K: 52h54m (arXiv)
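
For readers unfamiliar with the metric used in the results above, ROUGE-1 measures unigram overlap between a candidate summary and a reference. The sketch below is a simplified version under stated assumptions: it lowercases and splits on whitespace, with no stemming or other preprocessing, so it will not reproduce the published scores exactly. The function name `rouge1_f1` is hypothetical, and it returns a value in [0, 1], whereas the figures above are on a 0–100 scale.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For a pilot, a maintained ROUGE package is likely a better choice than this sketch, since tokenization and stemming choices shift the numbers.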

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Engineering Lead

What To Try In 7 Days

Run a pilot: fine-tune a public LLM (LLAMA2-7B or ChatGLM2-6B) with LoRA on a small domain dataset and measure ROUGE.

Replace standard attention with FlashAttention2 to test longer context support on available GPUs.

Profile epoch runtime at 4K and 12K to estimate compute budget before scaling production training.
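
The budgeting step above reduces to simple arithmetic on measured epoch times. The sketch below uses the ChatGLM2 arXiv timings reported in this summary (8h08m at 4K, 52h54m at 12K); the helper names and the epoch count in the usage note are hypothetical.

```python
def to_minutes(hours, minutes):
    """Convert an h/m duration to minutes."""
    return hours * 60 + minutes


def epoch_cost_ratio(short_minutes, long_minutes):
    """How many times longer the long-context epoch takes."""
    return long_minutes / short_minutes


def training_hours(epoch_minutes, num_epochs):
    """Wall-clock hours for a given number of epochs."""
    return epoch_minutes * num_epochs / 60


epoch_4k = to_minutes(8, 8)     # 488 minutes, reported for 4K context
epoch_12k = to_minutes(52, 54)  # 3174 minutes, reported for 12K context
ratio = epoch_cost_ratio(epoch_4k, epoch_12k)
```

Here `ratio` is about 6.5, so tripling the context length cost roughly 6.5 times the epoch time in the reported setup. A hypothetical three-epoch run at 12K would need `training_hours(epoch_12k, 3)`, about 158.7 hours, on comparable hardware.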

Optimization Features

Token Efficiency

Mean-pooling sentence representation to limit extra tokens
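
Mean-pooling here presumably means that each sentence is represented by the average of its token vectors, which avoids inserting an extra marker token per sentence. A minimal sketch under that assumption follows; the function name, the span format (start inclusive, end exclusive), and the plain-list vectors are all illustrative choices rather than details from the paper.

```python
def mean_pool_sentences(token_vectors, sentence_spans):
    """Average the token vectors inside each (start, end) span.

    `end` is exclusive, and every span is assumed to be non-empty.
    """
    pooled = []
    for start, end in sentence_spans:
        tokens = token_vectors[start:end]
        dim = len(tokens[0])
        pooled.append(
            [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        )
    return pooled
```

With token vectors `[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]` and spans `[(0, 2), (2, 3)]`, the result is `[[2.0, 3.0], [5.0, 6.0]]`: one vector per sentence, which a classifier head could then score for inclusion in the summary.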

Infra Optimization

Single A10 GPU processing up to 12K tokens with FlashAttention2

Model Optimization

LoRA
RoPE interpolation for longer positions
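
RoPE interpolation can be sketched as rescaling position indices so that a longer sequence still maps into the position range seen during pre-training. The code below is a minimal illustration of that idea, not the paper's code; the function names are hypothetical, and the base of 10000 is the common RoPE default assumed here.

```python
def interpolated_positions(seq_len, trained_len):
    """Scale positions so the longest index stays below trained_len."""
    scale = min(1.0, trained_len / seq_len)  # no scaling for short inputs
    return [i * scale for i in range(seq_len)]


def rope_angles(position, dim, base=10000.0):
    """Rotation angles for one (possibly fractional) position."""
    return [position / (base ** (2 * k / dim)) for k in range(dim // 2)]
```

For example, `interpolated_positions(8, 4)` yields `[0.0, 0.5, 1.0, ..., 3.5]`, so eight tokens occupy the range originally used by four. The same scaling applied to a 12K input with an assumed 4K trained length compresses indices by a factor of three before the angles are computed.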

System Optimization

DeepSpeed ZeRO stage 1 for training scaling

Training Optimization

LoRA
Gradient checkpointing
bf16 mixed precision
8-bit Adam optimizer (adam8bit)

Inference Optimization

FlashAttention2 to reduce memory for long contexts
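
A rough calculation shows why memory matters at long context. Standard attention materializes a score matrix of size sequence length squared per head, whereas FlashAttention computes attention in tiles without storing that full matrix. The sketch below estimates the size of the materialized scores for a single layer and a single sequence; the head count of 32 and 2 bytes per value (bf16) are assumptions used for illustration, and the function name is hypothetical.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_value=2):
    """Bytes needed to store one layer's full attention score matrices."""
    return seq_len * seq_len * num_heads * bytes_per_value


GIB = 2 ** 30
scores_4k = attention_scores_bytes(4096, 32)    # 1 GiB under these assumptions
scores_12k = attention_scores_bytes(12288, 32)  # 9 GiB under these assumptions
```

Under these assumptions, going from 4K to 12K tokens multiplies the score-matrix footprint by nine for one layer alone, which is the cost that tiled attention avoids.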

Reproducibility

Data Urls

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Experiments run once per setting (single-run), so variance and robustness are unmeasured.
High training cost for long contexts (at 12K context, a single training epoch takes tens of hours).
Full fine-tuning is not explored; LoRA limits what can be optimized.
Evaluated only on scientific domains (PubMed, arXiv); generalization to other domains untested.
ChatGLM2 appears sensitive to training data size and may need more labels to match LLAMA2 behavior.

When Not To Use

You have tight GPU/time budgets and cannot afford long-context training.
You need guaranteed behavior in domains not similar to PubMed/arXiv without further validation.
You require end-to-end full fine-tuning or custom backbone changes that LoRA cannot address.
High-stakes applications (legal/medical) where single-run models and limited evaluation are insufficient.

Failure Modes

Position bias: model over-selects sentences near document start and end.
Performance variability unknown due to single-run reporting.
High compute cost may force shorter contexts, reducing recall of mid-document content.
Possible underperformance when domain labeled data is very small (ChatGLM2 sensitivity).