A practical pipeline and datasets to adapt general LLMs into telecom-specialized models and benchmarks

July 12, 2024 · 7 min read

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

7

Authors

Hang Zou, Qiyang Zhao, Yu Tian, Lina Bariah, Faouzi Bader, Thierry Lestable, Merouane Debbah

Why It Matters For Business

Fine-tuning 7–8B LLMs on telecom-specific text and tasks yields substantial gains in document classification, math modeling and code tasks, at far lower cost than training a telecom model from scratch.

Summary TLDR

The authors present a three-stage pipeline (continual pretraining, instruction tuning, alignment tuning) plus three telecom datasets (OpenTelecom, TelecomInstruct, TelecomAlign) to turn general LLMs into telecom-focused LLMs. They build new telecom benchmarks (Telecom Math Modeling, Telecom Open QnA, Telecom Code Tasks) and show that fine-tuned 7–8B models (Llama/Mistral variants) close gaps with much larger SOTA models on telecom math, classification, QA and code tasks. Experiments are small-scale (≤8B models, limited compute) and focus on text-only data.

Problem Statement

Mainstream LLMs lack deep telecom knowledge and specific evaluation suites. Training telecom models from scratch is costly. We need a practical, low-cost way to adapt existing LLMs so they understand telecom standards, math models, code and documents and can be measured with telecom-specific benchmarks.

Main Contribution

Design a three-stage adaptation pipeline: telecom continual pretraining, instruction tuning, and alignment tuning (DPO).

Assemble OpenTelecom (≈1.68B tokens) and two task datasets (TelecomInstruct, TelecomAlign) for pretraining, SFT and preference tuning.

Create three new telecom-focused benchmarks: Telecom Math Modeling, Telecom Open QnA (incl. TeleQnA extension), and Telecom Code Tasks, plus a 3GPP Tdoc classification suite.

Fine-tune and evaluate 7–8B models (Llama2-7B, Llama3-8B, Mistral-7B) showing clear gains vs base instruct models and competitive results with larger SOTA on telecom tasks.

Key Findings

Domain adaptation via instruction tuning and alignment improved telecom math equation recovery.

Numbers: Llama3-8B-TI-TA MathBERT average score 49.45 vs GPT-4 49.38; share of cases scoring ≥90%: 9.52% vs GPT-4 3.77%

Telecom document (3GPP) classification improved substantially after telecom tuning.

Numbers: Llama3-8B-TI overall accuracy 75.3% vs GPT-4o 38.94% on 16 working-group classification

Continual pretraining on telecom data yielded measurable MCQ gains.

Numbers: Llama2-7B accuracy rose ≈4% after continual pretraining on OpenTelecom

Instruction tuning and alignment improve code and open QA relevance.

Numbers: Code Rouge1 for Mistral-7B: Instruct 0.3639 → TI 0.5701; Telecom open QA Rouge1 for Llama3-8B: Instruct 0.0552 → TI-TA 0.416

Results

Telecom Math Modeling (MathBERT avg)

Value: 49.45 (Llama3-8B-TI-TA)

Baseline: 49.38 (GPT-4)

Accuracy (3GPP Tdoc classification, 16 working groups)

Value: 75.3% (Llama3-8B-TI)

Baseline: 38.94% (GPT-4o)

Accuracy (telecom MCQ)

Value: ≈4% increase (Llama2-7B after continual pretraining on OpenTelecom)

Baseline: Llama2-7B before continual pretraining

Code Rouge1 (code summary, Mistral-7B)

Value: 0.5701 (Mistral-7B-TI)

Baseline: 0.3639 (Mistral-7B-Instruct)

Who Should Care

What To Try In 7 Days

Assemble a small OpenTelecom-style corpus (standards, papers, code) and run a brief continual pretrain on your base model.

Create 500–1k practical telecom instruction examples (Tdoc classification, code infill, math modeling) and run QLoRA SFT.

Collect a simple preference set and run DPO to make outputs concise and aligned for engineers.
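
To make the preference-tuning step more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not the authors' training code. The record layout (`prompt`, `chosen`, `rejected`), the function names and the default `beta=0.1` are illustrative assumptions, and the log-probabilities would in practice come from the tuned model and a frozen reference model.

```python
import math

# Hypothetical preference record: one prompt with a preferred and a
# dispreferred answer. The field names are an assumption, not the
# TelecomAlign schema.
example_pair = {
    "prompt": "<telecom question or task>",
    "chosen": "<concise, correct answer preferred by an engineer>",
    "rejected": "<verbose or incorrect answer>",
}


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities.

    policy_* are log-probs under the model being tuned, ref_* under the
    frozen reference model. beta scales how strongly the policy is pushed
    away from the reference (0.1 here is an assumed value).
    """
    policy_margin = policy_chosen - policy_rejected
    ref_margin = ref_chosen - ref_rejected
    return -log_sigmoid(beta * (policy_margin - ref_margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy prefers the chosen answer more than the reference does: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls below log(2) only when the tuned model prefers the chosen answer more strongly than the reference model does, which is the signal DPO optimizes.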

Optimization Features

Infra Optimization

  • SFT

Model Optimization

  • LoRA
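
As a rough illustration of why LoRA keeps fine-tuning cheap, the sketch below counts trainable parameters for one adapted weight matrix. LoRA freezes a d_out × d_in weight and trains two low-rank factors, giving rank × (d_out + d_in) trainable parameters. The rank of 16 and the 4096 × 4096 layer shape are assumed values for illustration, not settings reported in the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix.

    The update is B @ A with B of shape (d_out, rank) and A of shape
    (rank, d_in); the original weight stays frozen.
    """
    return rank * (d_out + d_in)


def lora_fraction(d_out: int, d_in: int, rank: int) -> float:
    """LoRA trainable parameters as a fraction of the full matrix size."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    # Assumed example: a 4096 x 4096 projection with rank 16.
    d, r = 4096, 16
    print(lora_trainable_params(d, d, r))   # 131072
    print(f"{lora_fraction(d, d, r):.4%}")  # 0.7812%
```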

System Optimization

  • FSDP for memory-efficient training

Training Optimization

  • Continual pretraining on filtered telecom corpus
  • LoRA

Inference Optimization

  • KV caching, FlashAttention and MoE are discussed but not applied in the experiments

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments limited to model sizes ≤8B due to GPU limits; results may not scale linearly to larger models.
  • Framework and benchmarks handle only text; radio signals and multi-modal inputs are not included.
  • The preprint does not release code or datasets, which limits direct reproducibility.

When Not To Use

  • For hard real-time URLLC decision making, where extremely low latency and strict guarantees are required.
  • When you need multi-modal (radio-wave) modeling; the system is text-only.
  • If strict regulatory or certified outputs are required without human oversight.

Failure Modes

  • Hallucinations in code or specification answers despite domain tuning.
  • Uneven coverage: Tdoc classification is more accurate on RAN texts than on SA texts.
  • Alignment tuning can slightly reduce MCQ accuracy because of the preference-selection strategy.

Core Entities

Models

  • Llama2-7B
  • Llama3-8B
  • Mistral-7B
  • GPT-4
  • GPT-3.5

Metrics

  • MathBERT score (semantic equation similarity)
  • Accuracy
  • Rouge (code and open QA)
  • ≥90% and ≥50% MathBERT thresholds
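
The metrics above, apart from the MathBERT score itself, can be sketched in a few lines of Python. This is a simplified illustration, not the paper's evaluation code: Rouge1 is computed here as unigram-overlap F1 with whitespace tokenization and no stemming (the paper may use a different variant), and the threshold helper assumes MathBERT scores have already been computed elsewhere on a 0–100 scale.

```python
from collections import Counter
from typing import Sequence


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified Rouge1: F1 over clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of exact label matches, e.g. for working-group classification."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def share_at_or_above(scores: Sequence[float], threshold: float) -> float:
    """Fraction of cases whose score meets a threshold (e.g. 90 or 50)."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    print(rouge1_f1("compute the sinr per user", "compute sinr for each user"))
    print(accuracy(["RAN1", "SA2"], ["RAN1", "RAN2"]))
    print(share_at_or_above([95.0, 40.0, 90.0, 10.0], 90.0))
```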

Datasets

  • OpenTelecom
  • TelecomInstruct
  • TelecomAlign
  • TeleQnA (extended)

Benchmarks

  • Telecom Math Modeling
  • Telecom Open QnA
  • Telecom Code Tasks
  • 3GPP Tdoc Classification