Use off-the-shelf LLMs plus arithmetic coding to losslessly compress gradients

September 26, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.8

Cost Impact Score

0.6

Citation Count

0

Authors

Hui-Po Wang, Mario Fritz

Why It Matters For Business

LM-GC losslessly shrinks gradient payloads by roughly 6%–17% relative to the best baseline codec, which lowers network costs in federated or distributed training. The current runtime is slow, however, and needs systems work before production use.

Summary TLDR

The authors introduce LM-GC: convert 32-bit gradients into grouped hexadecimal text, feed that text to a frozen pre-trained LLM to get token probabilities, and apply arithmetic coding to compress losslessly. Proper serialization (hex + separators) yields up to 38× token savings and improves lossless compression over general-purpose codecs by about 10%–17.2% on evaluated image-model gradients. LM-GC also combines with quantization and sparsification but is currently slow (≈4 hours to compress 28 MB). Code is available.
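
The serialization step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `serialize` and `deserialize`, the big-endian byte order, and the default space separator are all assumptions made for the example.

```python
import struct

def serialize(grads, group_bytes=4, sep=" "):
    """Pack values as big-endian float32 (an assumption) and emit grouped hex text."""
    raw = struct.pack(f">{len(grads)}f", *grads)
    # One group per 4 bytes, so each group is exactly one float32 value.
    groups = [raw[i:i + group_bytes].hex() for i in range(0, len(raw), group_bytes)]
    return sep.join(groups)

def deserialize(text, sep=" "):
    """Invert serialize(): drop separators, decode hex, unpack float32 values."""
    raw = bytes.fromhex(text.replace(sep, ""))
    return list(struct.unpack(f">{len(raw) // 4}f", raw))

grads = [0.01, -0.5, 0.0, 1.0]
text = serialize(grads)
print(text)  # 3c23d70a bf000000 00000000 3f800000

# The text maps back to the same float32 bit patterns, so the step is lossless.
assert serialize(deserialize(text)) == text
```

The separators carry no information of their own; they only mark float boundaries in the text that the LLM sees, and they are stripped again before decoding.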

Problem Statement

Gradient arrays are high-dimensional and structured, but existing lossless compressors lack a strong statistical prior tailored to gradients. Training a gradient-specific generative model is costly. The paper asks: can off-the-shelf LLMs act as zero-shot priors to enable practical lossless gradient compression?

Main Contribution

LM-GC: a pipeline that serializes 32-bit floats into grouped hexadecimal text, queries a frozen LLM for token probabilities, and uses arithmetic coding for lossless compression.

Showed serialization matters: grouped hex with separators dramatically improves token efficiency and compression compared to raw or ISO encodings.

Empirical gains: LM-GC outperforms standard codecs (PNG, FLAC, GZIP, LZMA, FPZIP) by 10%–17.2% on gradients from multiple architectures and datasets.

Demonstrated compatibility with lossy methods (quantization, sparsification) and released code.
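
To make the role of the prior concrete, here is a toy arithmetic coder over hex text. It is a sketch under strong assumptions: `zero_heavy_model` is a hypothetical stand-in for the LLM's next-token probabilities, the coder works on characters rather than LLM tokens, and it uses exact rational arithmetic where a practical coder would use finite precision. All names are illustrative.

```python
import math
from fractions import Fraction

ALPHABET = "0123456789abcdef "  # hex digits plus the space separator

def uniform_model(context):
    """Baseline prior: every symbol equally likely, context ignored."""
    return {s: Fraction(1, len(ALPHABET)) for s in ALPHABET}

def zero_heavy_model(context):
    """Toy prior favouring '0'; a hypothetical stand-in for an LLM's output."""
    probs = {s: Fraction(1, 32) for s in ALPHABET}
    probs["0"] = Fraction(1, 2)  # 16 symbols * 1/32 + 1/2 = 1
    return probs

def symbol_interval(probs, sym):
    """Return (cumulative probability before sym, probability of sym)."""
    cum = Fraction(0)
    for s in ALPHABET:
        if s == sym:
            return cum, probs[s]
        cum += probs[s]
    raise ValueError(f"symbol {sym!r} not in alphabet")

def encode(text, model):
    """Narrow [low, low + width) once per symbol, then name a point inside it."""
    low, width = Fraction(0), Fraction(1)
    for i, sym in enumerate(text):
        cum, p = symbol_interval(model(text[:i]), sym)
        low += width * cum
        width *= p
    n = 0
    while True:  # shortest n-bit binary fraction k / 2**n inside the interval
        k = math.ceil(low * 2**n)
        if Fraction(k, 2**n) < low + width:
            return k, n
        n += 1

def decode(k, n, length, model):
    """Replay the same model to find which sub-interval holds k / 2**n."""
    value = Fraction(k, 2**n)
    low, width = Fraction(0), Fraction(1)
    out = ""
    for _ in range(length):
        probs = model(out)
        for s in ALPHABET:
            cum, p = symbol_interval(probs, s)
            if low + width * cum <= value < low + width * (cum + p):
                out += s
                low, width = low + width * cum, width * p
                break
    return out

text = "3c23d70a 00000000 00000000"
k_uni, bits_uni = encode(text, uniform_model)
k_zero, bits_zero = encode(text, zero_heavy_model)
assert decode(k_zero, bits_zero, len(text), zero_heavy_model) == text
print(bits_uni, bits_zero)
```

In this example 17 of the 26 characters are `0`, so the ideal cost under the zero-heavy prior is 17·1 + 9·5 = 62 bits, against about 106 bits (26·log2 17) under the uniform prior. The better the prior predicts the next symbol, the shorter the code, which is the reason for putting an LLM in that position.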

Key Findings

LM-GC improves lossless compression vs. best baseline on evaluated datasets.

Numbers: 17.2% improvement (TinyImageNet vs FPZIP); 5.9% (CIFAR-10); 8.8% (MNIST)

Serializing floats as grouped hexadecimal tokens with separators gives large token savings and affects compression strongly.

Numbers: ≈38× token efficiency; serialization choices caused up to ~70% compression difference (ISO vs Hs)

Bigger LLMs and larger context windows improve compression performance.

Numbers: performance improves from 1.1B to 7B models; larger context windows, up to 4096 tokens, improve compression (Fig. 2)

Throughput is currently a major bottleneck.

Numbers: about 4 hours to compress 28 MB with the current implementation

Results

Compression improvement over best baseline

Value: 17.2% (TinyImageNet)

Baseline: FPZIP

Compression improvement over best baseline

Value: 5.9% (CIFAR-10)

Baseline: FPZIP

Token efficiency (bytes→tokens)

Value: ≈38× token savings (hex vs plain)

Baseline: plain gradient text

Throughput

Value: ≈4 hours per 28 MB
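
The throughput figure is easier to reason about as bytes per second. The arithmetic below assumes 1 MB = 10^6 bytes and assumes runtime scales linearly with payload size; the 100 MB payload is a hypothetical example, not a measurement from the paper.

```python
payload_bytes = 28 * 10**6   # 28 MB, assuming 1 MB = 10**6 bytes
elapsed_s = 4 * 3600         # about 4 hours

throughput = payload_bytes / elapsed_s  # bytes per second
print(f"{throughput:.0f} B/s")          # 1944 B/s, i.e. roughly 1.9 KB/s

def hours_to_compress(n_bytes):
    """Projected runtime, assuming it scales linearly with payload size."""
    return n_bytes / throughput / 3600

# Hypothetical 100 MB gradient payload.
print(f"{hours_to_compress(100 * 10**6):.1f} h")  # 14.3 h
```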

Who Should Care

What To Try In 7 Days

Serialize a small gradient checkpoint to grouped hex with space separators and run LM-GC with a TinyLlama model to measure compression against your current codec.

Profile the pipeline to find the bottleneck (LLM inference or arithmetic coding), then try a quantized LLM or a faster arithmetic coder.

Combine LM-GC with your existing quantization/sparsification to see additive bandwidth savings on a small training run.
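
Before running LM-GC, it helps to record what general-purpose codecs achieve on the same bytes. The sketch below uses only Python's standard library. The synthetic Gaussian values stand in for a real gradient checkpoint, and "compression rate" is taken here to mean compressed size as a percentage of original size; both are assumptions for the example.

```python
import lzma
import random
import struct
import zlib

def compression_rate(original: bytes, compressed: bytes) -> float:
    """Compressed size as a percentage of the original (lower is better)."""
    return 100.0 * len(compressed) / len(original)

# Synthetic stand-in for a gradient checkpoint: small Gaussian values as float32.
random.seed(0)
values = [random.gauss(0.0, 1e-3) for _ in range(10_000)]
raw = struct.pack(f">{len(values)}f", *values)

codecs = [
    ("zlib", zlib.compress, zlib.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]
results = {}
for name, compress, decompress in codecs:
    blob = compress(raw)
    assert decompress(blob) == raw  # confirm the codec is lossless on this input
    results[name] = compression_rate(raw, blob)

for name, rate in results.items():
    print(f"{name}: {rate:.1f}% of original size")
```

Swapping `raw` for the bytes of a real gradient tensor gives the baseline numbers to compare any LM-GC result against.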

Optimization Features

Token Efficiency

  • Grouped hexadecimal serialization with separators
  • Byte grouping aligned to float boundaries, so each group holds one value's sign, exponent, and mantissa bits

Infra Optimization

  • Balance context window size against hardware memory
  • Use A100-like GPUs or optimized inference stacks

Model Optimization

  • Use larger LLMs for better priors

System Optimization

  • Move arithmetic coding to optimized C++ or faster CPU
  • Parallelize token probability computation
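
One way to parallelize is to split the gradient bytes into float-aligned chunks and code each chunk independently. The sketch below shows that structure only: `zlib` is a stand-in for the LLM-plus-arithmetic-coder stage, and the chunk size, worker count, and function names are hypothetical. Whether independent chunks suit LM-GC depends on how much compression is lost when context cannot cross chunk boundaries.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

FLOAT_BYTES = 4  # one float32 value

def split_chunks(raw, floats_per_chunk):
    """Cut the byte string on float32 boundaries so no value straddles chunks."""
    step = floats_per_chunk * FLOAT_BYTES
    return [raw[i:i + step] for i in range(0, len(raw), step)]

def compress_parallel(raw, floats_per_chunk=256, workers=4):
    """Compress chunks concurrently; zlib stands in for the real coder."""
    chunks = split_chunks(raw, floats_per_chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, chunks))  # map() preserves order

def decompress_all(blobs):
    """Decode each chunk and concatenate in the original order."""
    return b"".join(zlib.decompress(b) for b in blobs)

raw = bytes(range(256)) * 40  # 10,240 bytes of placeholder data
blobs = compress_parallel(raw)
assert decompress_all(blobs) == raw
```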

Training Optimization

  • Combine with quantization and sparsification
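
A lossy step can run first and hand its output to the lossless stage. The sketch below uses top-k sparsification as one common example. The source does not say which sparsification variant the authors evaluate, so the scheme, the index/value packing format, and the use of `zlib` as a stand-in for the lossless coder are all assumptions.

```python
import struct
import zlib

def top_k_sparsify(grads, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    idx = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:k]
    return [(i, grads[i]) for i in sorted(idx)]

def pack_sparse(pairs):
    """Serialize each pair as a uint32 index plus a float32 value (8 bytes)."""
    return b"".join(struct.pack(">If", i, v) for i, v in pairs)

def unpack_sparse(blob):
    """Invert pack_sparse()."""
    return [struct.unpack(">If", blob[o:o + 8]) for o in range(0, len(blob), 8)]

grads = [0.5, -2.0, 0.25, 0.0, 1.0]
pairs = top_k_sparsify(grads, k=2)       # lossy: drops the three smallest entries
payload = zlib.compress(pack_sparse(pairs))  # lossless stage (stand-in codec)

# The lossless stage returns exactly what the lossy stage produced.
assert unpack_sparse(zlib.decompress(payload)) == pairs
print(pairs)  # [(1, -2.0), (4, 1.0)]
```

Quantization slots in the same way: reduce precision first, then pass the resulting bytes to the lossless coder.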

Inference Optimization

  • Quantize LLMs, faster attention, KV-cache optimizations

Reproducibility

Data Urls

  • MNIST: http://yann.lecun.com/exdb/mnist
  • CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
  • TinyImageNet: https://tiny-imagenet.herokuapp.com/

Code Available

  • yes

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Current implementation is slow: ~4 hours to compress 28 MB.
  • Experiments limited to image-model gradients (ConvNet, VGG, ResNet, ViT).
  • Only three LLM sizes tested (1.1B–7B); behavior for very large or very small models is unknown.
  • Serialization must match data structure; wrong choices can hurt compression.
  • Arithmetic coding and single-thread CPU parts create practical bottlenecks.

When Not To Use

  • When you need low-latency gradient transfer or real-time training checkpoints.
  • When you lack GPU/LLM inference infrastructure or want minimal CPU overhead.
  • For tiny payloads where encoding overhead may outweigh savings.

Failure Modes

  • Poor serialization (e.g., ISO or no separators) increases compressed size.
  • Small LLM or short context window fails to model dependencies, losing gains.
  • Arithmetic-coder implementation errors or CPU limits cause impractical runtimes.
  • Distribution shift: gradients from very different models/data may not match the LLM prior.

Core Entities

Models

  • TinyLlama 1.1B
  • OpenLLaMA 3B
  • Llama 2 7B

Metrics

  • compression rate (%)
  • token efficiency (×)
  • bytes compressed
  • throughput (time per MB)

Datasets

  • MNIST
  • CIFAR-10
  • TinyImageNet

Context Entities

Models

  • PNG
  • FLAC
  • GZIP
  • LZMA
  • FPZIP
  • Run-length encoding (RLE)