Use LLM agents + runtime profiling to pick layerwise pruning and post-training dynamic quantization automatically

September 6, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

Why It Matters For Business

Profiling-guided, LLM-driven compression automates model tuning for latency and memory limits. It reduces manual trial-and-error and can make large vision models viable on CPU- or memory-constrained edge servers with minimal accuracy loss.

Summary TLDR

ProfilingAgent is a modular system that uses runtime profiling and LLM-based agents to pick layer-specific structured pruning and dynamic post-training quantization. On vision models (ResNet-101, ViT/DeiT, Swin) the method keeps accuracy near baseline while cutting memory by up to ~74% and speeding up CPU inference by up to ~1.74×. The system is practical for CPU-bound deployments and for iteratively finding pruning/quantization trade-offs without exhaustive grid search.

Problem Statement

Pruning and quantization are often applied uniformly or with simple heuristics. That ignores per-layer runtime bottlenecks (latency, memory) and architectural heterogeneity, leading to suboptimal accuracy-latency-memory trade-offs. Manual tuning is slow and brittle across model families.

Main Contribution

A modular pipeline that collects static (MACs, params) and dynamic (latency, memory) profiling traces and feeds them to LLM agents.

An LLM-guided Analysis Agent that returns structured, layerwise pruning and dynamic quantization recommendations.

An Iterative Pruning Agent that runs multi-round, feedback-guided structured pruning to find better accuracy vs latency trade-offs.
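
To make the data flow concrete, here is a minimal sketch of how profiling records could be serialized for an LLM and how a structured reply could be validated before use. The field names, the prompt wording, the JSON schema, the `fake_llm` stub, and the numbers are all assumptions for illustration; the paper's actual prompt and schema are not reproduced here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LayerProfile:
    # Hypothetical record: static metrics plus dynamic profiling metrics.
    name: str
    macs: int
    params: int
    latency_ms: float
    peak_mem_mb: float


def build_prompt(profiles):
    """Serialize profiling records into a prompt that asks for a JSON plan."""
    payload = json.dumps([asdict(p) for p in profiles], indent=2)
    return (
        "Given these per-layer profiles, return JSON with keys 'prune' "
        "(layer name -> ratio in [0, 1)) and 'quantize' (list of layer names).\n"
        + payload
    )


def parse_plan(reply, known_layers):
    """Parse the reply and reject unknown layers or out-of-range ratios."""
    plan = json.loads(reply)
    prune = plan.get("prune", {})
    quantize = plan.get("quantize", [])
    for layer, ratio in prune.items():
        if layer not in known_layers:
            raise ValueError(f"unknown layer: {layer}")
        if not 0.0 <= ratio < 1.0:
            raise ValueError(f"bad ratio for {layer}: {ratio}")
    for layer in quantize:
        if layer not in known_layers:
            raise ValueError(f"unknown layer: {layer}")
    return {"prune": prune, "quantize": quantize}


def fake_llm(prompt):
    # Stand-in for a real LLM call; returns a fixed, well-formed plan.
    return json.dumps({"prune": {"layer4.conv2": 0.1}, "quantize": ["fc"]})


# Made-up profile values, for illustration only.
profiles = [
    LayerProfile("layer4.conv2", 115_000_000, 2_359_296, 3.2, 12.5),
    LayerProfile("fc", 2_048_000, 2_049_000, 0.4, 8.0),
]
plan = parse_plan(fake_llm(build_prompt(profiles)), {p.name for p in profiles})
```

Validating the reply before it reaches the pruning or quantization step is one way to contain a malformed or overly aggressive plan.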

Key Findings

ProfilingAgent's quantization achieves large memory savings with tiny accuracy loss.

Numbers: memory reduction ≈ 74% and ∆Acc ≤ 0.5% on ImageNet-1K (Tables 3, 4)

Quantization yields clear CPU inference speedups vs ONNX baseline.

Numbers: speedups reported up to 1.74× (ViT-B/16) and 1.73× (DeiT-B/16) (Tables 3, 4)

Agentic, profiling-aware structured pruning keeps accuracy competitive and sometimes improves it on small datasets.

Numbers: ResNet-101 +1% on Imagenette (Table 1); small memory reductions of ~2–7% with <1% accuracy drop on CIFAR (Table 2)

LLM choice matters: stronger reasoning yields safer pruning plans.

Numbers: GPT-4o kept or increased accuracy (e.g., ResNet +1%); GPT-4-Turbo produced more aggressive plans (ResNet -14%) (Table 5)

Pruning can sometimes slow inference if channels become misaligned.

Numbers: slight slowdown observed for ResNet-101 due to misaligned channels and overhead (Sec. 5.2.2, Fig. 5)

Results

Quantization memory reduction

Value: ≈74% (best models)

Baseline: ONNX PTQ, ~73%

Quantization inference speedup

Value: up to 1.74×

Baseline: ONNX PTQ

Pruning parameter reduction

Value: ≈2–3.3% typical (ImageNet)

Baseline: uniform fixed-ratio baselines (1–20%)

Accuracy

Value: small to moderate change; often <1% drop, in some cases +1–2%

Baseline: L1/L2/random baselines

LLM choice effect

Value: GPT-4o safer; GPT-4-Turbo sometimes aggressive

Baseline: same pipeline with different LLMs
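
The ≈74% memory figure is close to what simple arithmetic predicts when most weights move from 32-bit floats to 8-bit integers. The sketch below is a back-of-the-envelope estimate, not the paper's measurement method: the fraction of parameters that sit in quantized layers is a hypothetical input, and per-tensor scale and zero-point overhead is ignored.

```python
def estimated_memory_reduction(quantized_fraction, src_bits=32, dst_bits=8):
    """Fraction of weight memory saved when `quantized_fraction` of the
    parameters shrink from src_bits to dst_bits and the rest are untouched."""
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be in [0, 1]")
    return quantized_fraction * (1 - dst_bits / src_bits)


# Ceiling when every parameter is quantized from fp32 to int8.
ceiling = estimated_memory_reduction(1.0)

# Hypothetical model with 98% of its parameters in quantized layers.
mostly_linear = estimated_memory_reduction(0.98)
```

Under these assumptions the ceiling is 75%, so a reported reduction near 74% would be consistent with a model whose parameters live almost entirely in the quantized Linear layers.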

Who Should Care

What To Try In 7 Days

Run PyTorch Profiler on your model to gather per-layer latency and memory traces.

Apply full dynamic quantization (qint8) to Linear layers and measure model file size and CPU latency.

Prototype a small agent loop: feed profiling JSON to a prompt (as in Fig.4) and validate suggested layerwise quantization/pruning on a held-out subset.
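
A minimal sketch of the first step, assuming a toy `nn.Sequential` model in place of a real vision network. Each top-level child is wrapped in a `record_function` range so that the profiler output can be grouped by layer; the `layer:` label prefix and the output field names are arbitrary choices made here, not something the paper prescribes.

```python
import json

import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, record_function

# Toy stand-in model; replace with the network you want to profile.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()
x = torch.randn(8, 64)


def run_labeled(model, x):
    """Run the model child by child, labeling each child for the profiler."""
    for name, layer in model.named_children():
        with record_function(f"layer:{name}"):
            x = layer(x)
    return x


with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU], profile_memory=True
) as prof:
    run_labeled(model, x)

# Keep only the labeled ranges and turn them into JSON-friendly rows.
rows = [
    {
        "name": event.key,
        "cpu_time_us": event.cpu_time_total,
        "cpu_mem_bytes": event.cpu_memory_usage,
    }
    for event in prof.key_averages()
    if event.key.startswith("layer:")
]
report = json.dumps(rows, indent=2)
```

The resulting `report` string is the kind of serialized trace that could be pasted into a prompt for the third step. A single profiled run is noisy, so averaging several runs at the deployment batch size is advisable.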

Agent Features

Memory

  • Uses profiling traces (tensor sizes, peak memory) as input signals

Planning

  • Iterative pruning loop with evaluation feedback
  • LLM-based analysis to generate multi-step compression plans
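
The iterative loop can be sketched as a propose/evaluate cycle in which each round sees the history of earlier rounds. This is an illustrative skeleton, not the paper's implementation: `propose` stands in for the LLM agent, `evaluate` stands in for the evaluation step, and the toy functions and thresholds below are invented so the example runs on its own.

```python
def iterative_pruning(propose, evaluate, baseline_acc, max_drop=0.02, rounds=5):
    """Return the fastest plan whose accuracy stays within max_drop of baseline.

    propose(history) -> plan               (stand-in for the LLM agent)
    evaluate(plan)   -> (accuracy, latency)
    """
    history = []
    best = None
    for _ in range(rounds):
        plan = propose(history)
        acc, latency = evaluate(plan)
        ok = baseline_acc - acc <= max_drop
        history.append({"plan": plan, "acc": acc, "latency": latency, "ok": ok})
        if ok and (best is None or latency < best["latency"]):
            best = history[-1]
    return best, history


def toy_propose(history):
    # Push the ratio up after a success, back off after a failure.
    if not history:
        return {"ratio": 0.05}
    last = history[-1]
    step = 0.05 if last["ok"] else -0.025
    return {"ratio": round(last["plan"]["ratio"] + step, 3)}


def toy_evaluate(plan):
    # Invented response curves: accuracy falls and latency improves with ratio.
    r = plan["ratio"]
    return 0.80 - r * r, 10.0 * (1 - r)


best, history = iterative_pruning(toy_propose, toy_evaluate, baseline_acc=0.80)
```

Feeding the failed rounds back to the proposer is what lets the loop retreat from an overly aggressive plan instead of committing to it.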

Tool Use

  • PyTorch Profiler
  • Hugging Face model/processor retrieval
  • ONNX Runtime (baseline)
  • PyTorch quantize_dynamic

Frameworks

  • Prompt-based LLM reasoning (structured JSON outputs)
  • DependencyGraph for safe structured pruning

Is Agentic

true

Architectures

  • LLM-guided multi-agent pipeline
  • Modular agents: Acquisition, Profiling, Analysis, Pruning, Quantization, Evaluation, Iterative Pruning

Collaboration

  • Multiple agents exchange serialized profiling/eval reports

Optimization Features

Infra Optimization

  • Designed for CPU-bound acceleration; uses PyTorch quantize_dynamic

Model Optimization

  • Structured channel/head pruning
  • Layer-selective structured pruning using regex patterns
  • Dependency-aware pruning to keep model structure valid

System Optimization

  • Per-layer profiling to identify CPU/GPU bottlenecks
  • Evaluation agent measures end-to-end latency and memory after changes
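
End-to-end latency measurements are sensitive to warmup effects and noise. The helper below is a generic sketch of one common approach (warmup runs followed by the median of repeated timings); the function names and defaults are assumptions made here, not the paper's evaluation code.

```python
import statistics
import time


def measure_latency(fn, warmup=3, repeats=10):
    """Median wall-clock seconds for one call of fn, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, optimized_s):
    """Ratio above 1.0 means the optimized variant is faster."""
    return baseline_s / optimized_s


# Placeholder workload; in practice fn would run model inference on a fixed batch.
latency = measure_latency(lambda: sum(range(10_000)))
```

Measuring the baseline and the compressed model with the same batch size and the same helper keeps the reported speedup comparable.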

Training Optimization

  • No quantization-aware training; pruning evaluated mostly without finetuning

Inference Optimization

  • Post-training dynamic quantization (qint8) applied to Linear layers
  • Profiling-driven selection of layers to quantize/prune for latency gains
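
A minimal sketch of dynamic post-training quantization on Linear layers using PyTorch's `quantize_dynamic`. The toy model is a stand-in, and measuring the serialized `state_dict` is one simple way to compare sizes; it is not necessarily how the paper reports memory.

```python
import io

import torch
from torch import nn
from torch.ao.quantization import quantize_dynamic


def state_dict_bytes(model):
    """Size in bytes of the model's state_dict when serialized in memory."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes


# Toy Linear-heavy model standing in for a transformer's dense layers.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Replace nn.Linear modules with dynamically quantized int8 versions.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

fp32_size = state_dict_bytes(model)
int8_size = state_dict_bytes(quantized)
reduction = 1 - int8_size / fp32_size

# The quantized model is called like the original one.
with torch.no_grad():
    out = quantized(torch.randn(4, 512))
```

Dynamic quantization stores weights as int8 and quantizes activations on the fly at inference time, so no calibration dataset is needed. Accuracy should still be checked on a held-out subset after the swap.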

Reproducibility

Data Urls

  • ImageNet-1K (public dataset)
  • Imagenette (public subset)
  • CIFAR-10
  • CIFAR-100

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Depends on LLM quality: weaker models can produce overly aggressive pruning plans (Table 5).
  • Reported gains measured on specific hardware (A100, H200) and CPU setups; effects may differ on other platforms.
  • Pruning was mostly evaluated without finetuning; larger pruning ratios may need retraining for acceptable accuracy.
  • Quantization focused on dynamic PTQ for Linear layers; it does not cover low-bit or advanced static calibration techniques.

When Not To Use

  • When you cannot run layerwise profiling (no privileged runtime access).
  • When you require extremely aggressive pruning that must be followed by retraining.
  • If your deployment uses hardware with different quantization primitives (e.g., custom accelerators) not targeted by PyTorch dynamic quantization.

Failure Modes

  • LLM returns overly aggressive pruning plan and causes sudden accuracy collapse (observed with GPT-4-Turbo).
  • Channel misalignment after structured pruning increases runtime overhead and slows inference (ResNet example).
  • Profiling noise or batch-size mismatch can mislead the Analysis Agent into wrong layer priorities.
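
One common mitigation for channel misalignment is to round the number of kept channels to a hardware-friendly multiple. The sketch below assumes a multiple of 8, which is a guess made here for illustration; the right value depends on the kernels and hardware in use, and the summary does not say whether the paper applies such a rule.

```python
def aligned_keep_count(channels, prune_ratio, multiple=8):
    """Channels to keep after pruning, rounded up to a multiple of `multiple`.

    The result never exceeds the original channel count.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    keep = channels - int(channels * prune_ratio)
    aligned = -(-keep // multiple) * multiple  # ceiling to the next multiple
    return min(channels, max(multiple, aligned))


# 256 channels at a 10% ratio: 231 would be kept, rounded up to 232.
kept = aligned_keep_count(256, 0.1)
```

Rounding up prunes slightly less than requested, which trades a little compression for channel counts that are less likely to hit slow code paths.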

Core Entities

Models

  • ResNet-101
  • ViT-B/16
  • Swin-Base
  • DeiT-B/16

Metrics

  • Accuracy
  • Memory reduction (%)
  • Inference latency (s)
  • Parameter count (M)

Datasets

  • ImageNet-1K
  • Imagenette
  • CIFAR-10
  • CIFAR-100

Benchmarks

  • ImageNet-1K evaluation (classification)
  • Imagenette
  • CIFAR-10
  • CIFAR-100