Use LLM agents + runtime profiling to pick layerwise pruning and post-training dynamic quantization automatically

September 6, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Clear engineering value: strong empirical gains for quantization and moderate pruning wins. Dependence on LLM reliability and specific hardware means some adaptation is needed before wide production rollout.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

Links

Abstract / PDF / Data

Why It Matters For Business

Profiling-guided, LLM-driven compression automates model tuning for latency and memory limits. It reduces manual trial-and-error and can make large vision models viable on CPU- or memory-constrained edge servers with minimal accuracy loss.

Who Should Care

Summary TLDR

ProfilingAgent is a modular system that uses runtime profiling and LLM-based agents to pick layer-specific structured pruning and dynamic post-training quantization. On vision models (ResNet-101, ViT/DeiT, Swin) the method keeps accuracy near baseline while cutting memory up to ~74% and speeding CPU inference up to ~1.74×. The system is practical for CPU-bound deployments and for iteratively finding pruning/quantization trade-offs without exhaustive grid search.

Problem Statement

Pruning and quantization are often applied uniformly or with simple heuristics. That ignores per-layer runtime bottlenecks (latency, memory) and architectural heterogeneity, leading to suboptimal accuracy-latency-memory trade-offs. Manual tuning is slow and brittle across model families.

Main Contribution

A modular pipeline that collects static (MACs, params) and dynamic (latency, memory) profiling traces and feeds them to LLM agents.

An LLM-guided Analysis Agent that returns structured, layerwise pruning and dynamic quantization recommendations.
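As a rough illustration of what such a profiling trace might look like, the sketch below merges static and dynamic measurements into one per-layer record and serializes the result for an agent. The field names (`macs`, `params`, `latency_ms`, `peak_mem_mb`), the layer names, and all numbers are hypothetical; this summary does not give the paper's actual schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LayerProfile:
    """One layer's static and dynamic measurements (hypothetical schema)."""
    name: str
    macs: int           # static: multiply-accumulate operations
    params: int         # static: parameter count
    latency_ms: float   # dynamic: measured forward latency
    peak_mem_mb: float  # dynamic: measured peak memory


def build_report(profiles, top_k=2):
    """Serialize per-layer profiles and flag the top_k latency bottlenecks."""
    total = sum(p.latency_ms for p in profiles)
    ranked = sorted(profiles, key=lambda p: p.latency_ms, reverse=True)
    report = {
        "total_latency_ms": total,
        "bottlenecks": [p.name for p in ranked[:top_k]],
        "layers": [
            {**asdict(p), "latency_share": round(p.latency_ms / total, 3)}
            for p in profiles
        ],
    }
    return json.dumps(report, indent=2)


# Illustrative values only, not measurements from the paper.
profiles = [
    LayerProfile("blocks.0.attn.qkv", 348_000_000, 1_771_776, 4.0, 12.5),
    LayerProfile("blocks.0.mlp.fc1", 464_000_000, 2_362_368, 5.0, 16.0),
    LayerProfile("blocks.0.norm1", 150_000, 1_536, 1.0, 2.0),
]
report_json = build_report(profiles)
print(report_json)
```

Including a precomputed `latency_share` and a short `bottlenecks` list is one way to spare the LLM from doing arithmetic over raw traces; whether the original system does this is not stated here.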

Key Findings

ProfilingAgent's quantization achieves large memory savings with tiny accuracy loss.

Numbers: Mem.Red ≈ 74% and ∆Acc ≤ 0.5% on ImageNet-1K (Tables 3, 4)

Practical Use: If you need to shrink model storage and speed up CPU inference, apply profiling-guided dynamic quantization to all Linear layers first; expect ≈74% smaller model files with <0.5% top-1 drop on the evaluated models.

Evidence Ref: Table 3, Table 4
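A back-of-the-envelope check shows why a figure near 74% is plausible for transformer models: moving Linear weights from 32-bit floats to 8-bit integers saves at most 75%, and whatever stays in fp32 pulls the total slightly below that. The sketch below works this through; the 85M/1M parameter split is a hypothetical example, not a count taken from the paper.

```python
def memory_reduction(linear_weight_params, other_params,
                     fp_bytes=4, int_bytes=1):
    """Fraction of model size saved when only Linear weights go fp32 -> int8.

    Ignores the small overhead of quantization scales and zero points.
    """
    before = (linear_weight_params + other_params) * fp_bytes
    after = linear_weight_params * int_bytes + other_params * fp_bytes
    return 1 - after / before


# Hypothetical split: 85M Linear weights, 1M remaining fp32 parameters.
reduction = memory_reduction(85_000_000, 1_000_000)
print(f"{reduction:.1%}")  # 74.1%
```

The same arithmetic suggests why convolution-heavy models would see smaller savings from Linear-only quantization: a larger share of their parameters would stay in fp32.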

Quantization yields clear CPU inference speedups vs ONNX baseline.

Numbers: Speedups reported up to 1.74× (ViT-B/16) and 1.73× (DeiT-B/16) (Tables 3, 4)

Practical Use: For CPU-limited serving, dynamically quantize Linear layers (qint8) to get ~1.3–1.7× faster inference vs ONNX dynamic PTQ in their setup.

Evidence Ref: Table 3, Table 4
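To check a speedup claim like this on your own hardware, a small timing harness is enough. The sketch below is an assumption about how one might measure it, not the paper's benchmarking code: it uses the median over repeated runs (the paper reports average inference time), and the two lambdas are stand-in workloads to be replaced with real baseline and quantized inference calls.

```python
import statistics
import time


def median_latency(fn, warmup=3, runs=20):
    """Median wall-clock seconds of fn() after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, optimized_s):
    """Speedup factor: values above 1.0 mean the optimized model is faster."""
    return baseline_s / optimized_s


# Stand-in workloads; replace with your baseline and quantized inference calls.
baseline = median_latency(lambda: sum(i * i for i in range(20_000)))
optimized = median_latency(lambda: sum(i * i for i in range(10_000)))
print(f"speedup: {speedup(baseline, optimized):.2f}x")
```

Warm-up calls matter on CPU because the first few inferences often include one-time allocation and caching costs that would otherwise inflate the baseline.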

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Quantization memory reduction | ≈74% (best models) | ONNX PTQ ~73% | +~1–2% vs ONNX | ImageNet-1K (Table 3) | ProfilingAgent achieves ~74% Mem.Red across ViT/DeiT/Swin, slightly higher than ONNX. | Table 3
Quantization inference speedup | up to 1.74× | ONNX PTQ | 1.36–1.74× across models | ImageNet-1K (Tables 3, 4) | Measured average inference time reduced vs ONNX, giving up to 1.74× on ViT-B/16. | Table 3, Table 4

What To Try In 7 Days

Run PyTorch Profiler on your model to gather per-layer latency and memory traces.

Apply full dynamic quantization (qint8) to Linear layers and measure model file size and CPU latency.

Prototype a small agent loop: feed profiling JSON to a prompt (as in Fig.4) and validate suggested layerwise quantization/pruning on a held-out subset.
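For the third step, a minimal prototype might look like the sketch below: wrap the profiling JSON in a prompt, ask for JSON-only output, and validate the reply before acting on it. Everything here is an assumption for illustration: the prompt wording is not the one from Fig. 4, `fake_llm` is a stand-in for a real model call, and the action names and the 0.5 ratio cap are hypothetical choices.

```python
import json

ALLOWED_ACTIONS = {"quantize", "prune", "skip"}


def build_prompt(profile_json):
    """Wrap a profiling report in instructions asking for JSON-only output."""
    return (
        "You are a model-compression assistant. Given this per-layer profile, "
        "reply with a JSON list of objects with keys "
        '"layer", "action" (quantize|prune|skip), and "ratio" (0 to 1).\n\n'
        + profile_json
    )


def parse_plan(reply, known_layers, max_ratio=0.5):
    """Validate the model's reply; drop entries that fail basic checks."""
    plan = []
    for item in json.loads(reply):
        if item.get("layer") not in known_layers:
            continue  # the LLM named a layer that does not exist
        if item.get("action") not in ALLOWED_ACTIONS:
            continue  # unknown action
        ratio = float(item.get("ratio", 0.0))
        # Cap the ratio so a single bad suggestion cannot gut a layer.
        item["ratio"] = min(max(ratio, 0.0), max_ratio)
        plan.append(item)
    return plan


def fake_llm(prompt):
    """Stand-in for a real LLM call; returns a canned reply."""
    return json.dumps([
        {"layer": "blocks.0.mlp.fc1", "action": "prune", "ratio": 0.9},
        {"layer": "blocks.0.attn.qkv", "action": "quantize", "ratio": 0.0},
        {"layer": "does.not.exist", "action": "prune", "ratio": 0.2},
    ])


known = {"blocks.0.mlp.fc1", "blocks.0.attn.qkv"}
profile_json = json.dumps({"layers": sorted(known)})
plan = parse_plan(fake_llm(build_prompt(profile_json)), known_layers=known)
print(plan)
```

Validating layer names and capping ratios before applying a plan is a cheap guard against the over-aggressive suggestions that weaker LLMs can produce.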

Agent Features

Memory
Uses profiling traces (tensor sizes, peak memory) as input signals
Planning
Iterative pruning loop with evaluation feedback; LLM-based analysis to generate multi-step compression plans
Tool Use
PyTorch Profiler; Hugging Face model/processor retrieval; ONNX Runtime (baseline); PyTorch quantize_dynamic
Frameworks
Prompt-based LLM reasoning (structured JSON outputs); DependencyGraph for safe structured pruning
Is Agentic

Yes
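The iterative pruning loop with evaluation feedback listed under Planning can be sketched in a few lines. This is a simplified stand-in, not the paper's loop: in place of an LLM revising the plan, it uses a fixed backoff rule, and `toy_evaluate`, the 1% tolerance, and the starting ratio are all hypothetical.

```python
def iterative_prune(evaluate, baseline_acc, start_ratio=0.5,
                    max_drop=0.01, backoff=0.5, max_rounds=5):
    """Shrink the pruning ratio until the accuracy drop is acceptable.

    evaluate(ratio) must return accuracy after pruning at that ratio.
    Returns (ratio, accuracy), or (0.0, baseline_acc) if nothing passes.
    """
    ratio = start_ratio
    for _ in range(max_rounds):
        acc = evaluate(ratio)
        if baseline_acc - acc <= max_drop:
            return ratio, acc
        ratio *= backoff  # feedback: too lossy, try a gentler plan
    return 0.0, baseline_acc


def toy_evaluate(ratio):
    """Toy accuracy model: accuracy falls linearly with the pruning ratio."""
    return 0.80 - 0.10 * ratio


ratio, acc = iterative_prune(toy_evaluate, baseline_acc=0.80)
print(ratio, round(acc, 4))
```

In a real pipeline, `evaluate` would prune a copy of the model, run it on a held-out subset, and return top-1 accuracy, so each round costs one evaluation pass.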

Architectures

LLM-guided multi-agent pipeline

Modular agents: Acquisition, Profiling, Analysis, Pruning, Quantization, Evaluation, Iterative Pruning

Collaboration
Multiple agents exchange serialized profiling/eval reports

Optimization Features

Infra Optimization
Designed for CPU-bound acceleration; uses PyTorch quantize_dynamic
Model Optimization
Structured channel/head pruning; layer-selective structured pruning using regex patterns; dependency-aware pruning to keep model structure valid
System Optimization
Per-layer profiling to identify CPU/GPU bottlenecks; Evaluation agent measures end-to-end latency and memory after changes
Training Optimization
No quantization-aware training; pruning evaluated mostly without finetuning
Inference Optimization
Post-training dynamic quantization (qint8) applied to Linear layers; profiling-driven selection of layers to quantize/prune for latency gains
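The regex-based layer selection listed under Model Optimization amounts to matching patterns against module names. The sketch below shows the idea with hypothetical ViT-style layer names; the pattern and the helper `select_layers` are illustrative and not taken from the paper.

```python
import re


def select_layers(layer_names, pattern):
    """Return the layer names that fully match a regular expression."""
    regex = re.compile(pattern)
    return [name for name in layer_names if regex.fullmatch(name)]


# Hypothetical ViT-style layer names.
layer_names = [
    "blocks.0.attn.qkv",
    "blocks.0.mlp.fc1",
    "blocks.1.mlp.fc1",
    "blocks.11.mlp.fc1",
    "head",
]

# MLP input projections in single-digit blocks (0 through 9) only.
selected = select_layers(layer_names, r"blocks\.\d\.mlp\.fc1")
print(selected)  # ['blocks.0.mlp.fc1', 'blocks.1.mlp.fc1']
```

Using `fullmatch` rather than `search` avoids accidentally selecting `blocks.11` when the pattern is meant to cover only early blocks.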

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ImageNet-1K (public dataset), Imagenette (public subset), CIFAR-10, CIFAR-100

Risks & Boundaries

Limitations

Depends on LLM quality: weaker models can produce overly aggressive pruning plans (Table 5).

Reported gains measured on specific hardware (A100, H200) and CPU setups; effects may differ on other platforms.

When Not To Use

When you cannot run layerwise profiling (no privileged runtime access).

When you require extremely aggressive pruning that must be followed by retraining.

Failure Modes

The LLM returns an overly aggressive pruning plan, causing a sudden accuracy collapse (observed with GPT-4-Turbo).

Channel misalignment after structured pruning increases runtime overhead and slows inference (ResNet example).
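One common mitigation for the channel-misalignment failure mode is to round the number of kept channels to a hardware-friendly multiple. The sketch below assumes a multiple of 8, which is a conventional choice and not something this summary attributes to the paper; `aligned_keep_count` is a hypothetical helper.

```python
def aligned_keep_count(channels, prune_ratio, multiple=8):
    """Channels to keep after pruning, rounded up to a multiple.

    Rounding up prunes slightly less than requested but keeps tensor
    shapes friendly to vectorized kernels.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    keep = int(round(channels * (1.0 - prune_ratio)))
    aligned = -(-keep // multiple) * multiple  # ceiling to the next multiple
    # Never keep more channels than the layer has, nor fewer than one group.
    return min(max(aligned, multiple), channels)


print(aligned_keep_count(256, 0.30))  # 184
```

Here a 30% prune of 256 channels would leave 179 channels; rounding up to 184 gives up a little compression in exchange for an aligned shape.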

Core Entities

Models

ResNet-101, ViT-B/16, Swin-Base, DeiT-B/16

Metrics

Accuracy, memory reduction (%), inference latency (s), parameter count (M)

Datasets

ImageNet-1K, Imagenette, CIFAR-10, CIFAR-100

Benchmarks

ImageNet-1K evaluation (classification), Imagenette, CIFAR-10, CIFAR-100