Use LLM agents + runtime profiling to pick layerwise pruning and post-training dynamic quantization automatically

Overview

Decision Snapshot: Needs Validation

Clear engineering value: strong empirical gains for quantization and moderate pruning wins. Dependence on LLM reliability and specific hardware means some adaptation is needed before wide production rollout.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

Links

Abstract / PDF / Data

Why It Matters For Business

Profiling-guided, LLM-driven compression automates model tuning for latency and memory limits. It reduces manual trial-and-error and can make large vision models viable on CPU- or memory-constrained edge servers with minimal accuracy loss.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

ProfilingAgent is a modular system that uses runtime profiling and LLM-based agents to pick layer-specific structured pruning and dynamic post-training quantization. On vision models (ResNet-101, ViT/DeiT, Swin) the method keeps accuracy near baseline while cutting memory by up to ~74% and speeding up CPU inference by up to ~1.74×. The system is practical for CPU-bound deployments and for iteratively finding pruning/quantization trade-offs without exhaustive grid search.

Problem Statement

Pruning and quantization are often applied uniformly or with simple heuristics. That ignores per-layer runtime bottlenecks (latency, memory) and architectural heterogeneity, leading to suboptimal accuracy-latency-memory trade-offs. Manual tuning is slow and brittle across model families.

Main Contribution

A modular pipeline that collects static (MACs, params) and dynamic (latency, memory) profiling traces and feeds them to LLM agents.

An LLM-guided Analysis Agent that returns structured, layerwise pruning and dynamic quantization recommendations.
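To make "structured recommendations" concrete, here is a minimal sketch of how a pipeline might check an LLM-produced plan before acting on it. The JSON field names (`prune`, `quantize`, `pattern`, `ratio`), the function `validate_plan`, and the 0.5 ratio cap are all assumptions made for illustration; the paper's actual schema is not reproduced here.

```python
import json

MAX_RATIO = 0.5  # assumed safety cap, not a value from the paper


def validate_plan(raw: str) -> list[str]:
    """Return a list of problems found in a hypothetical JSON compression plan."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["top-level value must be an object"]
    errors = []
    for entry in plan.get("prune", []):
        ratio = entry.get("ratio")
        if not isinstance(entry.get("pattern"), str):
            errors.append("prune entry needs a string 'pattern'")
        if not isinstance(ratio, (int, float)) or not 0.0 <= ratio <= MAX_RATIO:
            errors.append(f"ratio {ratio!r} outside [0, {MAX_RATIO}]")
    for name in plan.get("quantize", []):
        if not isinstance(name, str):
            errors.append(f"quantize entry {name!r} is not a layer name")
    return errors


# Hypothetical plans: one acceptable, one with an over-aggressive ratio.
good = json.dumps({"prune": [{"pattern": "blocks.*mlp", "ratio": 0.2}], "quantize": ["head"]})
bad = json.dumps({"prune": [{"pattern": "blocks.*mlp", "ratio": 0.9}]})
print(validate_plan(good))
print(validate_plan(bad))
```

Rejecting malformed or out-of-range plans before they reach the pruning step is one inexpensive way to limit the impact of an unreliable LLM response.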

Key Findings

ProfilingAgent's quantization achieves large memory savings with tiny accuracy loss.

Numbers: Memory reduction (Mem.Red) ≈ 74% and ∆Acc ≤ 0.5% on ImageNet-1K (Tables 3 and 4)

Practical Use: If you need to shrink model storage and speed up CPU inference, apply profiling-guided dynamic quantization to all Linear layers first; expect ≈74% smaller model files with <0.5% top-1 drop on the evaluated models.

Evidence Ref: Table 3, Table 4
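A figure close to 74% is what simple arithmetic predicts when almost all weights sit in Linear layers: int8 weights take 1 byte instead of the 4 bytes of float32, so the ceiling is 75%. The sketch below works through that estimate. The parameter split used in the example is hypothetical, and the calculation ignores quantization scales, zero-points, and file-format overhead.

```python
def estimated_memory_reduction(linear_params: int, other_params: int) -> float:
    """Fraction of weight storage saved when only Linear weights go fp32 -> int8.

    Assumes 4 bytes per fp32 weight and 1 byte per int8 weight; ignores
    per-tensor scales/zero-points and serialization overhead.
    """
    fp32_bytes = 4 * (linear_params + other_params)
    mixed_bytes = 1 * linear_params + 4 * other_params
    return 1.0 - mixed_bytes / fp32_bytes


# Hypothetical split: 85M parameters in Linear layers, 1M elsewhere.
print(f"{estimated_memory_reduction(85_000_000, 1_000_000):.1%}")
```

The more parameters live outside Linear layers (for example in convolutions), the further the achievable reduction falls below 75%.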

Quantization yields clear CPU inference speedups vs ONNX baseline.

Numbers: Speedups reported up to 1.74× (ViT-B/16) and 1.73× (DeiT-B/16) (Tables 3 and 4)

Practical Use: For CPU-limited serving, dynamically quantize Linear layers (qint8); the authors report roughly 1.36–1.74× faster inference than ONNX dynamic PTQ in their setup.

Evidence Ref: Table 3, Table 4
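Speedup figures like these are ratios of measured latencies, so they are only as reliable as the timing procedure. The following is a small, generic timing sketch, not the authors' benchmark harness; the helper names, warm-up count, and number of runs are assumptions.

```python
import statistics
import time


def median_latency(fn, warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn() in seconds, after discarding warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s
```

In practice `fn` would wrap one inference call of the baseline model and of the quantized model on the same input, on the same CPU, with the same thread settings.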

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Quantization memory reduction	≈74% (best models)	ONNX PTQ ~73%	+~1–2% vs ONNX	ImageNet-1K (Table 3)	ProfilingAgent achieves ~74% Mem.Red across ViT/DeiT/Swin, slightly higher than ONNX.	Table 3
Quantization inference speedup	up to 1.74×	ONNX PTQ	1.36–1.74× across models	ImageNet-1K (Table 3,4)	Measured average inference time reduced vs ONNX, giving up to 1.74× on ViT-B/16.	Table 3, Table 4

What To Try In 7 Days

Run PyTorch Profiler on your model to gather per-layer latency and memory traces.

Apply full dynamic quantization (qint8) to Linear layers and measure model file size and CPU latency.

Prototype a small agent loop: feed profiling JSON to a prompt (as in Fig.4) and validate suggested layerwise quantization/pruning on a held-out subset.
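For the third step, a prototype can start by turning profiling records into a prompt. The sketch below assumes the profile has already been exported as a list of dictionaries; the record fields (`name`, `latency_ms`, `params`), the template wording, and `build_prompt` are hypothetical and are not the prompt shown in Fig. 4.

```python
import json

# Hypothetical prompt wording; replace with your own instructions.
PROMPT_TEMPLATE = (
    "You are a model-compression assistant.\n"
    "Given the per-layer profile below, return JSON with keys "
    '"prune" and "quantize".\n'
    "Profile:\n{profile}\n"
)


def build_prompt(records: list[dict], top_k: int = 3) -> str:
    """Keep the top_k slowest layers and embed them in the prompt as JSON."""
    hottest = sorted(records, key=lambda r: r["latency_ms"], reverse=True)[:top_k]
    return PROMPT_TEMPLATE.format(profile=json.dumps(hottest, indent=2))


# Hypothetical profiling records for illustration only.
records = [
    {"name": "patch_embed", "latency_ms": 0.4, "params": 590_592},
    {"name": "blocks.0.mlp.fc1", "latency_ms": 2.1, "params": 2_362_368},
    {"name": "blocks.0.attn.qkv", "latency_ms": 1.7, "params": 1_771_776},
]
print(build_prompt(records, top_k=2))
```

Sending only the slowest layers keeps the prompt short; whether that loses information the LLM needs is something to check on your own models.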

Agent Features

Memory

Uses profiling traces (tensor sizes, peak memory) as input signals

Planning

Iterative pruning loop with evaluation feedback

LLM-based analysis to generate multi-step compression plans
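The idea of an iterative loop with evaluation feedback can be sketched without any model at all. In the example below a fixed step schedule stands in for the LLM's next suggestion, and `evaluate` is a caller-supplied function returning accuracy at a given pruning ratio. The function name, step size, and tolerance are assumptions, not the paper's settings.

```python
from typing import Callable


def iterative_prune(
    evaluate: Callable[[float], float],
    baseline_acc: float,
    max_drop: float = 0.01,
    step: float = 0.05,
    max_ratio: float = 0.5,
) -> float:
    """Return the largest tried pruning ratio whose accuracy drop stays within max_drop."""
    best = 0.0
    steps = int(round(max_ratio / step))
    for i in range(1, steps + 1):
        ratio = i * step
        acc = evaluate(ratio)
        if baseline_acc - acc > max_drop:
            break  # feedback says this step went too far; keep the last good ratio
        best = ratio
    return best


# Stub evaluator for illustration: accuracy degrades quadratically with the ratio.
print(iterative_prune(lambda r: 0.80 - 0.2 * r * r, baseline_acc=0.80))
```

Stopping at the first violation and keeping the last accepted ratio is what protects against the sudden accuracy collapse that a single aggressive plan can cause.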

Tool Use

PyTorch Profiler

Hugging Face model/processor retrieval

ONNX Runtime (baseline)

PyTorch quantize_dynamic

Frameworks

Prompt-based LLM reasoning (structured JSON outputs)

DependencyGraph for safe structured pruning

Is Agentic

Yes

Architectures

LLM-guided multi-agent pipeline

Modular agents: Acquisition, Profiling, Analysis, Pruning, Quantization, Evaluation, Iterative Pruni

Collaboration

Multiple agents exchange serialized profiling/eval reports

Optimization Features

Infra Optimization

Designed for CPU-bound acceleration; uses PyTorch quantize_dynamic

Model Optimization

Structured channel/head pruning

Layer-selective structured pruning using regex patterns

Dependency-aware pruning to keep model structure valid
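Regex-based layer selection is simple to prototype: match each module name against the patterns in the plan and prune only the matches. The sketch below shows the selection step alone; the layer names and the helper `select_layers` are hypothetical, and full-string matching is a design assumption.

```python
import re


def select_layers(layer_names: list[str], patterns: list[str]) -> list[str]:
    """Return the layer names that fully match at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in layer_names if any(c.fullmatch(n) for c in compiled)]


# Hypothetical module names in the style of a ViT implementation.
names = [
    "blocks.0.attn.qkv",
    "blocks.0.mlp.fc1",
    "blocks.1.mlp.fc1",
    "head",
]
print(select_layers(names, [r"blocks\.\d+\.mlp\.fc1"]))
# → ['blocks.0.mlp.fc1', 'blocks.1.mlp.fc1']
```

Using `fullmatch` rather than `search` avoids accidentally selecting layers whose names merely contain the pattern as a substring.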

System Optimization

Per-layer profiling to identify CPU/GPU bottlenecks

Evaluation agent measures end-to-end latency and memory after changes

Training Optimization

No quantization-aware training; pruning evaluated mostly without finetuning

Inference Optimization

Post-training dynamic quantization (qint8) applied to Linear layers

Profiling-driven selection of layers to quantize/prune for latency gains

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

ImageNet-1K (public dataset)

Imagenette (public subset)

CIFAR-10

CIFAR-100

Risks & Boundaries

Limitations

Depends on LLM quality: weaker models can produce overly aggressive pruning plans (Table 5).

Reported gains measured on specific hardware (A100, H200) and CPU setups; effects may differ on other platforms.

When Not To Use

When you cannot run layerwise profiling (no privileged runtime access).

When you require extremely aggressive pruning that must be followed by retraining.

Failure Modes

LLM returns overly aggressive pruning plan and causes sudden accuracy collapse (observed with GPT-4-Turbo).

Channel misalignment after structured pruning increases runtime overhead and slows inference (ResNet example).
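One common mitigation for the channel-misalignment failure mode is to round the number of kept channels to a hardware-friendly multiple. The sketch below illustrates that idea only; it is not described as the paper's fix, and the multiple of 8 and the helper `aligned_keep_count` are assumptions that should be validated on the target hardware.

```python
def aligned_keep_count(channels: int, prune_ratio: float, multiple: int = 8) -> int:
    """Channels to keep after pruning, rounded up to a multiple of `multiple`.

    The result never exceeds the original channel count and never drops
    below one full multiple.
    """
    keep = int(round(channels * (1.0 - prune_ratio)))
    aligned = -(-keep // multiple) * multiple  # ceiling to the next multiple
    return max(multiple, min(channels, aligned))


# Pruning 30% of 256 channels would leave 179; alignment keeps 184 instead.
print(aligned_keep_count(256, 0.3))
```

Rounding up trades a slightly smaller parameter reduction for tensor shapes that are less likely to fall off optimized kernel paths.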

Core Entities

Models

ResNet-101

ViT-B/16

Swin-Base

DeiT-B/16

Metrics

Accuracy

Memory reduction (%)

Inference latency (s)

Parameter count (M)

Datasets

ImageNet-1K

Imagenette

CIFAR-10

CIFAR-100

Benchmarks

ImageNet-1K evaluation (classification)

Imagenette

CIFAR-10

CIFAR-100
