Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Profiling-guided, LLM-driven compression automates model tuning for latency and memory limits. It reduces manual trial-and-error and can make large vision models viable on CPU- or memory-constrained edge servers with minimal accuracy loss.
Summary TLDR
ProfilingAgent is a modular system that uses runtime profiling and LLM-based agents to pick layer-specific structured pruning and dynamic post-training quantization. On vision models (ResNet-101, ViT/DeiT, Swin) the method keeps accuracy near baseline while cutting memory up to ~74% and speeding CPU inference up to ~1.74×. The system is practical for CPU-bound deployments and for iteratively finding pruning/quantization trade-offs without exhaustive grid search.
Problem Statement
Pruning and quantization are often applied uniformly or with simple heuristics. That ignores per-layer runtime bottlenecks (latency, memory) and architectural heterogeneity, leading to suboptimal accuracy-latency-memory trade-offs. Manual tuning is slow and brittle across model families.
Main Contribution
A modular pipeline that collects static (MACs, params) and dynamic (latency, memory) profiling traces and feeds them to LLM agents.
An LLM-guided Analysis Agent that returns structured, layerwise pruning and dynamic quantization recommendations.
An Iterative Pruning Agent that runs multi-round, feedback-guided structured pruning to find better accuracy vs latency trade-offs.
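The multi-round, feedback-guided loop can be sketched as below. This is a minimal illustration, not the paper's implementation: `toy_evaluate` is a hypothetical stand-in for the real Evaluation agent, the simple step-halving rule stands in for the LLM's proposals, and the tolerance, step size, and round count are assumed values.

```python
from typing import Callable, List, Tuple


def iterative_pruning(
    evaluate: Callable[[float], float],
    baseline_acc: float,
    max_drop: float = 0.01,
    start_ratio: float = 0.1,
    step: float = 0.1,
    rounds: int = 5,
) -> Tuple[float, List[Tuple[float, float]]]:
    """Search for the largest pruning ratio whose accuracy drop stays within max_drop.

    Returns the best accepted ratio and the (ratio, accuracy) history.
    """
    best_ratio = 0.0
    ratio = start_ratio
    history: List[Tuple[float, float]] = []
    for _ in range(rounds):
        acc = evaluate(ratio)
        history.append((ratio, acc))
        if baseline_acc - acc <= max_drop:
            # Accepted: remember this ratio and try pruning more.
            best_ratio = ratio
            ratio = ratio + step
        else:
            # Rejected: back off to the last good ratio with a smaller step.
            step = step / 2
            ratio = best_ratio + step
    return best_ratio, history


def toy_evaluate(ratio: float) -> float:
    """Hypothetical accuracy curve; a real run would evaluate the pruned model."""
    return 0.80 - 0.5 * ratio * ratio


best_ratio, history = iterative_pruning(toy_evaluate, baseline_acc=0.80)
print(best_ratio, len(history))
```

In a real pipeline each round would prune, re-profile, and evaluate the actual model, and the feedback in `history` would be passed back to the LLM rather than to a fixed rule.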
Key Findings
ProfilingAgent's quantization achieves large memory savings with tiny accuracy loss.
Quantization yields clear CPU inference speedups over the ONNX Runtime baseline.
Agentic, profiling-aware structured pruning keeps accuracy competitive and sometimes improves it on small datasets.
LLM choice matters: stronger reasoning yields safer pruning plans.
Pruning can sometimes slow inference if channels become misaligned.
Results
Quantization memory reduction
Quantization inference speedup
Pruning parameter reduction
Accuracy
LLM choice effect
Who Should Care
What To Try In 7 Days
Run PyTorch Profiler on your model to gather per-layer latency and memory traces.
Apply full dynamic quantization (qint8) to Linear layers and measure model file size and CPU latency.
Prototype a small agent loop: feed profiling JSON to a prompt (as in Fig.4) and validate suggested layerwise quantization/pruning on a held-out subset.
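A small agent loop needs a guard between the LLM's reply and the model. The sketch below shows one way to build the prompt and validate the returned plan. The JSON schema (`layer`, `action`, `ratio`), the layer names, the profiling numbers, and the `max_ratio` cap are all assumptions made for illustration; the paper's actual prompt (Fig. 4) is not reproduced here, and the canned `reply` stands in for a real LLM call.

```python
import json

ALLOWED_ACTIONS = {"quantize", "prune", "skip"}


def build_prompt(profile: list) -> str:
    """Embed per-layer profiling records in a prompt that asks for JSON output."""
    return (
        "Given these per-layer profiling records, return a JSON list of "
        'objects with keys "layer", "action", "ratio".\n'
        + json.dumps(profile, indent=2)
    )


def validate_plan(reply: str, profile: list, max_ratio: float = 0.5) -> list:
    """Parse the reply and keep only recommendations that are safe to apply."""
    known_layers = {rec["layer"] for rec in profile}
    plan = []
    for item in json.loads(reply):
        if item.get("layer") not in known_layers:
            continue  # drop layer names that do not exist in the profile
        if item.get("action") not in ALLOWED_ACTIONS:
            continue  # drop unknown actions
        ratio = float(item.get("ratio") or 0.0)
        ratio = min(max(ratio, 0.0), max_ratio)  # clamp overly aggressive ratios
        plan.append({"layer": item["layer"], "action": item["action"], "ratio": ratio})
    return plan


# Hypothetical profiling records (names and numbers are illustrative only).
profile = [
    {"layer": "encoder.layer.0.mlp.fc1", "latency_ms": 4.2, "peak_mem_mb": 9.0},
    {"layer": "encoder.layer.0.attn.qkv", "latency_ms": 3.1, "peak_mem_mb": 6.8},
]
prompt = build_prompt(profile)

# Canned reply standing in for the LLM's response to `prompt`.
reply = json.dumps([
    {"layer": "encoder.layer.0.mlp.fc1", "action": "quantize", "ratio": 0.0},
    {"layer": "encoder.layer.0.attn.qkv", "action": "prune", "ratio": 0.9},
    {"layer": "encoder.layer.99.mlp.fc1", "action": "prune", "ratio": 0.3},
])
plan = validate_plan(reply, profile)
print(plan)
```

Clamping and filtering do not replace evaluation on a held-out subset; they only keep a malformed or overly aggressive plan from being applied blindly.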
Agent Features
Memory
- Uses profiling traces (tensor sizes, peak memory) as input signals
Planning
- Iterative pruning loop with evaluation feedback
- LLM-based analysis to generate multi-step compression plans
Tool Use
- PyTorch Profiler
- Hugging Face model/processor retrieval
- ONNX Runtime (baseline)
- PyTorch quantize_dynamic
Frameworks
- Prompt-based LLM reasoning (structured JSON outputs)
- DependencyGraph for safe structured pruning
Is Agentic
true
Architectures
- LLM-guided multi-agent pipeline
- Modular agents: Acquisition, Profiling, Analysis, Pruning, Quantization, Evaluation, Iterative Pruning
Collaboration
- Multiple agents exchange serialized profiling/eval reports
Optimization Features
Infra Optimization
- Designed for CPU-bound acceleration; uses PyTorch quantize_dynamic
Model Optimization
- Structured channel/head pruning
- Layer-selective structured pruning using regex patterns
- Dependency-aware pruning to keep model structure valid
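Layer selection by regex can be illustrated with a few lines of standard-library Python. The layer names and the pattern below are hypothetical examples in the style of ResNet module names, not patterns taken from the paper, and `select_layers` is an illustrative helper rather than part of any pruning library.

```python
import re
from typing import Iterable, List


def select_layers(layer_names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the layer names matched by at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [name for name in layer_names if any(c.search(name) for c in compiled)]


# Hypothetical module names; a real run would read them from the model.
layer_names = [
    "layer1.0.conv1",
    "layer1.0.conv2",
    "layer4.2.conv1",
    "layer4.2.conv3",
    "fc",
]

# Target only the first two convolutions inside the last stage.
patterns = [r"^layer4\..*\.conv[12]$"]
selected = select_layers(layer_names, patterns)
print(selected)
```

The selected names would then be handed to a dependency-aware pruner, which decides which coupled layers must shrink together so the model stays valid.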
System Optimization
- Per-layer profiling to identify CPU/GPU bottlenecks
- Evaluation agent measures end-to-end latency and memory after changes
Training Optimization
- No quantization-aware training; pruning evaluated mostly without finetuning
Inference Optimization
- Post-training dynamic quantization (qint8) applied to Linear layers
- Profiling-driven selection of layers to quantize/prune for latency gains
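A back-of-the-envelope calculation shows why weight memory savings from int8 quantization of Linear layers top out near 75%: each quantized weight shrinks from 4 bytes (fp32) to 1 byte (int8), while all other parameters stay at full precision. The parameter counts below are hypothetical round numbers, not figures from the paper, and the estimate ignores quantization scales, zero-points, and activation memory.

```python
def quantized_size_reduction(
    total_params: int,
    linear_params: int,
    fp_bytes: int = 4,
    int_bytes: int = 1,
) -> float:
    """Fraction of weight storage saved when only Linear weights become int8."""
    before = total_params * fp_bytes
    after = linear_params * int_bytes + (total_params - linear_params) * fp_bytes
    return 1.0 - after / before


# Hypothetical model: 100M parameters, 98M of them in Linear layers.
reduction = quantized_size_reduction(total_params=100_000_000, linear_params=98_000_000)
print(f"{reduction:.1%}")
```

Under these assumptions, models whose parameters sit almost entirely in Linear layers approach the 75% ceiling, while convolution-heavy models benefit less from Linear-only quantization.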
Reproducibility
Data Urls
- ImageNet-1K (public dataset)
- Imagenette (public subset)
- CIFAR-10
- CIFAR-100
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Depends on LLM quality: weaker models can produce overly aggressive pruning plans (Table 5).
- Reported gains were measured on specific hardware (A100, H200) and CPU setups; effects may differ on other platforms.
- Pruning was mostly evaluated without finetuning; larger pruning ratios may need retraining for acceptable accuracy.
- Quantization focused on dynamic PTQ for Linear layers; it does not cover low-bit or advanced static calibration techniques.
When Not To Use
- When you cannot run layerwise profiling (no privileged runtime access).
- When you require extremely aggressive pruning that must be followed by retraining.
- If your deployment uses hardware with different quantization primitives (e.g., custom accelerators) not targeted by PyTorch dynamic quantization.
Failure Modes
- The LLM returns an overly aggressive pruning plan, causing sudden accuracy collapse (observed with GPT-4-Turbo).
- Channel misalignment after structured pruning increases runtime overhead and slows inference (ResNet example).
- Profiling noise or batch-size mismatch can mislead the Analysis Agent into wrong layer priorities.
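One common mitigation for the channel misalignment failure mode, not described in the source, is to round the number of kept channels up to a hardware-friendly multiple before pruning. The sketch below assumes a multiple of 8; the right value depends on the target kernels and is an assumption here, as is the helper name `aligned_channels`.

```python
def aligned_channels(original: int, prune_ratio: float, multiple: int = 8) -> int:
    """Channels to keep after pruning, rounded up to the nearest multiple."""
    if original <= multiple:
        return original  # too small to align; leave the layer untouched
    keep = int(original * (1.0 - prune_ratio))
    aligned = -(-keep // multiple) * multiple  # ceiling to the next multiple
    return max(multiple, min(aligned, original))


# 256 channels pruned by 30% would leave 179; alignment keeps 184 instead.
kept = aligned_channels(256, 0.3)
print(kept)
```

Rounding up prunes slightly less than requested, trading a little compression for channel counts that are more likely to map onto efficient kernels.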
Core Entities
Models
- ResNet-101
- ViT-B/16
- Swin-Base
- DeiT-B/16
Metrics
- Accuracy
- Memory reduction (%)
- Inference latency (s)
- Parameter count (M)
Datasets
- ImageNet-1K
- Imagenette
- CIFAR-10
- CIFAR-100
Benchmarks
- ImageNet-1K evaluation (classification)
- Imagenette
- CIFAR-10
- CIFAR-100

