Prune, heal, and quantize a 3.8B SLM to run reliable on-device vehicle function-calling at 11 t/s

January 4, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper demonstrates an end-to-end path (prune → heal → SFT → gguf 4-bit → llama.cpp) with measured accuracy and CPU speed on a vehicle head unit, but lacks public code and broad real-world testing.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli

Links

Abstract / PDF / Data

Why It Matters For Business

You can run a function-calling LLM on existing vehicle CPUs with a small memory footprint and integrate new features faster. This cuts cloud costs and latency while improving function-call accuracy over a production speech baseline.

Who Should Care

Summary TLDR

The authors show a practical pipeline to shrink and adapt Microsoft Phi-3 mini (3.8B) for in-vehicle function-calling. They remove up to ~2B parameters via structured depth and width pruning, restore capabilities with extended "healing" fine-tuning, add special tokens that map outputs to gRPC vehicle functions, and convert the model to a 4-bit gguf artifact for llama.cpp. Result: function-calling accuracy 0.84–0.88 on their tests, <2GB RAM footprint, and up to 11 tokens/sec on CPU for the 1.8B variant — enabling on-device inference without accelerators.
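The memory claim can be sanity-checked with rough arithmetic. The sketch below estimates weight storage only, and it assumes about 4.5 bits per weight (4-bit values plus a per-block scale, as in llama.cpp's Q4_0 layout). The source says only "4-bit gguf", so the exact quantization type is an assumption, and the KV cache and runtime buffers are not counted.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model.

    bits_per_weight=4.5 is an assumption: 4-bit weights plus per-block
    scale overhead. KV cache and runtime buffers are not included.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are the nominal sizes named in the summary.
for name, n_params in [("Phi3-3.8B", 3.8e9), ("Phi3-2.8B", 2.8e9), ("Phi3-1.8B", 1.8e9)]:
    print(f"{name}: ~{quantized_weight_gb(n_params):.2f} GB of weights")
```

Under these assumptions the 1.8B variant needs roughly 1 GB for weights, which leaves headroom under the reported 2 GB footprint, while the unpruned 3.8B model would already exceed 2 GB on weights alone.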

Problem Statement

Modern vehicles run on many control units and custom APIs. Rule-based integrations are brittle and costly to extend. Small Language Models (SLMs) can simplify integration by mapping natural language to vehicle function calls, but vehicle head units have tight memory, CPU, and latency limits. The paper asks: how to compress and fine-tune an SLM so it fits automotive hardware and reliably triggers vehicle functions?

Main Contribution

A practical pipeline combining structured pruning, extended healing (fine-tuning), and instruction alignment to shrink Phi-3 mini for vehicles.

A synthetic dataset and special-token scheme that maps language outputs to gRPC-based vehicle functions for function-calling.
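To illustrate the special-token idea, here is a minimal sketch of how a head-unit application might turn a model output into a function name and arguments before issuing a gRPC call. The token names (`<fn_0>` etc.), the function names, and the `token(args)` output format are all hypothetical; the summary does not describe the authors' actual token vocabulary or call syntax.

```python
import re

# Hypothetical special tokens and vehicle function names, for illustration only.
TOKEN_TO_FUNCTION = {
    "<fn_0>": "set_cabin_temperature",
    "<fn_1>": "open_window",
    "<fn_2>": "set_ambient_light",
}

# Assumed output format: one special token followed by key=value arguments.
CALL_PATTERN = re.compile(r"^(<fn_\d+>)\((.*)\)$")


def parse_call(model_output: str):
    """Return (function_name, kwargs), or None if the output is not a known call."""
    match = CALL_PATTERN.match(model_output.strip())
    if match is None:
        return None
    token, arg_text = match.groups()
    name = TOKEN_TO_FUNCTION.get(token)
    if name is None:
        return None  # unknown token: refuse rather than guess
    kwargs = {}
    for part in filter(None, (p.strip() for p in arg_text.split(","))):
        key, sep, value = part.partition("=")
        if not sep:
            return None  # malformed argument
        kwargs[key.strip()] = value.strip()
    return name, kwargs


print(parse_call("<fn_0>(celsius=21)"))
```

Returning `None` for unknown tokens or malformed arguments keeps the dispatcher from calling a vehicle function on output it cannot fully parse.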

Key Findings

You can remove roughly 2 billion parameters from Phi-3 mini while keeping task accuracy for function-calling.

Numbers: Phi3-3.8B → Phi3-1.8B (≈2B removed)

Practical Use: Target depth-wise pruning first to cut model size by up to ~50%, then heal and fine-tune before deployment to preserve function-call capability.

Evidence Ref: Table 1; Sec.3.1
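Depth-wise pruning amounts to dropping a contiguous block of transformer layers and keeping the rest in order. The sketch below shows only that slicing step on a plain Python list. The 32-layer stack and the start index are assumptions for illustration, and the summary does not say how the authors choose which block to remove.

```python
def depth_prune(layers, start, n):
    """Return a new layer list with the contiguous block [start, start + n) removed."""
    if n < 0 or start < 0 or start + n > len(layers):
        raise ValueError("block must lie inside the layer stack")
    return layers[:start] + layers[start + n:]


# Stand-in for a decoder stack; 32 layers is an assumed depth, not taken from the summary.
layers = [f"layer_{i}" for i in range(32)]

# start=20 is a hypothetical choice; n=8 matches the block size mentioned in this summary.
pruned = depth_prune(layers, start=20, n=8)
print(len(layers), "->", len(pruned))
```

In a real model the same slice would be applied to the module list that holds the decoder blocks, followed by the healing run, since the pruned network has not been trained with the remaining layers adjacent to each other.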

Fine-tuned, pruned models achieve high function-calling accuracy on the authors' tests.

Numbers: function-calling accuracy 0.84–0.88 (4-bit gguf)

Practical Use: You can deploy small, quantized SLMs for production-like function control and expect better accuracy than the cited production speech baseline (0.75).

Evidence Ref: Table 4; Sec.4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.88 | production speech system 0.75 | 0.13 vs production | synthetic function-calling test (4-bit gguf) | Phi3-2.8B + h long + SFT + LoRA/FFT achieve 0.88 | Table 4, Sec.4.2
MMLU (4-bit gguf) | 39.1 | Phi3-3.8B baseline 39.1 | Phi3-2.8B ≈ 34.5 (drop ~4.6 points) | MMLU benchmark | Phi3-3.8B+LoRA 39.1; Phi3-2.8B+h long+SFT 34.51 | Table 4, Sec.4.2

What To Try In 7 Days

Prototype with Phi-3 mini: depth prune a contiguous block (n=8) and run short healing.

Create a small synthetic function dataset and add special tokens for function names.

Convert a healed model to gguf, quantize to 4-bit, and test tokens/sec and function exact-match on the target head unit.
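For the last step, a small harness can report both numbers in one pass. The sketch below is a minimal, runtime-agnostic version: `generate` is a hypothetical callable that wraps whatever inference runtime is used on the head unit and returns the output text plus the number of generated tokens. The stub generator and the example prompts are placeholders, not data from the paper.

```python
import time


def evaluate(generate, examples):
    """Measure exact-match accuracy and tokens/sec.

    generate(prompt) -> (text, n_tokens) is a hypothetical wrapper around
    the on-device runtime; examples is a list of (prompt, expected) pairs.
    """
    correct = 0
    total_tokens = 0
    start = time.perf_counter()
    for prompt, expected in examples:
        text, n_tokens = generate(prompt)
        correct += int(text.strip() == expected.strip())
        total_tokens += n_tokens
    elapsed = time.perf_counter() - start
    tokens_per_sec = total_tokens / elapsed if elapsed > 0 else float("inf")
    return correct / len(examples), tokens_per_sec


def fake_generate(prompt):
    """Stand-in for a real model call: always emits the same 5-token answer."""
    return "<fn_1>()", 5


# Placeholder examples with a hypothetical call format.
examples = [
    ("open the window", "<fn_1>()"),
    ("make it warmer", "<fn_0>(celsius=23)"),
]
accuracy, tps = evaluate(fake_generate, examples)
print(f"exact match: {accuracy:.2f}")
```

Because the timer wraps the whole loop, the throughput figure includes prompt processing as well as generation; time only the decode phase if you want a number comparable to a pure tokens/sec benchmark.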

Agent Features

Tool Use
function-calling via gRPC, special-token -> API mapping
Frameworks
llama.cpp, ggml/gguf, LoRA
Is Agentic

Yes

Architectures
decoder-only transformer

Optimization Features

Token Efficiency
depth-pruned model: ≈2x token throughput vs width-pruned
Infra Optimization
design for multi-core ARM CPU without accelerators
Model Optimization
depth-wise (block) structured pruning, width-wise pruning (neurons/heads) applied, healing via extended fine-tuning
System Optimization
use llama.cpp runtime for lightweight execution
Training Optimization
LoRA, instruction tuning on OpenHermes-2.5
Inference Optimization
4-bit quantization to gguf, LoRA

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

fineweb, fineweb-edu, OpenHermes-2.5 (publicly referenced)

Risks & Boundaries

Limitations

Width-wise pruning caused large capability loss; the authors advise limiting it to ≈30% parameter removal.

General language understanding degraded on some benchmarks after heavy pruning and fine-tuning.

When Not To Use

If you need full-size LLM capabilities for broad knowledge or reasoning tasks.

For safety-critical vehicle controls until formal safety and verification are complete.

Failure Modes

Model collapse if >30% of layers removed without extensive healing.

Loss of factual knowledge after short healing; requires long healing (15B tokens) to recover.

Core Entities

Models

Phi-3 mini (3.8B), Phi3-2.8B (depth-pruned), Phi3-1.8B (depth+width-pruned), LoRA, Octopus v2 (inspiration)

Metrics

function-calling exact-match, MMLU score, TruthfulQA score, HellaSwag score, tokens per second (t/s), memory (GB)

Datasets

fineweb, fineweb-edu, OpenHermes-2.5 (instruction tuning), synthetic in-vehicle function-calling dataset (25k pos, 500 neg)

Benchmarks

MMLU, HellaSwag, TruthfulQA, Winogrande, ARC

Context Entities

Models

Gemma (2B), Mistral (7B), Llama-3 (8B)

Datasets

fineweb-edu (healing), OpenHermes-2.5 (instruction)