Prune, heal, and quantize a 3.8B SLM to run reliable on-device vehicle function-calling at 11 t/s

Overview

Decision Snapshot: Needs Validation

The paper demonstrates an end-to-end path (prune → heal → SFT → gguf 4-bit → llama.cpp) with measured accuracy and CPU speed on a vehicle head unit, but lacks public code and broad real-world testing.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli

Links

Abstract / PDF / Data

Why It Matters For Business

You can run a function-calling LLM on existing vehicle CPUs within a small memory budget and integrate new features faster. This cuts cloud costs and latency while improving function-call accuracy over a production speech baseline.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The authors show a practical pipeline to shrink and adapt Microsoft Phi-3 mini (3.8B) for in-vehicle function-calling. They remove up to ~2B parameters via structured depth and width pruning, restore capabilities with extended "healing" fine-tuning, add special tokens that map outputs to gRPC vehicle functions, and convert the model to a 4-bit gguf artifact for llama.cpp. Result: function-calling accuracy 0.84–0.88 on their tests, <2GB RAM footprint, and up to 11 tokens/sec on CPU for the 1.8B variant — enabling on-device inference without accelerators.
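
As a rough sanity check on the memory figure above, the weight footprint of a quantized model can be estimated from its parameter count and the bits stored per weight. The sketch below is illustrative only: the 4.5 bits-per-weight value is an assumption (block-quantized 4-bit formats usually carry some extra scale data), not a number from the paper, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts come from the summary; bits_per_weight=4.5 is an assumed value.
for name, n_params in [("Phi3-3.8B", 3.8e9), ("Phi3-2.8B", 2.8e9), ("Phi3-1.8B", 1.8e9)]:
    print(f"{name}: ~{weight_footprint_gb(n_params, 4.5):.2f} GB of weights")
```

Under these assumptions the 1.8B variant needs roughly 1 GB for weights, which leaves headroom inside the reported <2GB footprint.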

Problem Statement

Modern vehicles run on many control units and custom APIs. Rule-based integrations are brittle and costly to extend. Small Language Models (SLMs) can simplify integration by mapping natural language to vehicle function calls, but vehicle head units have tight memory, CPU, and latency limits. The paper asks: how to compress and fine-tune an SLM so it fits automotive hardware and reliably triggers vehicle functions?

Main Contribution

A practical pipeline combining structured pruning, extended healing (fine-tuning), and instruction alignment to shrink Phi-3 mini for vehicles.

A synthetic dataset and special-token scheme that maps language outputs to gRPC-based vehicle functions for function-calling.
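
A minimal sketch of how a special-token scheme could be turned into a function call on the application side. The token names (`<fn_0>`, `<fn_1>`), the function names, and the JSON argument format are all hypothetical; the summary does not show the paper's actual tokens or output format. In a real system the handlers would wrap gRPC client stubs rather than plain Python callables.

```python
import json
import re
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical token -> function-name table (not taken from the paper).
TOKEN_TO_FUNCTION: Dict[str, str] = {
    "<fn_0>": "set_cabin_temperature",
    "<fn_1>": "open_window",
}

# A call is assumed to be one special token, optionally followed by a JSON object.
_CALL_PATTERN = re.compile(r"^(<fn_\d+>)\s*(\{.*\})?\s*$", re.DOTALL)


def parse_call(model_output: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (function_name, arguments), or None if the output is not a known call."""
    match = _CALL_PATTERN.match(model_output.strip())
    if match is None:
        return None
    token, raw_args = match.group(1), match.group(2)
    if token not in TOKEN_TO_FUNCTION:
        return None
    try:
        args = json.loads(raw_args) if raw_args else {}
    except json.JSONDecodeError:
        return None
    return TOKEN_TO_FUNCTION[token], args


def dispatch(call: Tuple[str, Dict[str, Any]], handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Invoke the handler registered for the parsed function name."""
    name, args = call
    return handlers[name](**args)


# Usage with a stand-in handler instead of a real gRPC stub.
handlers = {"set_cabin_temperature": lambda zone, celsius: f"{zone} set to {celsius} C"}
call = parse_call('<fn_0> {"zone": "driver", "celsius": 21}')
if call is not None:
    print(dispatch(call, handlers))
```

Returning `None` for anything that is not a recognized call gives the caller a single place to handle refusals and malformed outputs.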

Key Findings

You can remove roughly 2 billion parameters from Phi-3 mini while keeping task accuracy for function-calling.

Numbers: Phi3-3.8B → Phi3-1.8B (≈2B removed)

Practical Use: Target depth-wise pruning first to cut model size by up to ~50%, then heal and fine-tune before deployment to preserve function-call capability.

Evidence Ref: Table 1; Sec. 3.1
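
Depth-wise pruning as described here amounts to deleting a contiguous block of decoder layers and keeping the rest in order. The sketch below shows only that selection step on a stand-in list; the layer count of 32 and the start index are assumptions for illustration, and how the authors choose which block to remove is not covered by this summary.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_contiguous_block(layers: Sequence[T], start: int, n: int) -> List[T]:
    """Return the layer stack with layers[start:start + n] removed."""
    if n < 0 or start < 0 or start + n > len(layers):
        raise ValueError("block must lie inside the layer stack")
    return list(layers[:start]) + list(layers[start + n:])


# Stand-in for a decoder stack; 32 layers and start=20 are assumed values.
layers = [f"layer_{i}" for i in range(32)]
pruned = drop_contiguous_block(layers, start=20, n=8)
print(len(pruned), pruned[19], pruned[20])
```

With a real model the same slicing would be applied to the container holding the decoder blocks, followed by the healing step to recover lost capability.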

Fine-tuned, pruned models achieve high function-calling accuracy on the authors' tests.

Numbers: function-calling accuracy 0.84–0.88 (4-bit gguf)

Practical Use: You can deploy small, quantized SLMs for production-like function control and expect better accuracy than the cited production speech baseline (0.75).

Evidence Ref: Table 4; Sec. 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.88	production speech system 0.75	↑0.13 vs production	synthetic function-calling test (4-bit gguf)	Phi3-2.8B + h long + SFT + LoRA/FFT achieve 0.88	Table 4, Sec.4.2
MMLU (4-bit gguf)	39.1	Phi3-3.8B baseline 39.1	Phi3-2.8B ≈ 34.5 (drop ~4.6 points)	MMLU benchmark	Phi3-3.8B+LoRA 39.1; Phi3-2.8B+h long+SFT 34.51	Table 4, Sec.4.2
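
For readers who want to reproduce an accuracy figure of this kind on their own data, a minimal exact-match scorer is sketched below. It assumes a prediction counts as correct only when it equals the reference call string after trimming whitespace; the paper's precise scoring rule is not given in this summary, and the example strings are made up.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after whitespace trimming."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up model outputs and reference calls, for illustration only.
preds = ["<fn_0> 21", "<fn_1>", "<fn_2> on", "<fn_1>"]
refs = ["<fn_0> 21", "<fn_1>", "<fn_3> on", "<fn_1> "]
accuracy = exact_match_accuracy(preds, refs)
print(f"accuracy={accuracy:.2f}")
```

The delta reported in the table is then a plain difference between two such scores (0.88 versus the 0.75 baseline gives 0.13).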

What To Try In 7 Days

Prototype with Phi-3 mini: depth prune a contiguous block (n=8) and run short healing.

Create a small synthetic function dataset and add special tokens for function names.

Convert a healed model to gguf, quantize to 4-bit, and test tokens/sec and function exact-match on the target head unit.
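
The throughput check in the last step can be sketched as a small timing harness. This is a generic sketch: `generate` is a placeholder for whatever call your runtime binding exposes, and `fake_generate` below is a dummy stand-in so the example runs without a model. Neither name comes from the paper or from llama.cpp.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[str], List[str]], prompt: str, runs: int = 3) -> float:
    """Average generated tokens per second over several runs of `generate`."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds if total_seconds > 0 else float("inf")


def fake_generate(prompt: str) -> List[str]:
    """Dummy generator: sleeps briefly and echoes the prompt words as 'tokens'."""
    time.sleep(0.01)
    return prompt.split()


rate = tokens_per_second(fake_generate, "turn on the seat heating", runs=2)
print(f"{rate:.1f} tokens/s")
```

Measure on the target head unit itself, since numbers taken on a development machine will not transfer to the vehicle CPU.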

Agent Features

Tool Use

function-calling via gRPC, special-token -> API mapping

Frameworks

llama.cpp, ggml/gguf, LoRA

Is Agentic

Yes

Architectures

decoder-only transformer

Optimization Features

Token Efficiency

depth-pruned model: ≈2x token throughput vs width-pruned

Infra Optimization

design for multi-core ARM CPU without accelerators

Model Optimization

depth-wise (block) structured pruning, width-wise pruning (neurons/heads) applied, healing via extended fine-tuning

System Optimization

use llama.cpp runtime for lightweight execution

Training Optimization

LoRA, Instruction tuning on OpenHermes-2.5

Inference Optimization

4-bit quantization to gguf, LoRA

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

fineweb, fineweb-edu, OpenHermes-2.5 (publicly referenced)

Risks & Boundaries

Limitations

Width-wise pruning caused large capability loss; the authors advise limiting removal to ≈30% of parameters.

General language understanding degraded on some benchmarks after heavy pruning and fine-tuning.

When Not To Use

If you need full-size LLM capabilities for broad knowledge or reasoning tasks.

For safety-critical vehicle controls until formal safety and verification are complete.

Failure Modes

Model collapse if more than 30% of layers are removed without extensive healing.

Loss of factual knowledge after short healing; requires long healing (15B tokens) to recover.

Core Entities

Models

Phi-3 mini (3.8B), Phi3-2.8B (depth-pruned), Phi3-1.8B (depth+width-pruned), LoRA, Octopus v2 (inspiration)

Metrics

function-calling exact-match, MMLU score, TruthfulQA score, HellaSwag score, tokens per second (t/s), memory (GB)

Datasets

fineweb, fineweb-edu, OpenHermes-2.5 (instruction tuning), synthetic in-vehicle function-calling dataset (25k pos, 500 neg)

Benchmarks

MMLU, HellaSwag, TruthfulQA, Winogrande, ARC

Context Entities

Models

Gemma (2B), Mistral (7B), Llama-3 (8B)

Datasets

fineweb-edu (healing), OpenHermes-2.5 (instruction)

