Prune, heal, and quantize a 3.8B SLM for reliable on-device vehicle function-calling at 11 t/s

January 4, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli

Links

Abstract / PDF

Why It Matters For Business

You can run a function-calling LLM on existing vehicle CPUs within a small memory budget and integrate new features faster. This cuts cloud costs and latency while improving function-call accuracy over a production speech baseline.

Summary TLDR

The authors show a practical pipeline to shrink and adapt Microsoft Phi-3 mini (3.8B) for in-vehicle function-calling. They remove up to ~2B parameters via structured depth and width pruning, restore capabilities with extended "healing" fine-tuning, add special tokens that map outputs to gRPC vehicle functions, and convert the model to a 4-bit gguf artifact for llama.cpp. Result: function-calling accuracy of 0.84–0.88 on their tests, a <2GB RAM footprint, and up to 11 tokens/sec on CPU for the 1.8B variant, which enables on-device inference without accelerators.

Problem Statement

Modern vehicles run on many control units and custom APIs. Rule-based integrations are brittle and costly to extend. Small Language Models (SLMs) can simplify integration by mapping natural language to vehicle function calls, but vehicle head units have tight memory, CPU, and latency limits. The paper asks: how to compress and fine-tune an SLM so it fits automotive hardware and reliably triggers vehicle functions?

Main Contribution

A practical pipeline combining structured pruning, extended healing (fine-tuning), and instruction alignment to shrink Phi-3 mini for vehicles.

A synthetic dataset and special-token scheme that maps language outputs to gRPC-based vehicle functions for function-calling.

A deployment path using LoRA/FFT merge, convert-to-gguf, 4-bit quantization, and llama.cpp to run on vehicle head units with <2GB RAM and real-time speeds.
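The special-token scheme can be pictured as a lookup from a reserved token to a vehicle function handler. The sketch below is illustrative only: the token names (`<veh_fn_0>`, `<veh_fn_1>`), the handler names, and the `'<token>(args)'` output format are assumptions, and the handlers return strings that stand in for real gRPC calls, since the paper's actual vocabulary and service definitions are not given here.

```python
from typing import Callable, Dict, Optional


# Hypothetical handlers: each returns a string standing in for a gRPC call.
def set_temperature(args: str) -> str:
    return f"grpc:Climate.SetTemperature({args})"


def open_window(args: str) -> str:
    return f"grpc:Body.OpenWindow({args})"


# Hypothetical special tokens, one per vehicle function.
TOKEN_TO_HANDLER: Dict[str, Callable[[str], str]] = {
    "<veh_fn_0>": set_temperature,
    "<veh_fn_1>": open_window,
}


def dispatch(model_output: str) -> Optional[str]:
    """Route output of the assumed form '<token>(args)' to its handler.

    Returns None when the output does not start with a known special
    token, which is how a non-function utterance falls through.
    """
    text = model_output.strip()
    for token, handler in TOKEN_TO_HANDLER.items():
        if text.startswith(token):
            rest = text[len(token):].strip()
            # Strip one pair of enclosing parentheses if present.
            if rest.startswith("(") and rest.endswith(")"):
                rest = rest[1:-1]
            return handler(rest)
    return None
```

Keeping one token per function means the runtime only has to match a short prefix rather than parse a free-form function name, which is one plausible reason for this kind of design.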

Key Findings

You can remove roughly 2 billion parameters from Phi-3 mini while keeping task accuracy for function-calling.

Numbers: Phi3-3.8B → Phi3-1.8B (≈2B removed)

Fine-tuned, pruned models achieve high function-calling accuracy on the authors' tests.

Numbers: function-calling accuracy 0.84–0.88 (4-bit gguf)

A pruned, quantized 1.8B model runs at about 11 tokens per second on CPU without hardware acceleration.

Numbers: 11.02 tokens/sec (Phi3 1.8B, 4-bit gguf)

4-bit gguf conversion fits the pruned model into a small memory budget.

Numbers: <2 GB RAM for pruned Phi3 in gguf 4-bit

Depth-wise (layer/block) pruning is much less harmful than aggressive width-wise pruning.

Numbers: Width pruning lowered avg benchmark scores to 0.42 (Phi3-1.8B) vs baseline 0.59
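A back-of-the-envelope check shows why the memory finding above is plausible. This sketch counts weight storage only, assuming exactly 4 bits per weight; real 4-bit formats also store per-block scales, so the effective figure is somewhat higher, and the KV cache and runtime buffers come on top. The 16-bit comparison is an assumption added for contrast, not a number from the paper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Pruned 1.8B model at an idealized 4 bits per weight (assumption).
pruned_4bit_gb = weight_memory_gb(1.8e9, 4.0)

# Unpruned 3.8B model at 16 bits per weight, for contrast (assumption).
base_16bit_gb = weight_memory_gb(3.8e9, 16.0)

print(f"pruned, 4-bit: ~{pruned_4bit_gb:.1f} GB")
print(f"base, 16-bit:  ~{base_16bit_gb:.1f} GB")
```

Under these assumptions the pruned weights take roughly half of the 2 GB budget, leaving headroom for quantization overhead and the inference runtime.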

Results

Accuracy

Value: 0.88

Baseline: production speech system 0.75

MMLU (4-bit gguf)

Value: 39.1

Baseline: Phi3-3.8B baseline 39.1

Token generation speed (t/s) on CPU

Value: 11.02

Baseline: Phi3 base 6.76 t/s
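The reported values translate into relative improvements with simple arithmetic. The snippet below uses only the figures listed above; the variable names are illustrative.

```python
# Figures taken from the results above.
pruned_tps, base_tps = 11.02, 6.76
model_acc, speech_baseline_acc = 0.88, 0.75

# Throughput of the pruned model relative to the base model.
speedup = pruned_tps / base_tps

# Accuracy gain over the production speech baseline.
acc_gain_abs = model_acc - speech_baseline_acc
acc_gain_rel = acc_gain_abs / speech_baseline_acc

print(f"speedup: {speedup:.2f}x")
print(f"accuracy gain: {acc_gain_abs:.2f} absolute, {acc_gain_rel:.1%} relative")
```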

Who Should Care

What To Try In 7 Days

Prototype with Phi-3 mini: depth prune a contiguous block (n=8) and run short healing.

Create a small synthetic function dataset and add special tokens for function names.

Convert a healed model to gguf, quantize to 4-bit, and test tokens/sec and function exact-match on the target head unit.
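For the measurement step, a minimal harness might look like the following. It is a sketch under assumptions: `generate` is a hypothetical callable that wraps whatever runtime you use and returns the generated tokens as a list, and exact-match is computed on whitespace-trimmed strings, which may differ from the authors' scoring.

```python
import time
from typing import Callable, List


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions identical to their reference after trimming."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def tokens_per_second(generate: Callable[[str], List[str]], prompt: str) -> float:
    """Time one call to the hypothetical `generate` and report tokens/sec."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")
```

Run the timing on the target head unit rather than a development machine, since throughput depends heavily on the CPU and on how many cores the runtime may use.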

Agent Features

Tool Use

  • function-calling via gRPC
  • special-token -> API mapping

Frameworks

  • llama.cpp
  • ggml/gguf
  • LoRA

Is Agentic

true

Architectures

  • decoder-only transformer

Optimization Features

Token Efficiency

  • depth-pruned model: ≈2x token throughput vs width-pruned

Infra Optimization

  • design for multi-core ARM CPU without accelerators

Model Optimization

  • depth-wise (block) structured pruning
  • width-wise pruning (neurons/heads) applied
  • healing via extended fine-tuning
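Depth-wise pruning, as listed above, amounts to deleting a contiguous run of transformer blocks and keeping the rest in order. The sketch below works on any sequence of layers and does not touch a real model. The 32-block depth is an assumption about Phi-3 mini, and the start index is hypothetical; choosing which block to remove is the part that requires measurement.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_contiguous_block(layers: Sequence[T], start: int, n: int) -> List[T]:
    """Return the layers with the n blocks beginning at `start` removed."""
    if n < 0 or start < 0 or start + n > len(layers):
        raise ValueError("block to remove falls outside the layer stack")
    return list(layers[:start]) + list(layers[start + n:])


# Stand-in for a decoder stack; 32 blocks is an assumption.
layers = [f"block_{i}" for i in range(32)]

# Remove n=8 blocks starting at a hypothetical index.
pruned = drop_contiguous_block(layers, start=20, n=8)

# Fraction of depth removed, to compare against the ~30% guidance.
removed_fraction = (len(layers) - len(pruned)) / len(layers)
print(len(pruned), removed_fraction)
```

With these assumed numbers, removing eight blocks takes out a quarter of the depth, which stays under the roughly 30% threshold the summary associates with model collapse. The pruned model still needs healing afterwards.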

System Optimization

  • use llama.cpp runtime for lightweight execution

Training Optimization

  • LoRA
  • Instruction tuning on OpenHermes-2.5

Inference Optimization

  • 4-bit quantization to gguf
  • LoRA

Reproducibility

Data Urls

  • fineweb, fineweb-edu, OpenHermes-2.5 (publicly referenced)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Width-wise pruning caused large capability loss; the authors advise limiting removal to ≈30% of parameters.
  • General language understanding degraded on some benchmarks after heavy pruning and fine-tuning.
  • No public code release reported, which limits reproduction.
  • Safety evaluation is limited to negative samples and human fluency checks, not systematic robustness tests.

When Not To Use

  • If you need full-size LLM capabilities for broad knowledge or reasoning tasks.
  • For safety-critical vehicle controls until formal safety and verification are complete.
  • When you can afford cloud/accelerator deployment and prefer latest large models.

Failure Modes

  • Model collapse if >30% of layers removed without extensive healing.
  • Loss of factual knowledge after short healing; requires long healing (15B tokens) to recover.
  • CPU usage spikes (up to 400% across cores) during inference on head unit.

Core Entities

Models

  • Phi-3 mini (3.8B)
  • Phi3-2.8B (depth-pruned)
  • Phi3-1.8B (depth+width-pruned)
  • LoRA
  • Octopus v2 (inspiration)

Metrics

  • function-calling exact-match
  • MMLU score
  • TruthfulQA score
  • HellaSwag score
  • tokens per second (t/s)
  • memory (GB)

Datasets

  • fineweb
  • fineweb-edu
  • OpenHermes-2.5 (instruction tuning)
  • synthetic in-vehicle function-calling dataset (25k pos, 500 neg)

Benchmarks

  • MMLU
  • HellaSwag
  • TruthfulQA
  • Winogrande
  • ARC

Context Entities

Models

  • Gemma (2B)
  • Mistral (7B)
  • Llama-3 (8B)

Datasets

  • fineweb-edu (healing)
  • OpenHermes-2.5 (instruction)