Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
You can run a function-calling LLM on existing vehicle CPUs with a small memory footprint. This speeds up integration of new features, cuts cloud costs, reduces latency, and improves function-call accuracy over a production speech baseline.
Summary TLDR
The authors show a practical pipeline to shrink and adapt Microsoft Phi-3 mini (3.8B) for in-vehicle function-calling. They remove up to ~2B parameters via structured depth and width pruning, restore capabilities with extended "healing" fine-tuning, add special tokens that map outputs to gRPC vehicle functions, and convert the model to a 4-bit gguf artifact for llama.cpp. Reported results: function-calling accuracy of 0.84–0.88 on their tests, a <2GB RAM footprint, and up to 11 tokens/sec on CPU for the 1.8B variant, which enables on-device inference without accelerators.
Problem Statement
Modern vehicles run on many control units and custom APIs. Rule-based integrations are brittle and costly to extend. Small Language Models (SLMs) can simplify integration by mapping natural language to vehicle function calls, but vehicle head units have tight memory, CPU, and latency limits. The paper asks how an SLM can be compressed and fine-tuned so that it fits automotive hardware and reliably triggers vehicle functions.
Main Contribution
A practical pipeline combining structured pruning, extended healing (fine-tuning), and instruction alignment to shrink Phi-3 mini for vehicles.
A synthetic dataset and special-token scheme that maps language outputs to gRPC-based vehicle functions for function-calling.
A deployment path that merges the LoRA or full fine-tuning (FFT) weights, converts the model to gguf, quantizes it to 4-bit, and runs it with llama.cpp on vehicle head units with <2GB RAM at real-time speeds.
Key Findings
You can remove roughly 2 billion parameters from Phi-3 mini while keeping task accuracy for function-calling.
Fine-tuned, pruned models achieve function-calling accuracy of 0.84–0.88 on the authors' tests.
A pruned, quantized 1.8B model runs at about 11 tokens per second on CPU without hardware acceleration.
4-bit gguf conversion fits the pruned model into a memory budget of less than 2GB of RAM.
Depth-wise (layer/block) pruning is much less harmful than aggressive width-wise pruning.
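The memory finding can be sanity-checked with simple arithmetic. The sketch below estimates weight storage only; the parameter counts come from this summary, while the figure of 4.5 bits per weight is an assumption (4-bit block formats typically store per-block scales on top of the 4-bit values, and the source does not state which quantization variant was used). KV cache and runtime buffers are not included.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts are taken from the summary.
# ASSUMPTION: 4.5 bits per weight for the 4-bit format (scales included).
MODELS = [("Phi-3 mini", 3.8e9), ("Phi3-2.8B", 2.8e9), ("Phi3-1.8B", 1.8e9)]

for name, params in MODELS:
    fp16_gb = weight_memory_gb(params, 16)
    q4_gb = weight_memory_gb(params, 4.5)
    print(f"{name}: 16-bit weights ~{fp16_gb:.1f} GB, 4-bit weights ~{q4_gb:.1f} GB")
```

Under these assumptions the 1.8B variant needs roughly 1 GB for weights, which leaves headroom inside a 2GB budget for the runtime.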
Results
Accuracy
MMLU (4-bit gguf)
Token generation speed (t/s) on CPU
Who Should Care
What To Try In 7 Days
Prototype with Phi-3 mini: depth-prune a contiguous block of layers (n=8) and run a short healing pass.
Create a small synthetic function dataset and add special tokens for function names.
Convert a healed model to gguf, quantize to 4-bit, and test tokens/sec and function exact-match on the target head unit.
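The first step above, removing a contiguous block of layers, can be illustrated without any ML framework. This is a minimal sketch, not the authors' implementation: the 32-layer stack, the layer names, and the start index are assumptions for illustration, and the source does not say how the block position is chosen.

```python
from typing import List, Sequence


def drop_contiguous_block(layers: Sequence[str], start: int, n: int) -> List[str]:
    """Return the layer stack with layers[start:start + n] removed."""
    if n < 0 or start < 0 or start + n > len(layers):
        raise ValueError("block must lie inside the layer stack")
    return list(layers[:start]) + list(layers[start + n:])


# ASSUMPTION: a 32-layer decoder stack with placeholder names.
layers = [f"layer_{i}" for i in range(32)]

# ASSUMPTION: start=20 is arbitrary; n=8 matches the block size suggested above.
pruned = drop_contiguous_block(layers, start=20, n=8)
print(len(layers), "->", len(pruned), "layers")
```

In a real model the same slicing would be applied to the list of transformer blocks, followed by healing, since the layers after the cut now receive inputs they were not trained on.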
Agent Features
Tool Use
- function-calling via gRPC
- special-token -> API mapping
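A minimal sketch of how a special-token scheme might route model output to vehicle functions. Everything named here is hypothetical: the `<fn_N>` token format, the `key=value` argument syntax, and the two handlers are illustrative stand-ins, and plain Python functions take the place of the gRPC calls described in the source.

```python
import re
from typing import Callable, Dict


# HYPOTHETICAL handlers standing in for gRPC vehicle functions.
def set_temperature(args: Dict[str, str]) -> str:
    return f"temperature set to {args.get('celsius', '?')}"


def open_window(args: Dict[str, str]) -> str:
    return f"window {args.get('position', '?')} opened"


# HYPOTHETICAL special tokens, one per vehicle function.
TOKEN_TO_HANDLER: Dict[str, Callable[[Dict[str, str]], str]] = {
    "<fn_0>": set_temperature,
    "<fn_1>": open_window,
}

# ASSUMED output format: "<fn_N>(key=value, key=value)".
CALL_PATTERN = re.compile(r"^(<fn_\d+>)\((.*)\)$")


def dispatch(model_output: str) -> str:
    """Parse one model output and call the mapped handler, if any."""
    match = CALL_PATTERN.match(model_output.strip())
    if match is None or match.group(1) not in TOKEN_TO_HANDLER:
        return "no function call"
    token, raw_args = match.groups()
    args: Dict[str, str] = {}
    for pair in raw_args.split(","):
        if pair.strip():
            key, _, value = pair.partition("=")
            args[key.strip()] = value.strip()
    return TOKEN_TO_HANDLER[token](args)


print(dispatch("<fn_0>(celsius=21)"))
print(dispatch("tell me a joke"))
```

Mapping each function to a single token keeps the generated call short, and any output that does not match a known token falls through to the no-call branch.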
Frameworks
- llama.cpp
- ggml/gguf
- LoRA
Is Agentic
true
Architectures
- decoder-only transformer
Optimization Features
Token Efficiency
- depth-pruned model: ≈2x token throughput vs width-pruned
Infra Optimization
- design for multi-core ARM CPU without accelerators
Model Optimization
- depth-wise (block) structured pruning
- width-wise pruning (neurons/heads) applied
- healing via extended fine-tuning
System Optimization
- use llama.cpp runtime for lightweight execution
Training Optimization
- LoRA
- Instruction tuning on OpenHermes-2.5
Inference Optimization
- 4-bit quantization to gguf
- LoRA weights merged into the base model before gguf conversion
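To compare inference settings on a head unit, token throughput has to be measured the same way each time. The sketch below is one way to do that, assuming the runtime exposes generation as an iterable of tokens; `fake_generate` is a hypothetical stand-in for a real llama.cpp call, and the injectable `clock` exists only to make the function testable.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    generate: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume one generation run and return tokens divided by elapsed seconds."""
    start = clock()
    count = sum(1 for _ in generate())
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed


# HYPOTHETICAL generator that simulates a slow model.
def fake_generate() -> Iterable[str]:
    for token in ["set", "temperature", "to", "21"]:
        time.sleep(0.01)  # simulated per-token latency
        yield token


print(f"{tokens_per_second(fake_generate):.1f} t/s")
```

On target hardware, replace `fake_generate` with the real generation call and average over several prompts, since a single short run is noisy.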
Reproducibility
Data Urls
- fineweb, fineweb-edu, OpenHermes-2.5 (publicly referenced)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Width-wise pruning caused large capability loss; the authors advise limiting parameter removal to ≈30%.
- General language understanding degraded on some benchmarks after heavy pruning and fine-tuning.
- No public code release reported, which limits reproduction.
- Safety evaluation is limited to negative samples and human fluency checks, not systematic robustness tests.
When Not To Use
- If you need full-size LLM capabilities for broad knowledge or reasoning tasks.
- For safety-critical vehicle controls until formal safety and verification are complete.
- When you can afford cloud or accelerator deployment and prefer the latest large models.
Failure Modes
- Model collapse if more than 30% of layers are removed without extensive healing.
- Loss of factual knowledge after short healing; long healing (15B tokens) is required to recover it.
- CPU usage spikes (up to 400%, summed across cores) during inference on the head unit.
Core Entities
Models
- Phi-3 mini (3.8B)
- Phi3-2.8B (depth-pruned)
- Phi3-1.8B (depth+width-pruned)
- LoRA (adapter method used for fine-tuning)
- Octopus v2 (inspiration)
Metrics
- function-calling exact-match
- MMLU score
- TruthfulQA score
- HellaSwag score
- tokens per second (t/s)
- memory (GB)
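The function-calling exact-match metric can be computed as the share of predictions that equal the reference call. This is a minimal sketch; stripping surrounding whitespace before comparison is an assumption, as the source does not describe its normalization, and the example strings are made up.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# HYPOTHETICAL outputs and gold calls for illustration only.
predicted = ["<fn_0>(celsius=21)", "<fn_1>(position=rear)", "<fn_0>(celsius=19) "]
gold = ["<fn_0>(celsius=21)", "<fn_1>(position=front)", "<fn_0>(celsius=19)"]
print(exact_match_accuracy(predicted, gold))
```

Exact match is strict: a call with the right function but one wrong argument counts as a miss, which suits an interface where a partially correct call is still the wrong action.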
Datasets
- fineweb
- fineweb-edu
- OpenHermes-2.5 (instruction tuning)
- synthetic in-vehicle function-calling dataset (25k pos, 500 neg)
Benchmarks
- MMLU
- HellaSwag
- TruthfulQA
- Winogrande
- ARC
Context Entities
Models
- Gemma (2B)
- Mistral (7B)
- Llama-3 (8B)
Datasets
- fineweb-edu (healing)
- OpenHermes-2.5 (instruction)

