Overview
The paper demonstrates a full pipeline and an on-device demo with concrete numbers (success rates, recall, latency) on a Mac-specific assistant; results are strong for this scope but generalization beyond the tested tool set and platform is not shown.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
TinyAgent shows that private, low-latency assistant features can run on-device with small models that match or exceed GPT-4-Turbo on task-specific API orchestration.
Summary TLDR
TinyAgent is an end-to-end workflow for training small language models (1.1B and 7B) to generate function-calling plans and run locally on a MacBook. Key steps: synthesize a high-quality function-calling dataset (80K train, 1K validation, 1K test), fine-tune the SLMs with LoRA, use a classifier-based Tool RAG to include only relevant API descriptions in prompts, and quantize to 4-bit for faster on-device inference. After LoRA fine-tuning, the models reach 78.89% (1.1B) and 83.09% (7B) function-calling success on a Mac assistant benchmark, against 79.08% for GPT-4-Turbo on the same task. The dataset, models, and installer are open-sourced.
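The classifier-based Tool RAG step can be pictured with a minimal sketch. It assumes a classifier has already produced one relevance probability per tool; the tool catalog, the scores, and the 0.5 threshold are all hypothetical and do not come from the paper.

```python
from typing import Dict, List

# Hypothetical tool catalog: names and descriptions are illustrative only.
TOOLS: Dict[str, str] = {
    "create_calendar_event": "Create a calendar event with a title, time, and attendees.",
    "compose_new_email": "Draft an email to one or more recipients.",
    "open_note": "Open an existing note by name.",
    "maps_show_directions": "Show directions between two locations.",
}


def select_tools(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep tools whose (assumed) classifier probability meets the threshold."""
    return [name for name in TOOLS if scores.get(name, 0.0) >= threshold]


def build_prompt(query: str, selected: List[str]) -> str:
    """Include only the selected tool descriptions in the planner prompt."""
    lines = ["Available tools:"]
    lines += [f"- {name}: {TOOLS[name]}" for name in selected]
    lines.append(f"User request: {query}")
    return "\n".join(lines)


# Made-up scores standing in for a multi-label classifier's output.
scores = {
    "create_calendar_event": 0.93,
    "compose_new_email": 0.71,
    "open_note": 0.04,
    "maps_show_directions": 0.02,
}
selected = select_tools(scores)
prompt = build_prompt("Invite Sam to lunch and email him the details", selected)
print(prompt)
```

Dropping the two low-scoring tools shortens the prompt, which is the point of the step: fewer tool descriptions mean fewer input tokens for the on-device model to process.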
Problem Statement
Large cloud LLMs can orchestrate APIs but are too big for private, offline, low-latency use on devices. Off-the-shelf small models lack reliable function-calling and orchestration. The question: can small, task-specialized models run on-device and match large models' function-calling ability?
Main Contribution
A pipeline to teach small LLMs to produce function-calling plans using the LLMCompiler planner.
A curated function-calling dataset: 80K training, 1K validation, 1K test (synthesized and sanity-checked).
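To make "function-calling plan" concrete, here is a small parsing sketch. The plan syntax is an assumed, simplified rendering of an LLMCompiler-style plan (numbered calls, `$k` placeholders that refer to the output of step k, and a closing `join()`); the function names and arguments are hypothetical.

```python
import re
from typing import List, NamedTuple


class Step(NamedTuple):
    idx: int          # step number as written in the plan
    name: str         # function to call
    raw_args: str     # unparsed argument text
    deps: List[int]   # earlier steps referenced via $k placeholders


# Matches lines such as: 2. create_calendar_event("Lunch", $1)
STEP_RE = re.compile(r"^\s*(\d+)\.\s*(\w+)\((.*)\)\s*$")


def parse_plan(text: str) -> List[Step]:
    """Parse a numbered plan into steps with their dependencies."""
    steps = []
    for line in text.strip().splitlines():
        m = STEP_RE.match(line)
        if m is None:
            raise ValueError(f"Unparseable plan line: {line!r}")
        idx, name, raw_args = int(m.group(1)), m.group(2), m.group(3)
        # Simplification: any $k in the argument text counts as a dependency,
        # even if it appears inside a quoted string.
        deps = sorted({int(d) for d in re.findall(r"\$(\d+)", raw_args)})
        steps.append(Step(idx, name, raw_args, deps))
    return steps


plan_text = '''
1. get_email_address("Sam")
2. create_calendar_event("Lunch", "2024-05-01 12:00", $1)
3. join()
'''
steps = parse_plan(plan_text)
for step in steps:
    print(step.idx, step.name, step.deps)
```

The dependency list is what lets a planner-style executor run independent steps in parallel and hold back steps that wait on an earlier result.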
Key Findings
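Fine-tuning small models on curated function-calling data yields large gains: TinyLlama-1.1B rises from 12.71% to 78.89% success, and Wizard-2-7B from 41.25% to 83.09%.
A fine-tuned 7B model can exceed a cloud model on this task: TinyAgent-7B reaches 83.09% versus 79.08% for GPT-4-Turbo.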
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Function-calling success rate (fine-tuned) | TinyAgent-1.1B: 78.89% (after LoRA); TinyAgent-7B: 83.09% (after LoRA) | TinyLlama-1.1B off-the-shelf 12.71%; Wizard-2-7B off-the-shelf 41.25% | 1.1B: +66.18 pp; 7B: +41.84 pp | TinyAgent function-calling test set (1K) | Section 3.3 | Section 3.3 |
| Comparison vs GPT-4-Turbo | TinyAgent-7B: 83.09% vs GPT-4-Turbo: 79.08% | GPT-4-Turbo | +4.01 pp | Mac assistant benchmark | Section 3.3 and Conclusion | Conclusion |
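The Delta column is a plain percentage-point difference between the reported value and its baseline. A short check that recomputes it from the figures in the table:

```python
def pp_delta(value: float, baseline: float) -> float:
    """Percentage-point difference, rounded to two decimals."""
    return round(value - baseline, 2)


# (value, baseline) pairs copied from the table above.
rows = {
    "TinyAgent-1.1B vs TinyLlama-1.1B": (78.89, 12.71),
    "TinyAgent-7B vs Wizard-2-7B": (83.09, 41.25),
    "TinyAgent-7B vs GPT-4-Turbo": (83.09, 79.08),
}
deltas = {label: pp_delta(v, b) for label, (v, b) in rows.items()}
for label, delta in deltas.items():
    print(f"{label}: {delta:+.2f} pp")
```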
What To Try In 7 Days
Synthesize a small function-calling dataset for your app and fine-tune a 1–7B model with LoRA.
Train a simple classifier to select relevant APIs and drop unused tool descriptions from prompts.
Quantize your fine-tuned model to 4-bit and test end-to-end latency on target devices.
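For intuition about what 4-bit quantization does to weights, the toy below applies symmetric round-to-nearest quantization to a short list of numbers. This is a generic textbook scheme chosen for illustration, not the paper's quantization method or any library's implementation, and the weight values are made up.

```python
from typing import List, Tuple


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Made-up weights for a single quantization group.
weights = [0.70, -0.30, 0.10, 0.00, -0.70]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one of 16 integer levels plus a shared scale, so the reconstruction error per weight is at most half the scale. Real deployments typically quantize in small groups with per-group scales; measuring end-to-end latency and success rate on the target device remains the actual test.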
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Evaluation limited to a Mac assistant with 16 predefined tools; cross-platform generality is untested
Dataset is synthetic and may carry cultural or distributional bias
When Not To Use
Tasks that require broad, open-domain world knowledge beyond provided APIs
High-stakes decisions without human review
Failure Modes
Hallucinated or wrong function names/arguments leading to failed API calls
Missing auxiliary tools if classifier threshold misses required tools
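One way to catch the first failure mode before anything executes is to validate each planned call against a registry of known tools. The sketch below is an assumption-laden illustration: the registry, the tool names, and the arity check are hypothetical and are not part of the released TinyAgent code.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: tool name -> expected number of arguments.
TOOL_ARITY: Dict[str, int] = {
    "get_email_address": 1,
    "create_calendar_event": 3,
    "join": 0,
}


def validate_calls(calls: List[Tuple[str, List[str]]]) -> List[str]:
    """Return human-readable problems; an empty list means the plan passed."""
    problems = []
    for i, (name, args) in enumerate(calls, start=1):
        if name not in TOOL_ARITY:
            problems.append(f"step {i}: unknown function '{name}'")
        elif len(args) != TOOL_ARITY[name]:
            problems.append(
                f"step {i}: '{name}' expects {TOOL_ARITY[name]} args, got {len(args)}"
            )
    return problems


# A made-up plan with a misspelled function name and a call missing arguments.
calls = [
    ("get_email_address", ["Sam"]),
    ("create_calender_event", ["Lunch", "noon", "$1"]),
    ("create_calendar_event", ["Lunch"]),
    ("join", []),
]
problems = validate_calls(calls)
for problem in problems:
    print(problem)
```

A failed validation could trigger a retry, possibly with a lower tool-selection threshold so that a tool the classifier missed is offered on the second attempt; whether that helps in practice would need to be measured.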

