TinyAgent — small on-device LLM agents that call functions and match GPT‑4‑Turbo on tool orchestration

September 1, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper demonstrates a full pipeline and an on-device demo with concrete numbers (success rates, recall, latency) on a Mac-specific assistant; results are strong for this scope but generalization beyond the tested tool set and platform is not shown.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

Why It Matters For Business

TinyAgent shows you can run private, low-latency assistant features on-device with small models that match cloud performance on task-specific API orchestration.

Summary TLDR

TinyAgent is an end-to-end workflow for training small language models (1.1B and 7B) to generate function-calling plans and run locally on a MacBook. Key steps: synthesize a high-quality function-calling dataset (80K train, 1K/1K val/test), fine-tune SLMs with LoRA, use a classifier-based Tool RAG to include only relevant API descriptions in prompts, and quantize to 4-bit for faster on-device inference. Final models reach ~80–85% function-calling success on a Mac assistant benchmark, comparable to or above GPT‑4‑Turbo on this task. The dataset, models, and installer are open-sourced.

Problem Statement

Large cloud LLMs can orchestrate APIs but are too big for private, offline, low-latency use on devices. Off-the-shelf small models lack reliable function-calling and orchestration. The question: can small, task-specialized models run on-device and match large models' function-calling ability?

Main Contribution

A pipeline to teach small LLMs to produce function-calling plans using the LLMCompiler planner.

A curated function-calling dataset: 80K training, 1K validation, 1K test (synthesized and sanity-checked).

Key Findings

Fine-tuning small models on curated function-calling data yields large gains.

Numbers: TinyLlama-1.1B: 12.71% -> 78.89% success (after LoRA fine-tune)

Practical Use: If you fine-tune a small LLM on task-specific function-calling examples, expect a big jump in correct tool selection and orchestration; use LoRA and negative-tool samples.

Evidence Ref: Section 3.3

A small 7B model can match or exceed a cloud model on this task.

Numbers: TinyAgent-7B: 83.09% (fine-tuned) vs GPT-4-Turbo: 79.08% on the same benchmark

Practical Use: For focused device-control tasks, a well-finetuned 7B model can replace cloud calls and reduce privacy and latency concerns.

Evidence Ref: Section 3.3 and Conclusion

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Function-calling success rate (fine-tuned) | TinyAgent-1.1B: 78.89% (after LoRA); TinyAgent-7B: 83.09% (after LoRA) | TinyLlama-1.1B off-the-shelf: 12.71%; Wizard-2-7B off-the-shelf: 41.25% | 1.1B: +66.18 pp; 7B: +41.84 pp | TinyAgent function-calling test set (1K) | Section 3.3 | Section 3.3
Comparison vs GPT-4-Turbo | TinyAgent-7B: 83.09% vs GPT-4-Turbo: 79.08% | GPT-4-Turbo | +4.01 pp | Mac assistant benchmark | Section 3.3 and Conclusion | Conclusion
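
The deltas in the Results rows above are plain differences in percentage points (pp), not relative improvements. As a quick check, the arithmetic can be reproduced from the reported success rates; the helper name `delta_pp` is introduced here for illustration only.

```python
# Success rates (percent) as reported in the Results rows.
fine_tuned = {"TinyAgent-1.1B": 78.89, "TinyAgent-7B": 83.09}
# Off-the-shelf baselines: TinyLlama-1.1B and Wizard-2-7B respectively.
baseline = {"TinyAgent-1.1B": 12.71, "TinyAgent-7B": 41.25}
gpt4_turbo = 79.08


def delta_pp(value, reference):
    """Difference in percentage points, rounded to two decimals."""
    return round(value - reference, 2)


deltas = {name: delta_pp(fine_tuned[name], baseline[name]) for name in fine_tuned}
gap_vs_gpt4 = delta_pp(fine_tuned["TinyAgent-7B"], gpt4_turbo)

print(deltas)       # per-model gain over its off-the-shelf baseline
print(gap_vs_gpt4)  # TinyAgent-7B minus GPT-4-Turbo
```

Reading the deltas as percentage points matters here: a relative improvement for the 1.1B model would be a much larger figure and is not what the table reports.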

What To Try In 7 Days

Synthesize a small function-calling dataset for your app and fine-tune a 1–7B model with LoRA.

Train a simple classifier to select relevant APIs and drop unused tool descriptions from prompts.

Quantize your fine-tuned model to 4-bit and test end-to-end latency on target devices.
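
The second step above (a classifier that selects relevant APIs) might be prototyped roughly as follows. This is a minimal sketch, not the authors' implementation: the tool names, descriptions, per-tool scores, and the 0.5 threshold are all assumptions made for illustration, and the `scores` dictionary stands in for the output of a trained multi-label classifier.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str


# Hypothetical tool registry; names and descriptions are illustrative only.
TOOLS = [
    Tool("create_calendar_event", "Create a calendar event with title, time, attendees."),
    Tool("get_email_address", "Look up a contact's email address by name."),
    Tool("compose_new_email", "Draft an email with recipients, subject, body."),
    Tool("open_note", "Open an existing note by title."),
]


def select_tools(scores, threshold=0.5, tools=TOOLS):
    """Keep tools whose classifier probability meets the (assumed) threshold."""
    return [t for t in tools if scores.get(t.name, 0.0) >= threshold]


def build_prompt(query, tools):
    """Assemble a prompt that lists only the given tool descriptions."""
    lines = ["Available tools:"]
    lines += [f"- {t.name}: {t.description}" for t in tools]
    lines.append(f"User request: {query}")
    return "\n".join(lines)


query = "Schedule a meeting with Sam tomorrow at 10am"
# Stand-in for per-tool probabilities from a multi-label classifier.
scores = {
    "create_calendar_event": 0.93,
    "get_email_address": 0.81,
    "compose_new_email": 0.12,
    "open_note": 0.03,
}

selected = select_tools(scores)
reduced_prompt = build_prompt(query, selected)
full_prompt = build_prompt(query, TOOLS)
print(reduced_prompt)
```

The threshold trades prompt length against recall: raising it shortens the prompt but increases the chance of dropping a tool the plan needs, so it is worth tuning against tool recall on a validation set.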

Agent Features

Memory
In-context examples retrieved via RAG (short-term context)
Planning
Function calling plan (sequence of API calls); DAG-based orchestration (dependencies as DAG nodes/edges)
Tool Use
Function calling / API invocation; Parallel/function orchestration via LLMCompiler
Frameworks
LLMCompiler; LoRA; llama.cpp; whisper.cpp; DeBERTa-v3-small for Tool RAG
Is Agentic

Yes

Architectures
Sequence-to-plan LLM (LLMCompiler planner); Classifier-based Tool RAG
Collaboration
Human oversight recommended for critical actions
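
To make the DAG-based orchestration listed under Planning concrete, the sketch below groups a plan's steps into batches whose members have no unmet dependencies, so each batch could in principle run in parallel. It uses Python's standard `graphlib` module rather than LLMCompiler itself, and the plan contents are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
plan = {
    "get_email_address": set(),
    "get_phone_number": set(),
    "compose_new_email": {"get_email_address"},
    "send_sms": {"get_phone_number"},
}


def execution_batches(dependencies):
    """Group steps into batches; steps in one batch have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(plan)
print(batches)
```

In this example the two lookups form the first batch and the two dependent actions form the second. A real executor would dispatch each batch to worker threads or async tasks and substitute earlier results into later arguments.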

Optimization Features

Token Efficiency
Prompt token reduction ~2x using Tool RAG
Infra Optimization
Deploy on MacBook Pro M3; use llama.cpp for efficient local inference
Model Optimization
Post-training quantization to 4-bit (group size 32); Quantization-aware fine-tuning
System Optimization
Local audio processing with Whisper-v3 and whisper.cpp
Training Optimization
LoRA; Synthetic data generation with sanity checks
Inference Optimization
Tool RAG classifier to reduce prompt length; Reduced token context to lower attention cost
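
For readers unfamiliar with group-wise quantization, here is a minimal sketch of the idea behind "4-bit (group size 32)": every 32 weights share one scale and offset, and each weight is stored as an integer code from 0 to 15. This is a generic min/max scheme written for illustration; it is not necessarily the scheme the authors or llama.cpp use, and the random weights are synthetic.

```python
import random

GROUP_SIZE = 32  # group size stated in the summary
LEVELS = 15      # 4 bits give integer codes 0..15


def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus a shared scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from codes, scale, and offset."""
    return [lo + c * scale for c in codes]


def quantize(weights, group_size=GROUP_SIZE):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]  # synthetic weights
groups = quantize(weights)
restored = [w for group in groups for w in dequantize_group(*group)]

max_error = max(abs(a - b) for a, b in zip(weights, restored))
max_half_step = max(scale for _, scale, _ in groups) / 2
print(len(groups), max_error <= max_half_step + 1e-12)
```

Smaller groups track local weight ranges more closely at the cost of storing more scales and offsets, which is the basic trade-off behind choosing a group size.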

Risks & Boundaries

Limitations

Evaluation limited to a Mac assistant with 16 predefined tools; cross-platform generality is untested

Dataset is synthetic and may carry cultural or distributional bias

When Not To Use

Tasks that require broad, open-domain world knowledge beyond provided APIs

High-stakes decisions without human review

Failure Modes

Hallucinated or wrong function names/arguments leading to failed API calls

Missing auxiliary tools if classifier threshold misses required tools
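
One way to contain the first failure mode is to validate each planned call against the known tool signatures before executing it. The sketch below is an assumption about how such a guard could look, not something described in the paper; the registry contents and the `validate_call` helper are hypothetical.

```python
# Hypothetical tool signatures: tool name -> required argument names.
TOOL_SIGNATURES = {
    "get_email_address": {"name"},
    "compose_new_email": {"recipients", "subject", "body"},
}


def validate_call(call, signatures=TOOL_SIGNATURES):
    """Return a list of problems in one planned call; empty means it looks valid."""
    name = call.get("tool")
    if name not in signatures:
        return [f"unknown tool: {name}"]
    expected = signatures[name]
    given = set(call.get("args", {}))
    problems = [f"missing argument: {arg}" for arg in sorted(expected - given)]
    problems += [f"unexpected argument: {arg}" for arg in sorted(given - expected)]
    return problems


good = {"tool": "get_email_address", "args": {"name": "Sam"}}
hallucinated = {"tool": "send_fax", "args": {"number": "555"}}
incomplete = {"tool": "compose_new_email", "args": {"subject": "Hi", "cc": "x"}}

for call in (good, hallucinated, incomplete):
    print(call["tool"], validate_call(call))
```

A failed validation can be surfaced to the user or fed back to the planner for a retry instead of letting a malformed call reach the operating system.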

Core Entities

Models

TinyAgent-1.1B; TinyAgent-7B; TinyLlama-1.1B; Wizard-2-7B; GPT-4-Turbo; GPT-3.5; DeBERTa-v3-small; Whisper-v3

Metrics

Function-calling success rate (DAG isomorphism); Tool recall; Prompt size (tokens); End-to-end latency (seconds); Model size (GB)
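
The first two metrics can be illustrated with a small sketch. It assumes a plan is represented as a mapping from step id to a pair of tool name and dependency set, and that success means the predicted plan matches the reference up to renaming of step ids; the exact matching rules used in the paper may differ. The brute-force search over permutations is only practical for short plans.

```python
from itertools import permutations


def plans_isomorphic(plan_a, plan_b):
    """Brute-force check: same tools and same dependency edges, ignoring step ids.

    Each plan maps step id -> (tool name, set of step ids it depends on).
    """
    ids_a, ids_b = list(plan_a), list(plan_b)
    if len(ids_a) != len(ids_b):
        return False
    for perm in permutations(ids_b):
        mapping = dict(zip(ids_a, perm))
        if all(
            plan_a[a][0] == plan_b[mapping[a]][0]
            and {mapping[d] for d in plan_a[a][1]} == plan_b[mapping[a]][1]
            for a in ids_a
        ):
            return True
    return False


def tool_recall(selected, required):
    """Fraction of required tools that appear among the selected tools."""
    required = set(required)
    if not required:
        return 1.0
    return len(required & set(selected)) / len(required)


reference = {1: ("get_email_address", set()), 2: ("compose_new_email", {1})}
predicted = {"b": ("compose_new_email", {"a"}), "a": ("get_email_address", set())}
no_edge = {1: ("get_email_address", set()), 2: ("compose_new_email", set())}

print(plans_isomorphic(reference, predicted))  # same structure, different ids
print(plans_isomorphic(reference, no_edge))    # dependency edge is missing
print(tool_recall(["get_email_address"], ["get_email_address", "compose_new_email"]))
```

A success rate over a test set would then be the fraction of examples for which `plans_isomorphic` holds between the predicted and reference plans.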

Datasets

TinyAgent function-calling dataset (80K train / 1K val / 1K test)

Benchmarks

Mac assistant function-calling benchmark (1K test)

Context Entities

Models

LLaMA-2 70B (referenced); Gorilla, ToolFormer, Octopus (related tool frameworks)