TinyAgent — small on-device LLM agents that call functions and match GPT‑4‑Turbo on tool orchestration

Overview

Decision Snapshot: Ready For Pilot

The paper demonstrates a full pipeline and an on-device demo with concrete numbers (success rates, recall, latency) on a Mac-specific assistant; results are strong for this scope but generalization beyond the tested tool set and platform is not shown.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TinyAgent shows you can run private, low-latency assistant features on-device with small models that match cloud performance on task-specific API orchestration.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

TinyAgent is an end-to-end workflow for training small language models (1.1B and 7B) to generate function-calling plans and run locally on a MacBook. Key steps: synthesize a high-quality function-calling dataset (80K train, 1K/1K val/test), fine-tune SLMs with LoRA, use a classifier-based Tool RAG to include only relevant API descriptions in prompts, and quantize to 4-bit for faster on-device inference. Final models reach ~80–85% function-calling success on a Mac assistant benchmark, comparable to or above GPT‑4‑Turbo on this task. The dataset, models, and installer are open-sourced.

Problem Statement

Large cloud LLMs can orchestrate APIs but are too big for private, offline, low-latency use on devices. Off-the-shelf small models lack reliable function-calling and orchestration. The question: can small, task-specialized models run on-device and match large models' function-calling ability?

Main Contribution

A pipeline to teach small LLMs to produce function-calling plans using the LLMCompiler planner.

A curated function-calling dataset: 80K training, 1K validation, 1K test (synthesized and sanity-checked).

Key Findings

Fine-tuning small models on curated function-calling data yields large gains.

Numbers: TinyLlama-1.1B: 12.71% -> 78.89% success (after LoRA fine-tune)

Practical Use: If you fine-tune a small LLM on task-specific function-calling examples, expect a big jump in correct tool selection and orchestration; use LoRA and negative-tool samples.

Evidence Ref: Section 3.3

A small 7B model can match or exceed a cloud model on this task.

Numbers: TinyAgent-7B: 83.09% (fine-tuned) vs GPT-4-Turbo: 79.08% on the same benchmark

Practical Use: For focused device-control tasks, a well-finetuned 7B model can replace cloud calls and reduce privacy and latency concerns.

Evidence Ref: Section 3.3 and Conclusion

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence Ref
Function-calling success rate, TinyAgent-1.1B (after LoRA)	78.89%	TinyLlama-1.1B off-the-shelf: 12.71%	+66.18 pp	TinyAgent function-calling test set (1K)	Section 3.3
Function-calling success rate, TinyAgent-7B (after LoRA)	83.09%	Wizard-2-7B off-the-shelf: 41.25%	+41.84 pp	TinyAgent function-calling test set (1K)	Section 3.3
Comparison vs GPT-4-Turbo, TinyAgent-7B	83.09%	GPT-4-Turbo: 79.08%	+4.01 pp	Mac assistant benchmark	Section 3.3 and Conclusion
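The deltas in the Results table are absolute differences in percentage points, not relative improvements. A minimal sketch that recomputes them from the reported success rates (the dictionary keys and the `delta_pp` helper are illustrative names, not taken from the paper):

```python
# Success rates in percent, copied from the Results table.
# Each entry is (fine-tuned value, baseline value).
results = {
    "TinyAgent-1.1B vs TinyLlama-1.1B": (78.89, 12.71),
    "TinyAgent-7B vs Wizard-2-7B": (83.09, 41.25),
    "TinyAgent-7B vs GPT-4-Turbo": (83.09, 79.08),
}


def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to 2 decimals."""
    return round(value - baseline, 2)


deltas = {name: delta_pp(v, b) for name, (v, b) in results.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.2f} pp")
```

Reading the 1.1B result as "+66.18 pp" rather than as a percentage increase avoids overstating or understating the gain when baselines differ widely.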

What To Try In 7 Days

Synthesize a small function-calling dataset for your app and fine-tune a 1–7B model with LoRA.

Train a simple classifier to select relevant APIs and drop unused tool descriptions from prompts.

Quantize your fine-tuned model to 4-bit and test end-to-end latency on target devices.
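The second step above (classifier-based tool selection) can be sketched as follows. This is a rough illustration, not the authors' code: the tool names, descriptions, the 0.5 threshold, and the hard-coded `scores` are all assumptions standing in for a trained multi-label classifier's output.

```python
from typing import Dict, List

# Hypothetical tool catalogue; names and descriptions are illustrative only.
TOOL_DESCRIPTIONS: Dict[str, str] = {
    "create_calendar_event": "create_calendar_event(title, start, end, attendees): add an event",
    "compose_new_email": "compose_new_email(recipients, subject, body): draft an email",
    "open_note": "open_note(name): open an existing note",
    "get_directions": "get_directions(origin, destination): open directions in Maps",
}


def select_tools(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep tools whose classifier probability reaches the threshold."""
    return [name for name, p in scores.items() if p >= threshold]


def build_prompt(query: str, tools: List[str]) -> str:
    """Assemble a planner prompt that lists only the selected tools."""
    lines = ["Available tools:"]
    lines += [f"- {TOOL_DESCRIPTIONS[t]}" for t in tools]
    lines.append(f"User request: {query}")
    return "\n".join(lines)


# Stand-in for the per-tool probabilities a trained classifier would output.
scores = {
    "create_calendar_event": 0.93,
    "compose_new_email": 0.71,
    "open_note": 0.04,
    "get_directions": 0.02,
}
selected = select_tools(scores)
prompt = build_prompt("Email Ana and book a meeting for Friday", selected)
print(prompt)
```

Lowering the threshold keeps more tool descriptions in the prompt, which costs tokens but reduces the chance of dropping a tool the plan actually needs.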

Agent Features

Memory

In-context examples retrieved via RAG (short-term context)

Planning

Function calling plan (sequence of API calls)

DAG-based orchestration (dependencies as DAG nodes/edges)
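To make the DAG idea concrete, here is a small sketch of executing a function-calling plan in dependency order. The plan format, the `$N` placeholder convention, and the two tool functions are assumptions for illustration; they are not LLMCompiler's actual plan syntax or the paper's tool implementations.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical plan: each step names a tool, its arguments, and the steps it depends on.
# A "$N" argument is replaced by the output of step N.
plan = {
    1: {"tool": "get_email_address", "args": ["Sid"], "deps": []},
    2: {"tool": "get_email_address", "args": ["Lutfi"], "deps": []},
    3: {"tool": "create_calendar_event", "args": ["$1", "$2", "Project sync"], "deps": [1, 2]},
}


def get_email_address(name: str) -> str:
    return f"{name.lower()}@example.com"


def create_calendar_event(first: str, second: str, title: str) -> str:
    return f"event '{title}' with {first}, {second}"


TOOLS = {
    "get_email_address": get_email_address,
    "create_calendar_event": create_calendar_event,
}


def run_plan(plan: dict) -> dict:
    """Run every step after its dependencies and return outputs keyed by step id."""
    graph = {step_id: step["deps"] for step_id, step in plan.items()}
    outputs = {}
    for step_id in TopologicalSorter(graph).static_order():
        step = plan[step_id]
        args = [
            outputs[int(a[1:])]
            if isinstance(a, str) and a.startswith("$") and a[1:].isdigit()
            else a
            for a in step["args"]
        ]
        outputs[step_id] = TOOLS[step["tool"]](*args)
    return outputs


outputs = run_plan(plan)
print(outputs[3])
```

Steps 1 and 2 have no dependencies, so a real executor could run them in parallel; this sketch runs them sequentially for simplicity.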

Tool Use

Function calling / API invocation

Parallel/function orchestration via LLMCompiler

Frameworks

LLMCompiler, LoRA, llama.cpp, whisper.cpp, DeBERTa-v3-small (for Tool RAG)

Is Agentic

Yes

Architectures

Sequence-to-plan LLM (LLMCompiler planner)

Classifier-based Tool RAG

Collaboration

Human oversight recommended for critical actions

Optimization Features

Token Efficiency

Prompt token reduction ~2x using Tool RAG

Infra Optimization

Deploy on MacBook Pro M3; use llama.cpp for efficient local inference

Model Optimization

Post-training quantization to 4-bit (group size 32)

Quantization-aware fine-tuning
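As a rough illustration of what "4-bit with group size 32" means, the sketch below quantizes a flat list of weights in groups of 32, storing one scale and one offset per group. It is a generic asymmetric uniform quantizer written for clarity; it is not the llama.cpp storage format and not necessarily the scheme the authors used.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (4-bit codes, scale, minimum)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Map one group of weights to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(group: Group) -> List[float]:
    """Reconstruct approximate weights from codes, scale, and minimum."""
    codes, scale, lo = group
    return [lo + c * scale for c in codes]


def quantize(weights: List[float], group_size: int = 32, bits: int = 4) -> List[Group]:
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


weights = [i / 10 for i in range(-32, 32)]  # 64 toy weights, so two groups of 32
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), round(max_error, 4))
```

Smaller groups give each scale fewer values to cover, which lowers reconstruction error at the cost of storing more scales and offsets.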

System Optimization

Local audio processing with Whisper-v3 and whisper.cpp

Training Optimization

LoRA

Synthetic data generation with sanity checks

Inference Optimization

Tool RAG classifier to reduce prompt length

Reduced token context to lower attention cost

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/SqueezeAILab/TinyAgent https://github.com/SqueezeAILab/TinyAgent/raw/main/TinyAgent.zip

Data URLs

https://github.com/SqueezeAILab/TinyAgent

Risks & Boundaries

Limitations

Evaluation limited to a Mac assistant with 16 predefined tools; cross-platform generality is untested

Dataset is synthetic and may carry cultural or distributional bias

When Not To Use

Tasks that require broad, open-domain world knowledge beyond provided APIs

High-stakes decisions without human review

Failure Modes

Hallucinated or wrong function names/arguments leading to failed API calls

Missing auxiliary tools if classifier threshold misses required tools
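One way to contain the first failure mode is to validate each generated call against the registered tools before executing it. The sketch below is an assumption about how such a guard could look, not something described in the paper; `send_sms` and `validate_call` are hypothetical names.

```python
import inspect
from typing import Callable, Dict, List, Optional


def send_sms(recipient: str, message: str) -> str:
    """Hypothetical tool used only for this example."""
    return f"sent '{message}' to {recipient}"


TOOLS: Dict[str, Callable] = {"send_sms": send_sms}


def validate_call(tools: Dict[str, Callable], name: str, args: List) -> Optional[str]:
    """Return an error message, or None if the call can be dispatched."""
    if name not in tools:
        return f"unknown tool: {name}"
    try:
        # bind() raises TypeError when the arguments do not fit the signature.
        inspect.signature(tools[name]).bind(*args)
    except TypeError as exc:
        return f"bad arguments for {name}: {exc}"
    return None


ok = validate_call(TOOLS, "send_sms", ["Ana", "Running late"])
hallucinated = validate_call(TOOLS, "send_text", ["Ana", "Running late"])
wrong_arity = validate_call(TOOLS, "send_sms", ["Ana"])
print(ok, hallucinated, wrong_arity, sep="\n")
```

This only catches unknown names and mismatched argument counts; it cannot tell whether a well-formed call uses the right values, so human review for critical actions still applies.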

Core Entities

Models

TinyAgent-1.1B, TinyAgent-7B, TinyLlama-1.1B, Wizard-2-7B, GPT-4-Turbo, GPT-3.5, DeBERTa-v3-small, Whisper-v3

Metrics

Function-calling success rate (DAG isomorphism)

Tool recall

Prompt size (tokens)

End-to-end latency (seconds)

Model size (GB)
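The first two metrics can be sketched in a few lines. This is a simplified reading under stated assumptions: a plan is represented as a list of tool names plus dependency edges, and two plans match if some relabeling of nodes preserves both tool names and edges. Whether the paper's metric also compares arguments is not stated in this summary, so arguments are ignored here. The brute-force search is only practical for small plans.

```python
from itertools import permutations
from typing import List, Set, Tuple

# A plan is (tool name per node, set of (source, target) edges between node indices).
Plan = Tuple[List[str], Set[Tuple[int, int]]]


def plans_match(pred: Plan, gold: Plan) -> bool:
    """True if the two labeled DAGs are isomorphic (tool names and edges agree)."""
    pred_nodes, pred_edges = pred
    gold_nodes, gold_edges = gold
    if sorted(pred_nodes) != sorted(gold_nodes) or len(pred_edges) != len(gold_edges):
        return False
    n = len(pred_nodes)
    for perm in permutations(range(n)):  # perm[i] is the gold node matched to pred node i
        labels_ok = all(pred_nodes[i] == gold_nodes[perm[i]] for i in range(n))
        if labels_ok and {(perm[a], perm[b]) for a, b in pred_edges} == gold_edges:
            return True
    return False


def tool_recall(retrieved: Set[str], required: Set[str]) -> float:
    """Fraction of required tools that the retriever kept."""
    if not required:
        return 1.0
    return len(retrieved & required) / len(required)


gold = (["create_calendar_event", "get_email_address", "get_email_address"], {(1, 0), (2, 0)})
pred = (["get_email_address", "get_email_address", "create_calendar_event"], {(0, 2), (1, 2)})
wrong = (["create_calendar_event", "get_email_address", "get_email_address"], {(1, 0), (1, 2)})

print(plans_match(pred, gold), plans_match(wrong, gold))
print(tool_recall({"get_email_address", "open_note"}, {"get_email_address", "create_calendar_event"}))
```

The success rate over a test set would then be the fraction of examples for which `plans_match` returns true.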

Datasets

TinyAgent function-calling dataset (80K train / 1K val / 1K test)

Benchmarks

Mac assistant function-calling benchmark (1K test)

Context Entities

Models

LLaMA-2 70B (referenced)

Gorilla, ToolFormer, Octopus (related tool frameworks)

