TinyAgent — small on-device LLM agents that call functions and match GPT‑4‑Turbo on tool orchestration

September 1, 2024 · 8 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

Why It Matters For Business

TinyAgent shows you can run private, low-latency assistant features on-device with small models that match cloud performance on task-specific API orchestration.

Summary TLDR

TinyAgent is an end-to-end workflow for training small language models (1.1B and 7B) to generate function-calling plans and run locally on a MacBook. Key steps: synthesize a high-quality function-calling dataset (80K train, 1K/1K val/test), fine-tune SLMs with LoRA, use a classifier-based Tool RAG to include only relevant API descriptions in prompts, and quantize to 4-bit for faster on-device inference. Final models reach ~80–85% function-calling success on a Mac assistant benchmark, comparable to or above GPT‑4‑Turbo on this task. The dataset, models, and installer are open-sourced.

Problem Statement

Large cloud LLMs can orchestrate APIs but are too big for private, offline, low-latency use on devices. Off-the-shelf small models lack reliable function-calling and orchestration. The question: can small, task-specialized models run on-device and match large models' function-calling ability?

Main Contribution

A pipeline to teach small LLMs to produce function-calling plans using the LLMCompiler planner.

A curated function-calling dataset: 80K training, 1K validation, 1K test (synthesized and sanity-checked).

Fine-tuned TinyAgent models (1.1B and 7B) that achieve ~80–85% success on a Mac assistant benchmark.

Tool RAG: a DeBERTa-v3-small classifier that selects relevant tools, halving prompt size and keeping recall near 1.0.

4-bit quantization plus quantization-aware fine-tuning to cut model size and latency for real-time on-device use.

Open-source release and a working MacBook demo with local audio (Whisper) and function execution.

Key Findings

Fine-tuning small models on curated function-calling data yields large gains.

Numbers: TinyLlama-1.1B rises from 12.71% to 78.89% success after LoRA fine-tuning

A small 7B model can match or exceed a cloud model on this task.

Numbers: TinyAgent-7B (fine-tuned) 83.09% vs GPT-4-Turbo 79.08% on the same benchmark

Classifier-based Tool RAG dramatically cuts prompt size while preserving tool recall.

Numbers: average of 3.97 tools selected per query; tool recall 0.998; prompt tokens reduced from ~2762 to ~1397 (~2x)

4-bit quantization reduces model footprint and speeds up inference.

Numbers: ~4x model size reduction and ~30% latency improvement after 4-bit quantization

Synthetic dataset was inexpensive to produce and sanity-checked.

Numbers: the 80K train / 1K val / 1K test dataset cost ≈ $500 to synthesize

Results

Function-calling success rate (fine-tuned)

Value: TinyAgent-1.1B 78.89% (after LoRA); TinyAgent-7B 83.09% (after LoRA)

Baseline: TinyLlama-1.1B off-the-shelf 12.71%; Wizard-2-7B off-the-shelf 41.25%

Comparison vs GPT-4-Turbo

Value: TinyAgent-7B 83.09% vs GPT-4-Turbo 79.08%

Baseline: GPT-4-Turbo

Tool selection and prompt tokens

Value: average tools per query 3.97; prompt tokens ~1397 (with DeBERTa Tool RAG)

Baseline: Basic RAG (top-3/top-6 retrieval); prompt tokens ~2762 / 1674

Quantization: size & latency

Value: 4-bit quantization gives ~4x model size reduction and ~30% lower latency

Baseline: 16-bit weights

Who Should Care

What To Try In 7 Days

Synthesize a small function-calling dataset for your app and fine-tune a 1–7B model with LoRA.

Train a simple classifier to select relevant APIs and drop unused tool descriptions from prompts.

Quantize your fine-tuned model to 4-bit and test end-to-end latency on target devices.

Agent Features

Memory

  • In-context examples retrieved via RAG (short-term context)

Planning

  • Function calling plan (sequence of API calls)
  • DAG-based orchestration (dependencies as DAG nodes/edges)
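
To make the planning idea concrete, here is a minimal Python sketch of a function-calling plan treated as a DAG and run in dependency order. The tool names, the `$N` placeholder convention, and the helper functions are illustrative assumptions, not the project's actual API or plan format. The sketch runs steps one at a time; an orchestrator such as LLMCompiler could instead dispatch independent steps in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical tools standing in for real device APIs (names are illustrative).
def get_email_address(name):
    return f"{name.lower()}@example.com"

def create_calendar_event(title, attendees):
    return f"event '{title}' with {', '.join(attendees)}"

TOOLS = {
    "get_email_address": get_email_address,
    "create_calendar_event": create_calendar_event,
}

# Each step: id -> (tool name, arguments). "$N" refers to the output of step N.
plan = {
    1: ("get_email_address", ["Sid"]),
    2: ("get_email_address", ["Lutfi"]),
    3: ("create_calendar_event", ["Project sync", ["$1", "$2"]]),
}

def is_ref(arg):
    return isinstance(arg, str) and arg.startswith("$") and arg[1:].isdigit()

def find_refs(arg):
    """Collect the step ids an argument depends on (searching nested lists)."""
    if is_ref(arg):
        return {int(arg[1:])}
    if isinstance(arg, list):
        refs = set()
        for item in arg:
            refs |= find_refs(item)
        return refs
    return set()

def resolve(arg, results):
    """Replace "$N" placeholders with the recorded output of step N."""
    if is_ref(arg):
        return results[int(arg[1:])]
    if isinstance(arg, list):
        return [resolve(item, results) for item in arg]
    return arg

def run_plan(plan):
    # Edges of the DAG: step id -> set of step ids it depends on.
    deps = {sid: find_refs(args) for sid, (_, args) in plan.items()}
    results = {}
    for sid in TopologicalSorter(deps).static_order():
        name, args = plan[sid]
        results[sid] = TOOLS[name](*resolve(args, results))
    return results

results = run_plan(plan)
```

Steps 1 and 2 have no dependencies, so they are the ones a parallel executor could run concurrently before step 3 consumes both outputs.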

Tool Use

  • Function calling / API invocation
  • Parallel function orchestration via LLMCompiler

Frameworks

  • LLMCompiler
  • LoRA
  • llama.cpp
  • whisper.cpp
  • DeBERTa-v3-small for Tool RAG

Is Agentic

true

Architectures

  • Sequence-to-plan LLM (LLMCompiler planner)
  • Classifier-based Tool RAG

Collaboration

  • Human oversight recommended for critical actions

Optimization Features

Token Efficiency

  • Prompt token reduction ~2x using Tool RAG

Infra Optimization

  • Deploy on MacBook Pro M3; use llama.cpp for efficient local inference

Model Optimization

  • Post-training quantization to 4-bit (group size 32)
  • Quantization-aware fine-tuning
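
As a rough illustration of what group-wise 4-bit quantization does, the sketch below quantizes a flat list of weights in groups of 32, storing one scale and one offset per group. The asymmetric min/max scheme is an assumption chosen for clarity; the formats actually used by llama.cpp differ in detail, and this is not the project's quantization code.

```python
import math

def quantize_group(values, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def quantize(weights, group_size=32, bits=4):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]

def dequantize(groups):
    """Reconstruct approximate floats from (codes, scale, offset) groups."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(c * scale + lo for c in codes)
    return restored

# Toy weights; real tensors would come from the fine-tuned model.
weights = [math.sin(i) for i in range(64)]
groups = quantize(weights)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Smaller groups track the local range of the weights more closely, at the cost of storing more per-group metadata. The rounding error of each weight is bounded by half of its group's scale.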

System Optimization

  • Local audio processing with Whisper-v3 and whisper.cpp

Training Optimization

  • LoRA
  • Synthetic data generation with sanity checks
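
For readers new to LoRA, the following sketch shows the core idea in plain Python: the pretrained weight matrix W stays frozen, and only two small matrices A and B are trained, whose product forms a low-rank update scaled by alpha / r. The matrix sizes, the rank, and the helper names are illustrative assumptions, not the settings used for TinyAgent.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    delta = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

# Tiny worked example with rank 1.
W = [[1, 0, 0], [0, 1, 0]]   # frozen weights, shape 2 x 3
A = [[1, 2, 3]]              # trainable, shape 1 x 3
B = [[0.5], [0]]             # trainable, shape 2 x 1
W_eff = lora_effective_weight(W, A, B, alpha=2)
# W_eff == [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0]]

# Illustrative sizes only: one 4096 x 4096 projection with rank 16.
full = 4096 * 4096
lora = trainable_params(4096, 4096, 16)
ratio = full / lora  # 128.0
```

The parameter count is what makes fine-tuning a 1B to 7B model practical on modest hardware: per adapted matrix, only the two thin factors receive gradients.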

Inference Optimization

  • Tool RAG classifier to reduce prompt length
  • Reduced token context to lower attention cost
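
The Tool RAG step can be sketched as a multi-label selection: score every tool for the query, keep the tools above a probability threshold, and put only those descriptions in the prompt. In the sketch below the logits are made up, the 0.5 threshold is an assumption, and the tool names and descriptions are hypothetical; in practice the scores would come from the fine-tuned DeBERTa-v3-small classifier.

```python
import math

# Hypothetical one-line tool descriptions; real prompt text is not shown here.
TOOL_DESCRIPTIONS = {
    "get_email_address": "get_email_address(name): look up a contact's email.",
    "create_calendar_event": "create_calendar_event(title, attendees): add an event.",
    "open_note": "open_note(name): open a note by title.",
    "maps_show_directions": "maps_show_directions(start, end): show a route.",
}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_tools(logits, threshold=0.5):
    """Keep every tool whose predicted probability exceeds the threshold."""
    return [tool for tool, logit in logits.items() if sigmoid(logit) > threshold]

def build_prompt(query, tools):
    """Include only the selected tools' descriptions in the planner prompt."""
    described = "\n".join(TOOL_DESCRIPTIONS[t] for t in tools)
    return f"Tools:\n{described}\n\nQuery: {query}"

# Made-up logits, one per tool, as a multi-label classifier head might emit.
logits = {
    "get_email_address": 2.3,
    "create_calendar_event": 1.1,
    "open_note": -3.0,
    "maps_show_directions": -1.7,
}
selected = select_tools(logits)
prompt = build_prompt("Schedule a sync with Sid", selected)

# Prompt-size figures reported above: ~2762 tokens down to ~1397.
reduction = 2762 / 1397
```

Unlike top-k retrieval, a threshold lets the number of tools vary per query, which is why the reported average (3.97) is not an integer. Lowering the threshold trades longer prompts for a smaller chance of dropping a required tool.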

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation limited to a Mac assistant with 16 predefined tools; cross-platform generality is untested
  • Dataset is synthetic and may carry cultural or distributional bias
  • Performance depends on fixed, pre-defined APIs; adding new APIs requires dataset + classifier updates

When Not To Use

  • Tasks that require broad, open-domain world knowledge beyond provided APIs
  • High-stakes decisions without human review
  • Environments with many unseen or rapidly changing APIs

Failure Modes

  • Hallucinated or wrong function names/arguments leading to failed API calls
  • Missing auxiliary tools if classifier threshold misses required tools
  • Degraded performance on user queries diverging from synthetic training styles
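
One way to guard against the first failure mode is to validate each generated call against a registry of known tools before executing it. The sketch below is a hedged illustration: the registry contents and the `validate_call` and `validate_plan` helpers are hypothetical and not part of the released code.

```python
# Hypothetical registry: tool name -> set of required argument names.
TOOL_SCHEMAS = {
    "get_email_address": {"name"},
    "compose_new_email": {"recipients", "subject", "body"},
}

def validate_call(name, args, schemas=TOOL_SCHEMAS):
    """Return a list of problems; an empty list means the call looks well-formed."""
    if name not in schemas:
        return [f"unknown tool: {name}"]
    required = schemas[name]
    problems = []
    for missing in sorted(required - set(args)):
        problems.append(f"{name}: missing argument '{missing}'")
    for extra in sorted(set(args) - required):
        problems.append(f"{name}: unexpected argument '{extra}'")
    return problems

def validate_plan(calls, schemas=TOOL_SCHEMAS):
    """Validate a list of (name, args) pairs and collect every problem found."""
    problems = []
    for name, args in calls:
        problems.extend(validate_call(name, args, schemas))
    return problems

good = validate_call("get_email_address", {"name": "Sid"})
hallucinated = validate_call("send_fax", {"to": "Sid"})
malformed = validate_call("compose_new_email", {"recipients": [], "title": "hi"})
```

A non-empty result can be used to reject the plan, ask the model to re-plan, or hand control to the user, which fits the recommendation of human oversight for critical actions.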

Core Entities

Models

  • TinyAgent-1.1B
  • TinyAgent-7B
  • TinyLlama-1.1B
  • Wizard-2-7B
  • GPT-4-Turbo
  • GPT-3.5
  • DeBERTa-v3-small
  • Whisper-v3

Metrics

  • Function-calling success rate (DAG isomorphism)
  • Tool recall
  • Prompt size (tokens)
  • End-to-end latency (seconds)
  • Model size (GB)
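
The success-rate metric can be illustrated with a small check for DAG isomorphism: a predicted plan counts as correct if its nodes can be relabeled so that tool names and dependency edges line up with the reference plan. The brute-force sketch below compares only tool names and edges, works only for small plans, and uses hypothetical names throughout; whether the original evaluation also compares arguments is not stated here.

```python
from itertools import permutations

def plans_match(labels_a, edges_a, labels_b, edges_b):
    """True if a label-preserving node bijection maps edges_a exactly onto edges_b."""
    n = len(labels_a)
    if n != len(labels_b) or sorted(labels_a) != sorted(labels_b):
        return False
    source, target = set(edges_a), set(edges_b)
    if len(source) != len(target):
        return False
    for perm in permutations(range(n)):
        # perm[i] is the node in plan B that node i of plan A is mapped to.
        if all(labels_a[i] == labels_b[perm[i]] for i in range(n)):
            if {(perm[u], perm[v]) for u, v in source} == target:
                return True
    return False

def success_rate(pairs):
    """Fraction of (predicted, reference) plan pairs that are isomorphic."""
    hits = sum(plans_match(*pred, *ref) for pred, ref in pairs)
    return hits / len(pairs)

# Edges point from a step to the step that consumes its output.
predicted = (["get_email_address", "get_email_address", "create_calendar_event"],
             [(0, 2), (1, 2)])
reference = (["create_calendar_event", "get_email_address", "get_email_address"],
             [(1, 0), (2, 0)])
incomplete = (["get_email_address", "get_email_address", "create_calendar_event"],
              [(0, 2)])

same = plans_match(*predicted, *reference)
different = plans_match(*incomplete, *reference)
rate = success_rate([(predicted, reference), (incomplete, reference)])
```

Comparing graphs rather than strings means a plan is not penalized for listing independent steps in a different order than the reference.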

Datasets

  • TinyAgent function-calling dataset (80K train / 1K val / 1K test)

Benchmarks

  • Mac assistant function-calling benchmark (1K test)

Context Entities

Models

  • LLaMA-2 70B (referenced)
  • Gorilla, ToolFormer, Octopus (related tool frameworks)