Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
TinyAgent shows that small, task-specialized models can power private, low-latency assistant features on-device while matching a cloud model (GPT-4-Turbo) on task-specific API orchestration.
Summary TLDR
TinyAgent is an end-to-end workflow for training small language models (1.1B and 7B) to generate function-calling plans and run locally on a MacBook. Key steps: synthesize a high-quality function-calling dataset (80K train, 1K validation, 1K test), fine-tune SLMs with LoRA, use a classifier-based Tool RAG to include only relevant API descriptions in prompts, and quantize to 4-bit for faster on-device inference. Final models reach ~80–85% function-calling success on a Mac assistant benchmark, comparable to or above GPT-4-Turbo on this task. The dataset, models, and installer are open-sourced.
Problem Statement
Large cloud LLMs can orchestrate APIs but are too large to run on-device, which rules out private, offline, low-latency use. Off-the-shelf small models lack reliable function-calling and orchestration. The question: can small, task-specialized models run on-device and match large models' function-calling ability?
Main Contribution
A pipeline to teach small LLMs to produce function-calling plans using the LLMCompiler planner.
A curated function-calling dataset: 80K training, 1K validation, 1K test (synthesized and sanity-checked).
Fine-tuned TinyAgent models (1.1B and 7B) that achieve ~80–85% success on a Mac assistant benchmark.
Tool RAG: a DeBERTa-v3-small classifier that selects relevant tools, halving prompt size and keeping recall near 1.0.
4-bit quantization plus quantization-aware fine-tuning to cut model size and latency for real-time on-device use.
Open-source release and a working MacBook demo with local audio (Whisper) and function execution.
Key Findings
Fine-tuning small models on curated function-calling data yields large gains.
A fine-tuned 7B model can match or exceed GPT-4-Turbo on this task.
Classifier-based Tool RAG dramatically cuts prompt size while preserving tool recall.
4-bit quantization reduces model footprint and speeds up inference.
The synthetic dataset was inexpensive to produce and was sanity-checked.
Results
Function-calling success rate (fine-tuned)
Comparison vs GPT-4-Turbo
Tool selection and prompt tokens
Quantization: size & latency
Who Should Care
What To Try In 7 Days
Synthesize a small function-calling dataset for your app and fine-tune a 1–7B model with LoRA.
Train a simple classifier to select relevant APIs and drop unused tool descriptions from prompts.
Quantize your fine-tuned model to 4-bit and test end-to-end latency on target devices.
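The tool-selection step above can be sketched as follows. This is a minimal illustration, not the TinyAgent implementation: the tool names, descriptions, scores, and the 0.5 threshold are all hypothetical, and the scores stand in for the per-tool probabilities a trained classifier would produce.

```python
from typing import Dict, List

# Hypothetical tool descriptions; a real app would list its own APIs.
TOOL_DESCRIPTIONS: Dict[str, str] = {
    "create_calendar_event": "create_calendar_event(title, start, end): add an event.",
    "compose_new_email": "compose_new_email(to, subject, body): draft an email.",
    "get_phone_number": "get_phone_number(name): look up a contact's phone number.",
    "open_note": "open_note(name): open an existing note.",
}


def select_tools(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every tool whose classifier probability meets the threshold."""
    return [name for name in TOOL_DESCRIPTIONS if scores.get(name, 0.0) >= threshold]


def build_prompt(query: str, selected: List[str]) -> str:
    """Assemble a prompt that contains only the selected tool descriptions."""
    lines = ["Available tools:"]
    lines += [f"- {TOOL_DESCRIPTIONS[name]}" for name in selected]
    lines.append(f"User query: {query}")
    return "\n".join(lines)


# Made-up scores standing in for classifier output on one query (assumption).
example_scores = {
    "create_calendar_event": 0.97,
    "compose_new_email": 0.08,
    "get_phone_number": 0.02,
    "open_note": 0.61,
}
selected_tools = select_tools(example_scores)
prompt = build_prompt("Schedule a review and open my notes", selected_tools)
```

Lowering the threshold trades prompt length for recall: more descriptions are kept, so a required tool is less likely to be dropped.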
Agent Features
Memory
- In-context examples retrieved via RAG (short-term context)
Planning
- Function calling plan (sequence of API calls)
- DAG-based orchestration (dependencies as DAG nodes/edges)
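To illustrate the planning items above, here is a minimal sketch of a function-calling plan stored as a DAG and run in dependency order. The plan format, the `$<id>` reference convention, and both tool functions are assumptions made for this example; LLMCompiler's actual plan syntax and executor are not reproduced here.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List


# Hypothetical stand-ins for real tool implementations.
def get_email_address(name: str) -> str:
    return f"{name.lower()}@example.com"


def create_calendar_event(title: str, attendee: str) -> str:
    return f"event '{title}' with {attendee}"


TOOLS: Dict[str, Callable[..., Any]] = {
    "get_email_address": get_email_address,
    "create_calendar_event": create_calendar_event,
}

# Each step has an id, a tool name, and arguments. An argument of the
# form "$<id>" refers to the output of another step (assumed convention).
plan: List[Dict[str, Any]] = [
    {"id": "1", "tool": "get_email_address", "args": ["Sid"]},
    {"id": "2", "tool": "create_calendar_event", "args": ["Sync", "$1"]},
]


def dependencies(step: Dict[str, Any]) -> List[str]:
    """Return ids of the steps whose outputs this step consumes."""
    return [a[1:] for a in step["args"] if isinstance(a, str) and a.startswith("$")]


def run_plan(steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Execute steps in topological order, substituting "$id" references."""
    by_id = {s["id"]: s for s in steps}
    graph = {s["id"]: dependencies(s) for s in steps}  # node -> predecessors
    results: Dict[str, Any] = {}
    for step_id in TopologicalSorter(graph).static_order():
        step = by_id[step_id]
        args = [
            results[a[1:]] if isinstance(a, str) and a.startswith("$") else a
            for a in step["args"]
        ]
        results[step_id] = TOOLS[step["tool"]](*args)
    return results


results = run_plan(plan)
```

This sketch runs steps one at a time. Steps with no dependency between them could instead be dispatched concurrently, which is the kind of parallelism a DAG representation makes possible.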
Tool Use
- Function calling / API invocation
- Parallel function orchestration via LLMCompiler
Frameworks
- LLMCompiler
- LoRA
- llama.cpp
- whisper.cpp
- DeBERTa-v3-small for Tool RAG
Is Agentic
true
Architectures
- Sequence-to-plan LLM (LLMCompiler planner)
- Classifier-based Tool RAG
Collaboration
- Human oversight recommended for critical actions
Optimization Features
Token Efficiency
- Prompt token reduction ~2x using Tool RAG
Infra Optimization
- Deploy on MacBook Pro M3; use llama.cpp for efficient local inference
Model Optimization
- Post-training quantization to 4-bit (group size 32)
- Quantization-aware fine-tuning
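To make "4-bit (group size 32)" concrete, the sketch below quantizes a flat list of weights in groups of 32, storing one scale and one minimum per group. It is a simplified min-max scheme written for illustration; the actual on-disk format and rounding used by llama.cpp are not reproduced, and the toy weights are made up.

```python
from typing import List, Tuple

GROUP_SIZE = 32   # weights that share one scale and offset
LEVELS = 15       # 4 bits give integer codes 0..15


def quantize_group(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of floats to 4-bit codes plus a (scale, minimum) pair."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Reconstruct approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


def quantize(weights: List[float]) -> List[Tuple[List[int], float, float]]:
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[i:i + GROUP_SIZE])
        for i in range(0, len(weights), GROUP_SIZE)
    ]


# Toy weights (assumption): 64 values, so two groups of 32.
weights = [((i * 37) % 64) / 64.0 - 0.5 for i in range(64)]
groups = quantize(weights)
restored = [w for codes, scale, lo in groups for w in dequantize_group(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Smaller groups track local weight ranges more closely but add per-group overhead. Assuming, for example, a 16-bit scale and a 16-bit minimum per group of 32, the overhead is 32 bits per 32 weights, or about one extra bit per weight.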
System Optimization
- Local audio processing with Whisper-v3 and whisper.cpp
Training Optimization
- LoRA
- Synthetic data generation with sanity checks
Inference Optimization
- Tool RAG classifier to reduce prompt length
- Reduced token context to lower attention cost
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Evaluation limited to a Mac assistant with 16 predefined tools; cross-platform generality is untested
- Dataset is synthetic and may carry cultural or distributional bias
- Performance depends on fixed, pre-defined APIs; adding new APIs requires updating both the dataset and the classifier
When Not To Use
- Tasks that require broad, open-domain world knowledge beyond provided APIs
- High-stakes decisions without human review
- Environments with many unseen or rapidly changing APIs
Failure Modes
- Hallucinated or wrong function names/arguments leading to failed API calls
- Required tools omitted from the prompt when the Tool RAG classifier scores them below its threshold
- Degraded performance on user queries diverging from synthetic training styles
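The first failure mode above can be caught before execution by checking each planned call against the known tool signatures. The sketch below assumes a hypothetical registry that maps tool names to argument counts and a hypothetical plan format; it is one possible guard, not part of the TinyAgent release.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: tool name -> number of expected positional arguments.
TOOL_ARITY: Dict[str, int] = {
    "get_email_address": 1,
    "compose_new_email": 3,
}

Call = Tuple[str, List[str]]  # (tool name, arguments)


def validate_plan(plan: List[Call]) -> List[str]:
    """Return a list of human-readable problems; empty means the plan passed."""
    problems: List[str] = []
    for index, (tool, args) in enumerate(plan, start=1):
        if tool not in TOOL_ARITY:
            problems.append(f"step {index}: unknown tool '{tool}'")
        elif len(args) != TOOL_ARITY[tool]:
            problems.append(
                f"step {index}: '{tool}' expects {TOOL_ARITY[tool]} args, got {len(args)}"
            )
    return problems


good_plan: List[Call] = [("get_email_address", ["Sid"])]
bad_plan: List[Call] = [
    ("get_email_adress", ["Sid"]),             # misspelled tool name
    ("compose_new_email", ["a@example.com"]),  # too few arguments
]
good_problems = validate_plan(good_plan)
bad_problems = validate_plan(bad_plan)
```

A plan that fails validation could be rejected or sent back to the model for another attempt instead of being executed.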
Core Entities
Models
- TinyAgent-1.1B
- TinyAgent-7B
- TinyLlama-1.1B
- WizardLM-2-7B
- GPT-4-Turbo
- GPT-3.5
- DeBERTa-v3-small
- Whisper-v3
Metrics
- Function-calling success rate (DAG isomorphism)
- Tool recall
- Prompt size (tokens)
- End-to-end latency (seconds)
- Model size (GB)
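The success-rate metric listed above compares each predicted plan's DAG with a reference DAG. The sketch below shows one way such a check could work, assuming nodes are labeled with tool names and that two plans match when a label-preserving node mapping also preserves every edge. The graph encoding and the choice to ignore arguments are assumptions of this example, and the brute-force search is only practical for small plans.

```python
from itertools import permutations
from typing import List, Set, Tuple

# A plan graph: node labels (tool names) and directed edges between node indices.
Graph = Tuple[List[str], Set[Tuple[int, int]]]


def is_isomorphic(a: Graph, b: Graph) -> bool:
    """True if some relabeling of a's nodes matches b's labels and edges."""
    labels_a, edges_a = a
    labels_b, edges_b = b
    if len(labels_a) != len(labels_b) or len(edges_a) != len(edges_b):
        return False
    if sorted(labels_a) != sorted(labels_b):
        return False
    n = len(labels_a)
    for perm in permutations(range(n)):  # perm[i]: node in b matched to node i in a
        if any(labels_a[i] != labels_b[perm[i]] for i in range(n)):
            continue
        if {(perm[u], perm[v]) for u, v in edges_a} == edges_b:
            return True
    return False


def success_rate(predicted: List[Graph], reference: List[Graph]) -> float:
    """Fraction of predicted plans whose DAG is isomorphic to the reference."""
    hits = sum(is_isomorphic(p, r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Same plan with nodes listed in a different order: counts as a match.
ref = (["get_email_address", "get_email_address", "compose_new_email"], {(0, 2), (1, 2)})
pred_ok = (["compose_new_email", "get_email_address", "get_email_address"], {(1, 0), (2, 0)})
# Wrong tool in the plan: counts as a miss.
pred_bad = (["get_phone_number", "get_email_address", "compose_new_email"], {(0, 2), (1, 2)})

rate = success_rate([pred_ok, pred_bad], [ref, ref])
```

Comparing graph structure rather than raw text means a plan is not penalized for listing independent steps in a different order.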
Datasets
- TinyAgent function-calling dataset (80K train / 1K val / 1K test)
Benchmarks
- Mac assistant function-calling benchmark (1K test)
Context Entities
Models
- LLaMA-2 70B (referenced)
- Gorilla, Toolformer, Octopus (related tool frameworks)

