APPL: a Python-native prompt language that auto-parallelizes LLM calls, traces runs, and turns functions into tools

June 19, 20247 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Honghua Dong, Qidong Su, Yubo Gao, Zhaoyu Li, Yangjun Ruan, Gennady Pekhimenko, Chris J. Maddison, Xujie Si

Links

Abstract / PDF

Why It Matters For Business

APPL reduces development time and runtime cost for LLM-driven workflows by making prompts first-class in Python, auto-parallelizing independent calls, and enabling tool integration without manual spec writing.

Summary TLDR

APPL is a small language layer that embeds natural-language prompts directly into Python functions. It makes LLM calls asynchronous by default, captures prompt context automatically, extracts tool specifications from Python functions, and records traces for replay and debugging. In practice APPL shortens code, auto-parallelizes independent LLM generations (often ~3–9× speedup in tested workflows), and simplifies building tool-using or multi-agent systems while keeping Python ergonomics.

Problem Statement

Writing maintainable programs that mix Python code and complex LLM prompts is error-prone and verbose. Developers must manually manage prompt contexts, build tool specs, handle async calls for parallelism, and reproduce runs for debugging. APPL aims to fix these frictions with a Python-native prompt language and runtime that automates context capture, parallel execution, tool integration, and tracing.

Main Contribution

APPL language: Python-native decorator (@ppl) that treats standalone expressions as prompts and exposes a gen() call for LLM generations.

Asynchronous runtime: StringFuture/BooleanFuture objects let gen() run asynchronously and synchronize only when needed, enabling automatic parallelization.

Context and tooling: Four context-passing modes (new, copy, same, resume), automatic tool specification from Python signatures/docstrings, and gen outputs that can be executed as tool calls.

Tracing and replay: Strict and non-strict tracing modes with cached responses to reproduce and debug runs without re-costing LLM calls.

Key Findings

Automatic parallelization significantly reduces wall-clock time for independent LLM calls.

NumbersCoT-SC (GPT-3.5): 27.6s → 2.9s (9.49× speedup); Table 2

Speedups are consistent across models and tasks but can be limited by memory/batch size.

NumbersCoT-SC (LLAMA-7b): 17.0s → 1.8s (9.34×); MemWalker (LLAMA-7b) speedup only 1.84× due to memory effects

APPL programs are more compact than comparable prompt languages.

NumbersCoT-SC AST-size: APPL 35 vs LMQL 57 (1.63× larger for LMQL); Table 3

Tool integration is easier when Python functions are documented.

NumbersAPPL auto-generates OpenAI-style tool JSON from function signatures/docstrings (examples shown)

Results

CoT-SC wall-clock time

Value2.9s (parallel, GPT-3.5)

Baseline27.6s (sequential, GPT-3.5)

CoT-SC wall-clock time

Value1.8s (parallel, LLAMA-7b)

Baseline17.0s (sequential, LLAMA-7b)

Skeleton-of-Thought speedup

Value2.79× (GPT-3.5)

Baselinesequential implementation

Program succinctness (AST-size)

ValueAPPL: 35 nodes

BaselineLMQL: 57 nodes

Who Should Care

What To Try In 7 Days

Install APPL and convert one prompt-heavy Python function using @ppl to see code reduction and context capture.

Profile an existing self-consistency or multi-branch pipeline and re-run it under APPL to measure wall-clock speedup.

Document a few Python helper functions and let APPL auto-generate tool specs to feed into a ReAct-style agent prototype.

Agent Features

Memory

  • per-function prompt context (convo/records)
  • resume mode for stateful agents and trace-based replay

Planning

  • automatic asynchronous scheduling of independent gen() calls
  • supports agent loops via resume context

Tool Use

  • automatic tool spec extraction from Python signatures/docstrings
  • gen outputs can be parsed and executed as tool calls

Frameworks

  • APPL integrates with OpenAI API and existing Python ecosystem

Is Agentic

true

Architectures

  • Python decorator + transpiled AST
  • asynchronous runtime with Future semantics

Collaboration

  • built-in patterns for multi-agent chat and context passing

Optimization Features

Infra Optimization

  • enables batching of parallel requests when backend supports it (depends on backend)

System Optimization

  • tracing and cache reuse to avoid re-sending LLM calls
  • AST-level transpilation to inject context with low overhead

Inference Optimization

  • automatic parallelization of independent LLM calls
  • delayed concatenation via StringFuture to avoid premature sync

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Parallel speedup depends on model/backend batching and GPU memory; long contexts can reduce gains.
  • Automatic tool spec extraction requires well-structured docstrings (Google style) to work reliably.
  • APPL transpiles Python AST and thus requires Python code flow—non-Python runtimes not supported.

When Not To Use

  • When you need tightly controlled sequential scheduling of every gen() call.
  • When running on backends that do not support parallel/batched LLM requests or have strict memory limits.
  • When you cannot provide parseable docstrings for tool specification extraction.

Failure Modes

  • StringFuture/BooleanFuture may materialize earlier than intended if non-delayed operations are invoked.
  • Traces replay in non-strict mode may substitute different but exchangeable samples; may not match original run exactly.
  • Large parallel batches can exceed memory limits leading to smaller batch sizes and lower speedups.

Core Entities

Models

  • gpt-3.5-turbo-1106
  • llama-7b

Metrics

  • wall-clock time (s)
  • speedup ratio
  • AST-size (number of AST nodes)

Datasets

  • QuALITY
  • vicuna-80 (top 20 instances)