Reprogram frozen LLMs to forecast time series using text prototypes and Prompt-as-Prefix

October 3, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

127

Authors

Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, Qingsong Wen

Links

Abstract / PDF

Why It Matters For Business

You can add time series forecasting to an existing LLM deployment with little extra training and often better accuracy in low-data and cross-domain cases.

Summary TLDR

This paper shows you can reuse a frozen large language model (LLM) for time series forecasting by converting local time windows into a small set of learned "text prototypes" and adding a structured natural-language prefix (Prompt-as-Prefix). The lightweight reprogramming network (<6.6M trainable params) trains fast and keeps the backbone LLM frozen. On standard benchmarks (ETT, Weather, Electricity, Traffic, M4) TIME-LLM matches or outperforms specialized forecasting models and recent LLM-based baselines, particularly in few-shot and zero-shot settings.

Problem Statement

Time series models are usually task-specific and need lots of data. LLMs are powerful but operate on discrete tokens and were not designed for continuous time series. The challenge is aligning continuous time series with LLMs without expensive fine-tuning of the backbone.

Main Contribution

A reprogramming method that maps time series patches into a small set of learned text prototypes so a frozen LLM can process them.

Prompt-as-Prefix (PaP): add structured natural-language context (dataset caption, task instruction, input statistics) to guide the LLM’s transformation of patches.

A lightweight pipeline (input transform + projection) that trains <6.6M params and unlocks strong few-shot and zero-shot forecasting from off-the-shelf LLMs.

Extensive evaluation across long-term, short-term, few-shot and zero-shot benchmarks showing competitive or superior performance versus specialized models.
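The reprogramming step can be pictured as a cross-attention in which each patch embedding queries a small bank of text prototypes and is rewritten as a weighted mix of them. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, and fills the prototypes with random vectors where the real method learns them. The names `reprogram_patches`, `d_model`, `num_prototypes` and `num_patches` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reprogram_patches(patch_embeddings, prototypes):
    """Single-head cross-attention: patches are queries, prototypes are keys and values.

    Each output row is a convex combination of the prototype vectors, so it
    lives in the same space as the prototypes (and hence the LLM's inputs).
    """
    d = len(prototypes[0])
    reprogrammed = []
    for query in patch_embeddings:
        # Scaled dot-product similarity between this patch and every prototype.
        scores = [
            sum(q * k for q, k in zip(query, proto)) / math.sqrt(d)
            for proto in prototypes
        ]
        weights = softmax(scores)
        # Weighted sum of prototypes, one coordinate at a time.
        reprogrammed.append(
            [sum(w * proto[j] for w, proto in zip(weights, prototypes)) for j in range(d)]
        )
    return reprogrammed


# Illustrative sizes only (assumptions, not values from the paper).
random.seed(0)
d_model, num_prototypes, num_patches = 8, 4, 5
prototypes = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_prototypes)]
patches = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_patches)]

reprogrammed = reprogram_patches(patches, prototypes)
print(len(reprogrammed), len(reprogrammed[0]))
```

Because the attention weights sum to one, every reprogrammed patch stays inside the span of the prototypes. That is the sense in which the frozen LLM only ever sees inputs expressed in its own embedding vocabulary.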

Key Findings

TIME-LLM improves average long-term MSE over a fine-tuned LLM baseline (GPT4TS).

Numbers: ≈12% average MSE reduction vs GPT4TS on the evaluated long-term benchmarks

TIME-LLM yields state-of-the-art results in most long-term cases.

Numbers: SOTA in 36 out of 40 long-term instances (across eight benchmarks)

Strong few-shot and zero-shot performance under data scarcity.

Numbers: ≈5% MSE reduction vs GPT4TS in the 10% few-shot setting; up to 22% improvement vs GPT4TS in zero-shot

The trainable reprogramming module is compact.

Numbers: <6.6M trainable params (~0.2% of Llama-7B); reported reprogramming module size is 6.39M

Both patch reprogramming and prompt-as-prefix are necessary for best results.

Numbers (ablation): removing reprogramming or PaP increases MSE by >8–10% on average in some settings

Results

Long-term MSE (vs GPT4TS)

Value: ≈12% average reduction

Baseline: GPT4TS

Short-term SMAPE (M4 benchmark)

Value: 11.983 (TIME-LLM)

Baseline: GPT4TS 12.690

Trainable parameters added

Value: 6.39M (reprogramming module)

Baseline: full Llama-7B fine-tuning
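To put the short-term SMAPE row in relative terms, the sketch below implements SMAPE in the form used by the M4 competition (mean of 200·|y − f| / (|y| + |f|), in percent) and computes the relative reduction from 12.690 to 11.983. The helper names `smape` and `relative_reduction` and the toy series are illustrative assumptions. Only the two SMAPE scores come from the results above.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent: mean of 200 * |y - f| / (|y| + |f|)."""
    terms = []
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        # Convention assumed here: a 0/0 term contributes zero error.
        terms.append(0.0 if denom == 0 else 200.0 * abs(y - f) / denom)
    return sum(terms) / len(terms)


def relative_reduction(baseline, value):
    """Percentage by which `value` improves on `baseline` (lower is better)."""
    return 100.0 * (baseline - value) / baseline


# Toy check of the metric on made-up numbers.
toy_score = smape([100, 200], [110, 180])

# Relative SMAPE reduction of TIME-LLM (11.983) over GPT4TS (12.690) on M4.
gain = relative_reduction(12.690, 11.983)
print(f"toy SMAPE = {toy_score:.3f}, relative reduction = {gain:.2f}%")
```

The gap of about 0.7 SMAPE points corresponds to a relative reduction of roughly 5.6%.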

What To Try In 7 Days

Run the TIME-LLM code on one of your standard forecasting datasets (e.g., hourly electricity) using a frozen Llama-7B backbone.

Instead of fine-tuning the full backbone on a GPU cluster, train only the small reprogramming module (<6.6M params) and compare accuracy and training speed.

Craft Prompt-as-Prefix entries: dataset caption, task instruction, and simple input statistics (trend, top lags) and measure lift versus no-prompt.
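A Prompt-as-Prefix entry can be assembled from a few cheap statistics of the input window. The sketch below is one possible way to do it, not the paper's exact template: the wording of the prompt, the function names (`autocorrelation`, `top_lags`, `build_prefix`), the sample caption, and the use of a plain autocorrelation estimate to rank lags are all assumptions made for illustration.

```python
import statistics


def autocorrelation(series, lag):
    """Biased sample autocorrelation of `series` at the given lag."""
    mean = statistics.fmean(series)
    num = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, len(series))
    )
    den = sum((x - mean) ** 2 for x in series)
    return num / den if den else 0.0


def top_lags(series, k=3):
    """Return the k lags with the highest autocorrelation (ties keep the smaller lag)."""
    lags = range(1, len(series) // 2)
    return sorted(lags, key=lambda lag: autocorrelation(series, lag), reverse=True)[:k]


def build_prefix(series, dataset_caption, horizon, k=3):
    """Compose caption, task instruction and input statistics into one prefix string."""
    # Net change over the window decides the trend label.
    trend = "upward" if series[-1] - series[0] > 0 else "downward"
    return (
        f"Dataset: {dataset_caption} "
        f"Task: forecast the next {horizon} steps given the previous {len(series)} steps. "
        f"Input statistics: min {min(series):.2f}, max {max(series):.2f}, "
        f"median {statistics.median(series):.2f}, trend {trend}, "
        f"top lags {top_lags(series, k)}."
    )


# A toy series with period 4, so lags 4 and 8 should rank highest.
series = [0, 1, 0, -1] * 6
prefix = build_prefix(series, "Hourly electricity load (hypothetical caption).", horizon=12, k=2)
print(prefix)
```

To measure the lift from prompting, run the same model once with this prefix and once with an empty string, keeping everything else fixed.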

Optimization Features

Token Efficiency

  • Input patching reduces token sequence length compared with raw time steps
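The token saving from patching is easy to quantify. The sketch below slices a series into overlapping windows and counts them. The patch length of 16 and stride of 8 are illustrative values assumed here, `make_patches` is a hypothetical helper, and any end-padding a real implementation might apply is left out.

```python
def make_patches(series, patch_len, stride):
    """Slice a series into overlapping windows; each window becomes one input token."""
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


series = list(range(512))  # 512 raw time steps
patches = make_patches(series, patch_len=16, stride=8)

# Number of patches = (T - patch_len) // stride + 1
print(len(series), "time steps ->", len(patches), "patch tokens")
```

With these settings the backbone processes 63 patch tokens instead of 512 per-step tokens, roughly an eightfold reduction in sequence length.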

Infra Optimization

  • LoRA

Model Optimization

  • Keep backbone frozen; only small adapter-like module trained

System Optimization

  • Compatible with off-the-shelf quantization and PEFT techniques

Training Optimization

  • Train <6.6M params; batch sizes and epoch counts are small (see Tab. 9 of the paper)

Inference Optimization

  • Inference cost dominated by backbone LLM; can apply quantization

Reproducibility

Data Urls

  • ETT, M4 and other public benchmarks referenced in paper (see Sec.B.2)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Depends on access to large pretrained LLMs; backbone inference cost remains high.
  • Numeric precision and tokenization quirks of LLMs can complicate direct numeric outputs; the method sidesteps this by projecting the LLM's outputs to forecasts, but some risk remains.
  • Evaluations are on canonical public benchmarks; performance on domain-specific, high-volatility series (extreme events) is not deeply explored.

When Not To Use

  • When strict low-latency or very low-cost inference is required and you cannot host a large LLM.
  • If you cannot access a suitable backbone LLM or must avoid proprietary/model-license constraints.
  • When you require fully interpretable classical models for regulatory reasons.

Failure Modes

  • Backbone LLM generates representations insensitive to fine numeric detail, hurting long-horizon precision if projection is misaligned.
  • Domain shift: prototypes learned on one domain may not generalize to very different series without additional data.
  • Outliers or extreme events may be poorly captured if prompts or prototypes do not include relevant context.

Core Entities

Models

  • Llama-7B
  • GPT-2
  • GPT4TS
  • LLMTime
  • PatchTST
  • TimesNet
  • DLinear
  • N-HiTS
  • N-BEATS
  • FEDformer
  • Autoformer

Metrics

  • MSE
  • MAE
  • SMAPE
  • MASE
  • OWA

Datasets

  • ETTh1
  • ETTh2
  • ETTm1
  • ETTm2
  • Weather
  • Electricity
  • Traffic
  • ILI
  • M4
  • M3-Quarterly

Benchmarks

  • long-term forecasting
  • short-term forecasting
  • few-shot forecasting
  • zero-shot forecasting