One-click evaluation, automated ensembles, and LLM-powered Q&A for time series forecasting

December 23, 2024 · 6 min read

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Xiangfei Qiu, Xiuwen Li, Ruiyang Pang, Zhicheng Pan, Xingjian Wu, Liu Yang, Jilin Hu, Yang Shu, Xuesong Lu, Chengcheng Yang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Bin Yang

Why It Matters For Business

EasyTime speeds up method evaluation and selection by reusing a large benchmark and automating ensembles, which reduces experiment time and guesswork in forecasting projects.

Summary TLDR

EasyTime is a demo system that makes time-series forecasting easier for researchers and practitioners. It builds on the TFB benchmark (25 multivariate and 8,068 univariate datasets, 30+ methods) to provide three things: one-click evaluation for new algorithms; an Automated Ensemble that recommends and combines top methods using TS2Vec features and a method-ranking classifier; and a natural-language Q&A module that converts questions to SQL, retrieves benchmark results, and returns answers, charts, and the SQL itself. The system targets faster evaluation, easier model selection, and interactive exploration of historical benchmarking evidence.

Problem Statement

Researchers and practitioners face three pain points: (1) evaluating forecasting methods comprehensively is time-consuming and error-prone; (2) picking suitable methods for a new dataset is hard because no single method wins everywhere; (3) querying benchmarking results or getting practical guidance requires technical effort or expert knowledge.

Main Contribution

A one-click evaluation layer that runs a method across TFB’s diverse datasets and standardized pipelines.

An Automated Ensemble module that uses TS2Vec features and a pre-trained classifier (soft-label loss) to recommend top-k methods and learn ensemble weights on the target data.

A natural-language Q&A module that turns user questions into SQL, verifies queries, retrieves benchmark data, and returns natural language answers plus charts and SQL.

An integrated demo showing how benchmark knowledge can support method selection, automated ensembles, and interactive queries.
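
The Automated Ensemble step described above (recommend top-k methods, then learn ensemble weights on the target data) can be sketched roughly as follows. This is a minimal illustration, not EasyTime's implementation: the function names, the softmax parameterization of the weights, and the plain gradient-descent fit on mean squared error are all assumptions made for the example. The classifier scores are taken as given.

```python
import numpy as np

def top_k_methods(scores, names, k=3):
    """Return the k method names with the highest classifier scores."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]

def fit_ensemble_weights(preds, target, steps=500, lr=0.1):
    """Learn non-negative weights that sum to 1 by minimizing validation MSE.

    preds:  (k, n) array, one row of forecasts per candidate method.
    target: (n,) array of actual values on the validation window.
    The weights are a softmax of free parameters (an assumption of this
    sketch), so they stay on the probability simplex during training.
    """
    theta = np.zeros(preds.shape[0])
    for _ in range(steps):
        w = np.exp(theta - theta.max())
        w /= w.sum()
        err = w @ preds - target                 # ensemble error, shape (n,)
        grad_w = 2.0 * preds @ err / err.size    # dMSE/dw, shape (k,)
        # Chain rule through the softmax: dL/dtheta_i = w_i * (g_i - w . g)
        theta -= lr * w * (grad_w - w @ grad_w)
    w = np.exp(theta - theta.max())
    return w / w.sum()
```

In use, one would pass the forecasts of the recommended candidates on a held-out window as `preds` and combine future forecasts as `weights @ future_preds`.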

Key Findings

TFB contains broad data and precomputed results.

Numbers: 25 multivariate datasets; 8,068 univariate datasets; 8,000+ series with results

Benchmark includes many methods and accumulated results.

Numbers: 30+ TSF methods evaluated on 8,000+ time series

Automated Ensemble uses representation learning plus a classifier to rank methods.

Numbers: TS2Vec pretraining + classifier trained on 30+ methods
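
One plausible reading of a soft-label loss for a method-ranking classifier is sketched below. The summary does not say how the soft labels are built, so the construction here (a softmax over negated per-method errors, with a hypothetical `temperature` parameter) is an assumption for illustration only. The loss itself is the standard cross-entropy against a probability vector instead of a one-hot label.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_labels_from_errors(errors, temperature=1.0):
    """Turn per-method errors into target probabilities (lower error, higher mass).

    `temperature` is a hypothetical knob: smaller values sharpen the targets.
    """
    return softmax(-np.asarray(errors, dtype=float) / temperature)

def soft_label_cross_entropy(logits, soft_targets):
    """Mean cross-entropy between soft targets and the classifier's logits."""
    return float(-(soft_targets * log_softmax(logits)).sum(axis=-1).mean())
```

Compared with a one-hot "best method" label, soft targets keep information about near-ties, which matters when several methods perform almost equally well on a dataset.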

Q&A returns natural language, charts, and SQL for transparency.

Numbers: Answers include NL text, charts, SQL, and benchmark data table
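
The verify-then-retrieve part of the Q&A flow can be illustrated with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `results` table, its columns, and the placeholder rows are invented for the example and are not TFB's schema or real benchmark numbers, and the two checks shown (SELECT-only, then compiling the query with `EXPLAIN`) are one simple way to verify a query, not necessarily what EasyTime does. The LLM call that produces the SQL is out of scope here.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (method TEXT, dataset TEXT, metric TEXT, value REAL)"
    )
    # Placeholder rows for illustration only; not real benchmark results.
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [
            ("method_a", "demo", "MAE", 1.0),
            ("method_b", "demo", "MAE", 2.0),
        ],
    )
    return conn

def verify_sql(conn, sql):
    """Return (ok, reason) for a generated query without running it for real."""
    if not sql.lstrip().lower().startswith("select"):
        return False, "only SELECT statements are allowed"
    try:
        # EXPLAIN compiles the statement, so unknown tables or columns and
        # multi-statement strings raise here before any data is read.
        conn.execute("EXPLAIN " + sql)
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    return True, "ok"

def answer(conn, sql):
    """Run a generated query only after it passes verification."""
    ok, reason = verify_sql(conn, sql)
    if not ok:
        raise ValueError(f"rejected query: {reason}")
    return conn.execute(sql).fetchall()
```

Returning the SQL alongside the answer, as the system does, lets a user repeat this kind of check by hand when a result looks suspicious.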

Results

Coverage

Value: 30+ methods evaluated

Dataset counts

Value: 25 multivariate; 8,068 univariate

Q&A outputs

Value: Natural language answer + charts + SQL + data table

Who Should Care

What To Try In 7 Days

Embed one of your models into the TFB pipeline and run one-click evaluation.

Upload a real dataset and click 'Recommend Method' to see top-k candidates.

Try the 'AutoML' ensemble flow to compare ensemble vs single methods on your data.
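
Before wiring a model into the real pipeline, it can help to see what an evaluation loop of this kind does in miniature. The sketch below is not TFB's API: the forecaster interface (a callable taking `history` and `horizon`), the `evaluate` function, and the single holdout split are all assumptions chosen to keep the example short. It reports MAE and accepts any additional custom metric as a plain function.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def naive_last_value(history, horizon):
    """Baseline forecaster: repeat the last observed value."""
    return np.repeat(history[-1], horizon)

def evaluate(forecaster, datasets, horizon, metrics):
    """Score one forecaster on every series using a single holdout window.

    datasets: dict mapping a name to a 1-D sequence of values.
    metrics:  dict mapping a metric name to a function(y_true, y_pred).
    """
    report = {}
    for name, series in datasets.items():
        series = np.asarray(series, dtype=float)
        history, future = series[:-horizon], series[-horizon:]
        forecast = forecaster(history, horizon)
        report[name] = {m: fn(future, forecast) for m, fn in metrics.items()}
    return report
```

A call such as `evaluate(naive_last_value, {"my_series": values}, horizon=24, metrics={"mae": mae})` gives a baseline to compare against whatever the recommended methods or the ensemble produce.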

Agent Features

Memory

  • benchmark knowledge base

Tool Use

  • LLM for NL2SQL
  • SQL database for retrieval
  • TS2Vec for feature extraction

Frameworks

  • TFB
  • Darts
  • TSLib

Architectures

  • offline pretraining + online inference

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Quality of recommendations depends on TFB’s dataset coverage; out-of-distribution datasets may not be well served.
  • Automated Ensemble relies on a classifier trained on historical results and may misrank methods whose behavior is not reflected in those results.
  • NL answers depend on LLM correctness; verification reduces but does not eliminate hallucinations.

When Not To Use

  • When your dataset has patterns not represented in TFB and bespoke modeling is required.
  • When strict, validated probabilistic forecasting guarantees are required beyond empirical ensembles.

Failure Modes

  • Classifier recommends poor models for truly novel datasets.
  • Verification fails to catch an incorrect generated SQL query, which then returns misleading data.
  • Ensemble overfits small datasets when top-k candidates are too similar.

Core Entities

Models

  • TS2Vec
  • NLinear
  • PatchTST
  • FiLM
  • TimesNet
  • DLinear
  • Linear
  • MICN

Metrics

  • MAE
  • custom metrics (supported)

Datasets

  • TFB
  • traffic
  • electricity
  • energy
  • environment
  • nature
  • economic
  • stock
  • banking
  • health
  • web

Benchmarks

  • TFB