One-click evaluation, automated ensembles, and LLM-powered Q&A for time series forecasting

December 23, 2024 | 6 min

Overview

Decision Snapshot: Needs Validation

System demo built on an existing open benchmark with concrete modules, but no end-to-end production deployment evidence is provided.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xiangfei Qiu, Xiuwen Li, Ruiyang Pang, Zhicheng Pan, Xingjian Wu, Liu Yang, Jilin Hu, Yang Shu, Xuesong Lu, Chengcheng Yang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Bin Yang

Links

Abstract / PDF / Data

Why It Matters For Business

EasyTime speeds method evaluation and selection by reusing a large benchmark and automating ensembles, reducing experiment time and guesswork for forecasting projects.

Who Should Care

Summary TLDR

EasyTime is a demo system that makes time-series forecasting easier for researchers and practitioners. It builds on the TFB benchmark (25 multivariate and 8,068 univariate datasets) and 30+ methods to provide: one-click evaluation for new algorithms, an Automated Ensemble that recommends and combines top methods using TS2Vec + a method-ranking classifier, and a natural-language Q&A module that converts questions to SQL, retrieves benchmark results, and returns answers, charts, and SQL. The system is aimed at faster evaluation, easier model selection, and interactive exploration of historical benchmarking evidence.

Problem Statement

Researchers and practitioners face three pain points: (1) evaluating forecasting methods comprehensively is time-consuming and error-prone; (2) picking suitable methods for a new dataset is hard because no single method wins everywhere; (3) querying benchmarking results or getting practical guidance requires technical effort or expert knowledge.

Main Contribution

A one-click evaluation layer that runs a method across TFB’s diverse datasets and standardized pipelines.

An Automated Ensemble module that uses TS2Vec features and a pre-trained classifier (soft-label loss) to recommend top-k methods and learn ensemble weights on the target data.
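The recommend-then-combine flow described above can be sketched as follows. This is a minimal illustration, not EasyTime's implementation: the dataset embedding is assumed to be already computed (for example by TS2Vec), the pre-trained classifier is replaced by a hypothetical linear scorer `W`, and the ensemble weights are fitted by least squares, then clipped and normalized. The summary does not say how the weights are actually learned, so that step is an assumption.

```python
import numpy as np

def recommend_top_k(embedding, W, method_names, k=3):
    """Score each method from a dataset embedding and return the k best.

    W is a hypothetical (n_methods, dim) matrix standing in for the
    pre-trained method-ranking classifier.
    """
    logits = W @ embedding
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over methods
    order = np.argsort(probs)[::-1][:k]      # indices of the k highest scores
    return [method_names[i] for i in order], probs[order]

def fit_ensemble_weights(val_preds, val_target):
    """Fit non-negative weights that sum to 1 on held-out target data.

    val_preds has shape (n_methods, n_points). Least squares followed by
    clipping is an assumption made for brevity.
    """
    w, *_ = np.linalg.lstsq(val_preds.T, val_target, rcond=None)
    w = np.clip(w, 0.0, None)
    if w.sum() == 0.0:                       # fall back to a plain average
        w = np.ones_like(w)
    return w / w.sum()

def ensemble_forecast(preds, weights):
    """Weighted combination of per-method forecasts."""
    return weights @ preds

# Toy usage with synthetic data (not benchmark results).
rng = np.random.default_rng(0)
names = ["DLinear", "PatchTST", "TimesNet", "NLinear"]
top, scores = recommend_top_k(rng.normal(size=8), rng.normal(size=(4, 8)), names, k=2)
target = np.sin(np.linspace(0, 6, 50))
val_preds = np.stack([target + rng.normal(scale=s, size=50) for s in (0.1, 0.5)])
weights = fit_ensemble_weights(val_preds, target)
combined = ensemble_forecast(val_preds, weights)
print(top, weights.round(3))
```

Constraining the weights to be non-negative and to sum to 1 keeps the combined forecast within the range spanned by the individual forecasts, which makes the ensemble easier to compare against its members.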

Key Findings

TFB contains broad data and precomputed results.

Numbers: 25 multivariate datasets; 8,068 univariate datasets; 8,000+ series with results

Practical Use: You can evaluate methods across many domains without building datasets from scratch.

Evidence Ref: Section II.A

Benchmark includes many methods and accumulated results.

Numbers: 30+ TSF methods evaluated on 8,000+ time series

Practical Use: Use the benchmark results as a knowledge base to guide method choice and form ensembles.

Evidence Ref: Abstract, Section II.A

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Coverage | 30+ methods evaluated | not reported | not reported | TFB (8,000+ series) | Benchmark results collected from evaluating 30+ methods on 8,000+ time series | II.A
Dataset counts | 25 multivariate; 8,068 univariate | not reported | not reported | TFB | Data layer includes 25 multivariate and 8,068 univariate datasets | II.A

What To Try In 7 Days

Embed one of your models into the TFB pipeline and run one-click evaluation.

Upload a real dataset and click 'Recommend Method' to see top-k candidates.

Try the 'AutoML' ensemble flow to compare ensemble vs single methods on your data.
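The first step above amounts to running one forecaster through the same split and the same metric on every dataset. A minimal sketch of such a loop follows. It does not use the TFB API, whose interface is not described in this summary; the `naive_last_value` forecaster, the `evaluate_everywhere` helper, and the two toy series are hypothetical stand-ins.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def naive_last_value(history, horizon):
    """Hypothetical baseline: repeat the last observed value."""
    return np.full(horizon, history[-1], dtype=float)

def evaluate_everywhere(forecaster, datasets, horizon=4):
    """Apply one forecaster to every series with the same hold-out split."""
    results = {}
    for name, series in datasets.items():
        series = np.asarray(series, dtype=float)
        history, future = series[:-horizon], series[-horizon:]
        results[name] = mae(future, forecaster(history, horizon))
    return results

# Toy series, not benchmark data.
toy = {
    "flat": [5.0] * 12,
    "trend": list(range(12)),
}
scores = evaluate_everywhere(naive_last_value, toy)
for name, score in scores.items():
    print(f"{name}: MAE={score:.2f}")
```

Because every forecaster is just a function of `(history, horizon)`, the same loop could also compare an ensemble against its single members, as in the third step.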

Agent Features

Memory
benchmark knowledge base
Tool Use
LLM for NL2SQL; SQL database for retrieval; TS2Vec for feature extraction
Frameworks
TFB, Darts, TSLib
Architectures
offline pretraining + online inference

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Quality of recommendations depends on TFB’s dataset coverage; out-of-distribution datasets may not be well served.

The Automated Ensemble relies on a classifier trained on historical results, so it may misrank methods whose behavior is not reflected in those results.

When Not To Use

When your dataset has patterns not represented in TFB and bespoke modeling is required.

When strict, validated probabilistic forecasting guarantees are required beyond empirical ensembles.

Failure Modes

Classifier recommends poor models for truly novel datasets.

Generated SQL may be incorrect and return misleading data if verification fails.
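One way to reduce the SQL failure mode is to check generated SQL before executing it. The sketch below uses Python's standard `sqlite3` module: it accepts only a single read-only SELECT and asks the database to plan the query first, so that references to unknown tables or columns are caught. The `results` table, its columns, and the `run_checked` helper are hypothetical; this summary does not describe EasyTime's schema or its verification step, and the inserted rows are toy values rather than benchmark results.

```python
import sqlite3

def run_checked(conn, sql):
    """Execute LLM-generated SQL only if it is a single valid SELECT."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        # Conservative: also rejects semicolons inside string literals.
        raise ValueError("multiple statements are not allowed")
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    try:
        # EXPLAIN QUERY PLAN compiles the query without returning its rows,
        # so references to missing tables or columns fail here.
        conn.execute("EXPLAIN QUERY PLAN " + statement)
    except sqlite3.Error as exc:
        raise ValueError(f"invalid SQL: {exc}") from exc
    return conn.execute(statement).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (method TEXT, dataset TEXT, mae REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("A", "toy", 0.30), ("B", "toy", 0.20)],  # toy values
)
rows = run_checked(conn, "SELECT method FROM results ORDER BY mae LIMIT 1;")
print(rows)
```

A syntactic check like this cannot tell whether a valid query answers the question that was asked, so returning the SQL alongside the answer, as the system does, remains useful for human review.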

Core Entities

Models

TS2Vec, NLinear, PatchTST, FiLM, TimesNet, DLinear, Linear, MICN

Metrics

MAE, custom metrics (supported)

Datasets

TFB, traffic, electricity, energy, environment, nature, economic, stock, banking, health, web

Benchmarks

TFB