Commit 5f67f46

feat: update gift eval experiment (#259)
2 parents 2fd91e6 + e4e0c3d commit 5f67f46

7 files changed: +1382 -307 lines changed

docs/examples/gift-eval.ipynb

Lines changed: 143 additions & 133 deletions
Large diffs are not rendered by default.

experiments/gift-eval/README.md

Lines changed: 14 additions & 7 deletions
@@ -3,23 +3,24 @@
 This section documents the evaluation of a foundation model ensemble built using the [TimeCopilot](https://timecopilot.dev) library on the [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval) benchmark.
 
 !!! success ""
-    With less than $30 in compute cost, TimeCopilot achieved first place in probabilistic accuracy (CRPS) among non-leaking models on this large-scale benchmark, which spans 24 datasets, 144k+ time series, and 177M data points.
+    With less than $30 in compute cost, TimeCopilot achieved first place in probabilistic accuracy (CRPS) among open-source solutions on this large-scale benchmark, which spans 24 datasets, 144k+ time series, and 177M data points.
 
 
 TimeCopilot is an open‑source AI agent for time series forecasting that provides a unified interface to multiple forecasting approaches, from foundation models to classical statistical, machine learning, and deep learning methods, along with built‑in ensemble capabilities for robust and explainable forecasting.
 
-<img width="1002" height="1029" alt="image" src="https://github.com/user-attachments/assets/6fa8d459-0ca3-45ce-afe5-7fac8400167f" />
+<img width="1002" height="1029" alt="image" src="https://github.com/user-attachments/assets/69724886-d37e-46e6-8a10-d82396695b49" />
+
+
 
 
 
 ## Description
 
 This ensemble leverages [**TimeCopilot's MedianEnsemble**](https://timecopilot.dev/api/models/ensembles/#timecopilot.models.ensembles.median.MedianEnsemble) feature, which combines three state-of-the-art foundation models:
 
-- [**Moirai** (Salesforce AI Research)](https://timecopilot.dev/api/models/foundation/models/#timecopilot.models.foundation.moirai.Moirai).
-- [**Sundial** (THUML @ Tsinghua University)](https://timecopilot.dev/api/models/foundation/models/#timecopilot.models.foundation.sundial.Sundial)
-- [**Toto** (DataDog)](https://timecopilot.dev/api/models/foundation/models/#timecopilot.models.foundation.toto.Toto).
-
+- [**Chronos-2** (AWS)](https://timecopilot.dev/api/models/foundation/models/#timecopilot.models.foundation.chronos.Chronos).
+- [**TimesFM-2.5** (Google Research)](https://timecopilot.dev/api/models/foundation/models/#timecopilot.models.foundation.timesfm.TimesFM).
+- [**TiRex** (NXAI)](https://timecopilot.dev/api/models/foundation/models/#timecopilot.models.foundation.tirex.TiRex).
 
 ## Setup
 
@@ -110,4 +111,10 @@ Results are saved to `results/timecopilot/all_results.csv` in GIFT-Eval format.
 
 ## Changelog
 
-- **2025-08-05**: GIFT‑Eval recently [enhanced its evaluation dashboard](https://github.com/SalesforceAIResearch/gift-eval?tab=readme-ov-file#2025-08-05) with a new flag that identifies models likely affected by data leakage (i.e., having seen parts of the test set during training). While the test set itself hasn’t changed, this new insight helps us better interpret model performance. To keep our results focused on truly unseen data, we’ve excluded any flagged models from this experiment and added the Sundial model to the ensemble. The previous experiment details remain available [here](https://github.com/AzulGarza/timecopilot/tree/v0.0.14/experiments/gift-eval).
+### **2025-11-06**
+
+We introduced newer models based on the most recent progress in the field: Chronos-2, TimesFM-2.5 and TiRex.
+
+### **2025-08-05**
+
+GIFT‑Eval recently [enhanced its evaluation dashboard](https://github.com/SalesforceAIResearch/gift-eval?tab=readme-ov-file#2025-08-05) with a new flag that identifies models likely affected by data leakage (i.e., having seen parts of the test set during training). While the test set itself hasn’t changed, this new insight helps us better interpret model performance. To keep our results focused on truly unseen data, we’ve excluded any flagged models from this experiment and added the Sundial model to the ensemble. The previous experiment details remain available [here](https://github.com/AzulGarza/timecopilot/tree/v0.0.14/experiments/gift-eval).

experiments/gift-eval/pyproject.toml

Lines changed: 3 additions & 2 deletions
@@ -2,12 +2,13 @@
 dependencies = [
   "modal>=1.0.5",
   "s3fs>=2023.12.1",
-  "timecopilot>=0.0.13",
+  "timecopilot>=0.0.21",
   "transformers<4.54",
+  "transformers==4.40.1 ; python_full_version < '3.12'",
   "typer>=0.16.0",
 ]
 description = "TimeCopilot experiments for GIFT-Eval"
 name = "timecopilot-gift-eval"
 readme = "README.md"
 requires-python = ">=3.11"
-version = "0.1.0"
+version = "0.2.0"
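
The two `transformers` entries interact through the environment marker. Since `requires-python` is `>=3.11`, the exact pin `==4.40.1` applies only on Python 3.11.x, while Python 3.12 and later see only the `<4.54` upper bound. The sketch below illustrates that selection. `applicable_transformers_specs` is a hypothetical helper, and plain tuple comparison is only an approximation of the PEP 440 ordering that real installers use when evaluating markers.

```python
import sys


# Hypothetical helper for illustration only; real resolvers evaluate
# PEP 508 markers with PEP 440 version ordering, not tuple comparison.
def applicable_transformers_specs(python_version: tuple[int, ...]) -> list[str]:
    """Return the transformers specifiers that apply for a Python version."""
    specs = ["<4.54"]  # unconditional upper bound
    # Mirrors the marker `python_full_version < '3.12'`.
    if python_version < (3, 12):
        specs.append("==4.40.1")
    return specs


if __name__ == "__main__":
    print(applicable_transformers_specs(tuple(sys.version_info[:3])))
```

On Python 3.11 both specifiers apply, and `4.40.1` satisfies `<4.54`, so the two constraints are compatible.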

experiments/gift-eval/src/download_results.py

Lines changed: 5 additions & 2 deletions
@@ -18,8 +18,11 @@ def download_results():
             f"s3://{bucket}/results/timecopilot/{dataset_name}/{term}/all_results.csv"
         )
         logging.info(f"Downloading {csv_path}")
-        df = pd.read_csv(csv_path, storage_options={"anon": False})
-        dfs.append(df)
+        try:
+            df = pd.read_csv(csv_path, storage_options={"anon": False})
+            dfs.append(df)
+        except Exception as e:
+            logging.error(f"Error downloading {csv_path}: {e}")
 
     df = pd.concat(dfs, ignore_index=True)
     output_dir = Path("results/timecopilot")
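
With failed downloads now logged and skipped, `dfs` can end up empty if every read fails, in which case `pd.concat` raises a `ValueError`. The sketch below shows one possible way to keep the skip-on-error behavior while tracking failures and failing early with a clear message. It is not part of the commit: `collect_results` and `fake_loader` are hypothetical names, and the loader is injected so the example runs without pandas or S3 access.

```python
import logging
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def collect_results(
    paths: Iterable[str],
    loader: Callable[[str], T],
) -> tuple[list[T], list[str]]:
    """Load every path, skipping failures, and report which ones failed.

    `loader` stands in for a call such as `pd.read_csv`.
    """
    loaded: list[T] = []
    failed: list[str] = []
    for path in paths:
        try:
            loaded.append(loader(path))
        except Exception as exc:  # same broad catch as the commit
            logging.error("Error downloading %s: %s", path, exc)
            failed.append(path)
    if not loaded:
        # Fail early with a clear message instead of an opaque concat error.
        raise RuntimeError(f"No results could be downloaded ({len(failed)} failures)")
    return loaded, failed


def fake_loader(path: str) -> dict:
    """Toy loader: pretends any path containing 'missing' does not exist."""
    if "missing" in path:
        raise FileNotFoundError(path)
    return {"path": path}


if __name__ == "__main__":
    ok, bad = collect_results(["a/short.csv", "b/missing.csv"], fake_loader)
    print(len(ok), bad)
```

Returning the list of failed paths also makes it possible to report incomplete benchmark coverage before the merged CSV is written.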

experiments/gift-eval/src/run_modal.py

Lines changed: 2 additions & 2 deletions
@@ -30,8 +30,8 @@
 @app.function(
     image=image,
     volumes=volume,
-    # 3 hours timeout
-    timeout=60 * 60 * 3,
+    # 6 hours timeout
+    timeout=60 * 60 * 6,
     gpu="A10G",
     # as my local
     cpu=8,

experiments/gift-eval/src/run_timecopilot.py

Lines changed: 10 additions & 8 deletions
@@ -6,9 +6,9 @@
 from timecopilot.gift_eval.eval import GIFTEval
 from timecopilot.gift_eval.gluonts_predictor import GluonTSPredictor
 from timecopilot.models.ensembles.median import MedianEnsemble
-from timecopilot.models.foundation.moirai import Moirai
-from timecopilot.models.foundation.sundial import Sundial
-from timecopilot.models.foundation.toto import Toto
+from timecopilot.models.foundation.chronos import Chronos
+from timecopilot.models.foundation.timesfm import TimesFM
+from timecopilot.models.foundation.tirex import TiRex
 
 logging.basicConfig(level=logging.INFO)
 
@@ -40,13 +40,15 @@ def run_timecopilot(
     predictor = GluonTSPredictor(
         forecaster=MedianEnsemble(
             models=[
-                Moirai(
-                    repo_id="Salesforce/moirai-1.1-R-large",
+                Chronos(
+                    repo_id="amazon/chronos-2",
                     batch_size=batch_size,
                 ),
-                Sundial(batch_size=batch_size),
-                Toto(
-                    context_length=1_024,
+                TimesFM(
+                    repo_id="google/timesfm-2.5-200m-pytorch",
+                    batch_size=batch_size,
+                ),
+                TiRex(
                     batch_size=batch_size,
                 ),
             ],
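
To make the idea behind a median ensemble concrete, here is a minimal sketch that combines point forecasts from several models by taking the median at each forecast step. It is not TimeCopilot's `MedianEnsemble` implementation: `median_ensemble`, the model names, and the numbers are hypothetical, and the sketch ignores quantile forecasts entirely.

```python
from statistics import median


def median_ensemble(forecasts: dict[str, list[float]]) -> list[float]:
    """Combine point forecasts by taking the per-step median across models."""
    horizons = {len(values) for values in forecasts.values()}
    if len(horizons) != 1:
        raise ValueError("All models must forecast the same horizon")
    # zip(*...) groups the values that the models predicted for the same step.
    return [median(step) for step in zip(*forecasts.values())]


if __name__ == "__main__":
    # Toy numbers, not real model output.
    toy = {
        "model_a": [10.0, 12.0, 14.0],
        "model_b": [11.0, 30.0, 13.0],
        "model_c": [9.0, 11.0, 15.0],
    }
    print(median_ensemble(toy))  # [10.0, 12.0, 14.0]
```

With three members, the median at each step is the middle value, so a single outlier such as the 30.0 at the second step does not affect the combined forecast.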
