The ADORE dataset serves as a benchmark to explore the potential and limitations of machine learning in ecotoxicology. It focuses on acute mortality in fish, crustaceans, and algae.
It has been published as a Nature data descriptor.
The ADORE dataset was created to drive ML research in the field of ecotoxicology, focusing on acute mortality of fish, crustaceans, and algae. Its core is extracted from the ECOTOX database (September 2022 release), extended with chemical and taxonomic information. The complete dataset contains 33k data points, the majority of which (26k) are on fish. Only tests where acute mortality is measured as effective concentration 50 (EC50) were retained. See the Nature data paper for a detailed description.
This repository contains the benchmark datasets, the raw data, and the code to generate the benchmark datasets from the raw data.
The dataset was created as part of the SDSC project MLTox: Enhancing Toxicological Testing through Machine Learning. It is also available on ERIC, the institutional data repository of Eawag.
This repository only contains the data. To view and start modeling, head over to the modeling repository ADORE-modeling.
To recreate the dataset, follow these steps:
- Clone the repository. This step needs an SSH connection.

  ```bash
  git clone git@github.com:LiliGasser/adore.git
  cd adore/
  ```
- Install Git LFS. The data files are stored as Git Large File Storage (LFS) files. This step can take a few minutes.

  ```bash
  git lfs install --local
  git lfs pull -I "data/raw/*"
  ```
- Create a conda environment. This command installs the environment directly in the project folder using the provided `environment.yml`.

  ```bash
  conda env create --prefix ./conda-env --file ./environment.yml
  conda activate ./conda-env
  ```

  If you prefer mamba:

  ```bash
  mamba env create --prefix ./conda-env --file ./environment.yml
  conda activate ./conda-env
  ```
- Run the three scripts to recreate the data. This may take a few minutes, especially for the first and the third script.

  ```bash
  python scripts/01_preprocessing_rawdata.py
  python scripts/02_preprocessing_filtering.py
  python scripts/03_splits.py
  ```
The output is stored in `data` in the subfolders `processed`, `chemicals`, and `taxonomy`. The final challenge datasets are stored in `data/processed`. We include intermediate files so the processing steps can be retraced.
- The first script harmonizes and pre-filters the ECOTOX data, to which taxonomic and chemical data are then added. After filtering for acute mortality, the intermediate file `ecotox_mortality_processed.csv` is stored. Additionally, the processed separate files from ECOTOX are stored in `data/processed/before_aggregation`:
  - species: `ecotox_species.csv`
  - tests: `ecotox_tests.csv`
  - results: `ecotox_results.csv`
  - compiled chemical properties: `ecotox_properties.csv`
- In the second script, the intermediate file is further processed and filtered to `ecotox_mortality_filtered.csv` (see the loading sketch after this list).
- In the third script, we generate the 11 challenge data subsets with the data splits.
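The intermediate files can be inspected with pandas; a minimal sketch, assuming they are plain comma-separated CSVs stored under `data/processed`:

```python
import pandas as pd

# Quick sanity check of the intermediate files written by the first two
# scripts (paths and CSV read defaults are assumptions).
processed = pd.read_csv("data/processed/ecotox_mortality_processed.csv")
filtered = pd.read_csv("data/processed/ecotox_mortality_filtered.csv")
print(processed.shape, filtered.shape)
```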
The ADORE dataset is split into several benchmark datasets, called challenges. For each challenge, we provide a file whose name starts with the challenge name, e.g., `a-F2F`, and ends with `_mortality.csv`. These datasets contain all information needed for modeling as well as columns to uniquely identify and retrace each entry.
We retain the ECOTOX test and result IDs for each entry so that entries can be checked against ECOTOX. Please open an issue if you find a mistake.
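A challenge file can thus be loaded directly; a minimal sketch (the exact file name `a-F2F_mortality.csv` under `data/processed` is an assumption based on the naming pattern above):

```python
import pandas as pd

# Load one challenge dataset; the file name follows the pattern above.
challenge = pd.read_csv("data/processed/a-F2F_mortality.csv")
print(challenge.shape)
print(challenge.columns.tolist())  # modeling features plus identifier columns
```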
Additionally, we provide separate files in the folder `chemicals`:

- The chemical ontology (`classyfire_output.csv`) and the functional use categories (`functional_uses_output.csv`) can be matched with the data in the challenge files using the InChIKey or the DTXSID, respectively, as sketched below.
- Three files explain the bits of the MACCS, PubChem, and ToxPrint fingerprints (`maccs_bits.tsv`, `pubchem_bits.csv`, and `toxprint_bits.csv`).
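A sketch of such a match with pandas; the join-key column names (`inchikey`, `dtxsid`) are assumptions and should be checked against the actual headers:

```python
import pandas as pd

challenge = pd.read_csv("data/processed/a-F2F_mortality.csv")
ontology = pd.read_csv("data/chemicals/classyfire_output.csv")
uses = pd.read_csv("data/chemicals/functional_uses_output.csv")

# Join the chemical ontology on the InChIKey and the functional use
# categories on the DTXSID (column names are assumptions).
merged = challenge.merge(ontology, on="inchikey", how="left")
merged = merged.merge(uses, on="dtxsid", how="left")
```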
In the folder `taxonomy`:

- the file `tax_pdm_species` contains phylogenetic distance information which can be used for modeling (see the sketch below).
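A phylogenetic distance matrix is typically a square table indexed by species; a minimal sketch, where the `.csv` extension, the matrix layout, and the species pair are all assumptions:

```python
import pandas as pd

# Pairwise phylogenetic distances between species; the .csv extension and
# the square-matrix layout (species names as index) are assumptions.
pdm = pd.read_csv("data/taxonomy/tax_pdm_species.csv", index_col=0)
dist = pdm.loc["Danio rerio", "Oncorhynchus mykiss"]  # hypothetical pair
print(dist)
```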
All scripts were run with the `python scripts/<script-name>` command, which automatically tracks the workflow, as the files contain Renku code snippets wherever a file is read or written. The main reading functions are wrappers around the corresponding pandas functions and can be found in `src/utils.py`. The reading functions rely on the Renku `Input` class, whereas files are written with the `Output` class.
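Conceptually, such a wrapper looks like the sketch below. This is not the repository's actual code; the `Input`/`Output` signatures are assumed from the Renku 1.x Python API and may differ across Renku versions:

```python
import pandas as pd
from renku.api import Input, Output  # Renku workflow-tracking API

def read_csv(path, name="input-file", **kwargs):
    # Registering the file as a workflow input lets Renku record the
    # read in the workflow graph (signature assumed from Renku 1.x).
    return pd.read_csv(Input(name, path), **kwargs)

def write_csv(df, path, name="output-file", **kwargs):
    # Likewise, Output registers the written file as a workflow output.
    df.to_csv(Output(name, path), **kwargs)
```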
See Figure 1 for a conceptual overview and Figure 2 for the corresponding Renku workflow.
February 13, 2025: The challenge datasets were updated to be split by canonical SMILES instead of CAS number. This ensures that the training and test sets do not share chemicals with the same canonical SMILES. The previous split by CAS number placed some chemicals with the same canonical SMILES in both the training and the test set. This update did not lead to any relevant change in modeling results.
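A hypothetical sanity check of this property with pandas; the column names `split` and `smiles_canonical` are assumptions and should be adapted to the actual challenge files:

```python
import pandas as pd

# Verify that no canonical SMILES occurs in both the training and the
# test split (file path and column names are assumptions).
df = pd.read_csv("data/processed/a-F2F_mortality.csv")
train = set(df.loc[df["split"] == "train", "smiles_canonical"])
test = set(df.loc[df["split"] == "test", "smiles_canonical"])
assert train.isdisjoint(test), "train/test share canonical SMILES"
```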
This work is licensed under a Creative Commons Attribution 4.0 International License.


