
ADORE: A benchmark dataset for machine learning in ecotoxicology

The ADORE dataset serves as a benchmark to explore the potential and limitations of machine learning in ecotoxicology. It focuses on acute mortality in fish, crustaceans, and algae.

It has been published as a Nature data descriptor.

A. Project description

The ADORE dataset (A benchmark Dataset fOR machine learning in Ecotoxicology) was created to drive machine learning (ML) research in the field of ecotoxicology.

ADORE focuses on acute mortality in fish, crustaceans, and algae. The core of ADORE is extracted from the ECOTOX database (September 2022 release) and extended with chemical and taxonomic information. The complete dataset contains 33k data points, the majority of which (26k) are on fish. Only tests in which acute mortality was measured as the effective concentration 50 (EC50) were retained. See the Nature data paper for a detailed description.

This repository contains the benchmark datasets, the raw data, and the code to generate the benchmark datasets from the raw data.

The dataset was created as part of the SDSC project MLTox: Enhancing Toxicological Testing through Machine Learning. It is also available on ERIC, the institutional data repository of Eawag.

B. Getting started

This repository contains only the data. To view it and start modeling, head over to the modeling repository, ADORE-modeling.

To recreate the dataset, follow these steps:

  1. Clone the repository.

This step requires an SSH connection to GitHub.

git clone git@github.com:LiliGasser/adore.git
cd adore/
  2. Install Git LFS.

The data files are stored via Git Large File Storage (LFS); pulling them can take a few minutes.

git lfs install --local
git lfs pull -I "data/raw/*"
  3. Create a conda environment.

This command installs the environment directly in the project folder using the provided environment.yml.

conda env create --prefix ./conda-env --file ./environment.yml
conda activate ./conda-env

Alternatively, if you prefer mamba:

mamba env create --prefix ./conda-env --file ./environment.yml
conda activate ./conda-env

C. Example / Usage

Recreate the data

Run these three scripts to recreate the data. This may take a few minutes, especially for the first and the third script.

python scripts/01_preprocessing_rawdata.py
python scripts/02_preprocessing_filtering.py
python scripts/03_splits.py

The output is stored in data, in the subfolders processed, chemicals, and taxonomy. The final challenge datasets are stored in data/processed. Intermediate files are included so that the individual processing steps can be followed.

Data generation

  1. The first script harmonizes and pre-filters the ECOTOX data, to which taxonomic and chemical data are then added. After filtering for acute mortality, the intermediate file ecotox_mortality_processed.csv is stored. Additionally, the processed individual ECOTOX tables are stored in data/processed/before_aggregation:
  • species: ecotox_species.csv
  • tests: ecotox_tests.csv
  • results: ecotox_results.csv
  • compiled chemical properties: ecotox_properties.csv
  2. In the second script, the intermediate file is further processed and filtered into ecotox_mortality_filtered.csv.

  3. In the third script, we generate the 11 challenge data subsets together with the data splits.

Challenge datasets

The ADORE dataset is split into several benchmark datasets, called challenges. For each challenge, we provide a file whose name starts with the challenge name, e.g., a-F2F, and ends with _mortality.csv. These datasets contain all the information needed for modeling as well as columns to uniquely identify and retrace each entry.

We retain the ECOTOX test and result IDs for each entry so that entries can be checked against ECOTOX. Please open an issue if you find a mistake.
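As a sketch of how the identifier columns and the modeling columns can be kept apart, here is a minimal pandas example. The column names (test_id, result_id, test_cas, conc1_mean_mol) and values are illustrative assumptions, not the exact ADORE schema; check the actual challenge files.

```python
import pandas as pd

# Synthetic stand-in for a challenge file such as a-F2F..._mortality.csv.
# Column names and values here are illustrative assumptions only.
df = pd.DataFrame({
    "test_id": [101, 102],
    "result_id": [1, 2],
    "test_cas": ["50-00-0", "71-43-2"],
    "conc1_mean_mol": [1.2e-4, 3.4e-5],
})

# Keep the ECOTOX identifiers aside for traceability; model on the rest.
id_cols = ["test_id", "result_id"]
ids = df[id_cols]
X = df.drop(columns=id_cols)
```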

Additional output data on chemicals and taxonomy

Additionally, we provide separate files in the folder chemicals.

  • The chemical ontology (classyfire_output.csv) and the functional use categories (functional_uses_output.csv) can be matched with the data in the challenge files using the InChIKey or the DTXSID, respectively.
  • Three files explain the bits of the MACCS, PubChem, and ToxPrint fingerprints (maccs_bits.tsv, pubchem_bits.csv, and toxprint_bits.csv).
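The matching described above amounts to a left join on the shared key. A minimal sketch with made-up rows, assuming an "inchikey" column name in both tables (the real column names in classyfire_output.csv may differ):

```python
import pandas as pd

# Toy challenge table and toy ontology table; values are made up.
challenge = pd.DataFrame({
    "inchikey": ["AAA", "BBB"],
    "ec50": [0.1, 0.2],
})
ontology = pd.DataFrame({
    "inchikey": ["AAA", "BBB"],
    "superclass": ["Benzenoids", "Organic acids"],
})

# Left join keeps every challenge entry, adding ontology columns by key.
merged = challenge.merge(ontology, on="inchikey", how="left")
```

The same pattern applies to functional_uses_output.csv with the DTXSID as the join key.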

In the folder taxonomy:

  • the file tax_pdm_species contains phylogenetic distance information, which can be used for modeling.
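A phylogenetic distance matrix is typically a symmetric species-by-species table; a sketch with made-up species and distances (the actual layout of tax_pdm_species may differ):

```python
import pandas as pd

# Made-up symmetric distance matrix between three species.
species = ["Danio rerio", "Oncorhynchus mykiss", "Daphnia magna"]
pdm = pd.DataFrame(
    [[0.0, 0.3, 0.9],
     [0.3, 0.0, 0.8],
     [0.9, 0.8, 0.0]],
    index=species, columns=species,
)

# Look up the distance between two species for use as a model feature.
d = pdm.loc["Danio rerio", "Daphnia magna"]
```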

D. Renku

All scripts were run with the python scripts/&lt;script-name&gt; command, which automatically tracks the workflow because the files contain Renku code snippets wherever a file is read or written. The main reading functions are wrappers around the corresponding pandas functions and can be found in src/utils.py. The reading functions rely on the Renku Input class, whereas files are written with the Output class.
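Conceptually, the wrapper pattern looks like the sketch below. The Input class here is a stand-in, since the exact Renku API and import path vary across versions; the real wrappers are in src/utils.py.

```python
import pandas as pd

# Stand-in for the Renku Input class: constructing it is what registers
# the file as a workflow input in the real API.
class Input:
    def __init__(self, name, path):
        self.name = name
        self.path = path

def read_csv_tracked(name, path, **kwargs):
    # Wrap pd.read_csv so every read goes through an Input object and
    # is therefore visible to the workflow tracker.
    return pd.read_csv(Input(name, path).path, **kwargs)
```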

See Figure 1 for a conceptual overview and Figure 2 for the corresponding Renku workflow.

Figure 1: conceptual overview of the data generation.

Figure 2: corresponding Renku workflow.

Changelog

February 13, 2025: The challenge datasets were updated to be split by canonical SMILES instead of CAS number. This ensures that the training and test splits do not share chemicals with the same canonical SMILES. The previous split by CAS number placed some chemicals with the same canonical SMILES in both the training and the test set. This update did not lead to any relevant change in modeling results.
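The guarantee described above is a group-based split: every canonical SMILES ends up entirely in either the training or the test set. A minimal sketch with made-up records and an assumed 80/20 ratio (not the actual ADORE split procedure):

```python
import random

# Toy (SMILES, entry id) records; benzene appears twice on purpose.
records = [
    ("C1=CC=CC=C1", 1), ("C1=CC=CC=C1", 2),
    ("CCO", 3), ("O=C=O", 4), ("N", 5),
]

# Split the unique SMILES, not the rows, so duplicates stay together.
smiles = sorted({s for s, _ in records})
rng = random.Random(42)
rng.shuffle(smiles)
n_test = max(1, int(0.2 * len(smiles)))
test_smiles = set(smiles[:n_test])

train = [r for r in records if r[0] not in test_smiles]
test = [r for r in records if r[0] in test_smiles]
```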

Licence

This work is licensed under a Creative Commons Attribution 4.0 International License.
