
ADORE: A benchmark dataset for machine learning in ecotoxicology

The ADORE dataset serves as a benchmark to explore the potential and limitations of machine learning in ecotoxicology. It focuses on acute mortality in fish, crustaceans, and algae.

It has been published as a Nature data descriptor.

A. Project description

The ADORE dataset (A benchmark Dataset fOR machine learning in Ecotoxicology) was created to drive machine learning (ML) research in the field of ecotoxicology.

ADORE focuses on acute mortality in fish, crustaceans, and algae. The core of ADORE is extracted from the ECOTOX database (September 2022 release) and extended with chemical and taxonomic information. The complete dataset contains 33k data points, the majority of which (26k) are on fish. Only tests in which acute mortality was measured as the effective concentration 50 (EC50) were retained. See the Nature data paper for a detailed description.

This repository contains the benchmark datasets, the raw data, and the code to generate the benchmark datasets from the raw data.

The dataset was created as part of the SDSC project MLTox: Enhancing Toxicological Testing through Machine Learning. It is also available on ERIC, the institutional data repository of Eawag.

B. Getting started

This repository contains only the data. To view it and start modeling, head over to the modeling repository, ADORE-modeling.

To recreate the dataset, follow these steps:

  1. Clone the repository.

This step requires an SSH connection to GitHub.

git clone git@github.com:LiliGasser/adore.git
cd adore/
  2. Install Git LFS.

The data files are stored via Git Large File Storage (LFS); pulling them can take a few minutes.

git lfs install --local
git lfs pull -I "data/raw/*"
  3. Create a conda environment.

This command installs the environment directly in the project folder using the provided environment.yml.

conda env create --prefix ./conda-env --file ./environment.yml
conda activate ./conda-env

Alternatively, if you prefer mamba:

mamba env create --prefix ./conda-env --file ./environment.yml
conda activate ./conda-env

C. Example / Usage

Recreate the data

Run these three scripts to recreate the data. This may take a few minutes, especially for the first and the third script.

python scripts/01_preprocessing_rawdata.py
python scripts/02_preprocessing_filtering.py
python scripts/03_splits.py

The output is stored in data, in the subfolders processed, chemicals, and taxonomy. The final challenge datasets are stored in data/processed. Intermediate files are included so that the individual processing steps can be followed.

Data generation

  1. The first script harmonizes and pre-filters the ECOTOX data, to which taxonomic and chemical data are then added. After filtering for acute mortality, the intermediate file ecotox_mortality_processed.csv is stored. Additionally, the processed individual ECOTOX tables are stored in data/processed/before_aggregation:
  • species: ecotox_species.csv
  • tests: ecotox_tests.csv
  • results: ecotox_results.csv
  • compiled chemical properties: ecotox_properties.csv
  2. In the second script, the intermediate file is further processed and filtered into ecotox_mortality_filtered.csv.

  3. In the third script, we generate the 11 challenge data subsets together with the data splits.

Challenge datasets

The ADORE dataset is split into several benchmark datasets, called challenges. For each challenge, we provide a file whose name starts with the challenge name, e.g., a-F2F, and ends with _mortality.csv. These datasets contain all the information needed for modeling as well as columns to uniquely identify and retrace each entry.

We retain the ECOTOX test and result IDs for each entry so that entries can be checked against ECOTOX. Please open an issue if you find a mistake.
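As a sketch of how the identifier columns and the modeling columns can be kept apart, here is a minimal pandas example. The column names (test_id, result_id, test_cas, conc1_mean_mol) and values are illustrative assumptions, not the exact ADORE schema; check the actual challenge files.

```python
import pandas as pd

# Synthetic stand-in for a challenge file such as a-F2F..._mortality.csv.
# Column names and values here are illustrative assumptions only.
df = pd.DataFrame({
    "test_id": [101, 102],
    "result_id": [1, 2],
    "test_cas": ["50-00-0", "71-43-2"],
    "conc1_mean_mol": [1.2e-4, 3.4e-5],
})

# Keep the ECOTOX identifiers aside for traceability; model on the rest.
id_cols = ["test_id", "result_id"]
ids = df[id_cols]
X = df.drop(columns=id_cols)
```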

Additional output data on chemicals and taxonomy

Additionally, we provide separate files in the folder chemicals.

  • The chemical ontology (classyfire_output.csv) and the functional use categories (functional_uses_output.csv) can be matched with the data in the challenge files using the InChIKey or the DTXSID, respectively.
  • Three files explain the bits of the MACCS, PubChem, and ToxPrint fingerprints (maccs_bits.tsv, pubchem_bits.csv, and toxprint_bits.csv).
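The matching described above amounts to a left join on the shared key. A minimal sketch with made-up rows, assuming an "inchikey" column name in both tables (the real column names in classyfire_output.csv may differ):

```python
import pandas as pd

# Toy challenge table and toy ontology table; values are made up.
challenge = pd.DataFrame({
    "inchikey": ["AAA", "BBB"],
    "ec50": [0.1, 0.2],
})
ontology = pd.DataFrame({
    "inchikey": ["AAA", "BBB"],
    "superclass": ["Benzenoids", "Organic acids"],
})

# Left join keeps every challenge entry, adding ontology columns by key.
merged = challenge.merge(ontology, on="inchikey", how="left")
```

The same pattern applies to functional_uses_output.csv with the DTXSID as the join key.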

In the folder taxonomy:

  • the file tax_pdm_species contains phylogenetic distance information, which can be used for modeling.
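A phylogenetic distance matrix is typically a symmetric species-by-species table; a sketch with made-up species and distances (the actual layout of tax_pdm_species may differ):

```python
import pandas as pd

# Made-up symmetric distance matrix between three species.
species = ["Danio rerio", "Oncorhynchus mykiss", "Daphnia magna"]
pdm = pd.DataFrame(
    [[0.0, 0.3, 0.9],
     [0.3, 0.0, 0.8],
     [0.9, 0.8, 0.0]],
    index=species, columns=species,
)

# Look up the distance between two species for use as a model feature.
d = pdm.loc["Danio rerio", "Daphnia magna"]
```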

D. Renku

All scripts were run with the python scripts/&lt;script-name&gt; command, which automatically tracks the workflow because the files contain Renku code snippets wherever a file is read or written. The main reading functions are wrappers around the corresponding pandas functions and can be found in src/utils.py. The reading functions rely on the Renku Input class, whereas files are written with the Output class.
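Conceptually, the wrapper pattern looks like the sketch below. The Input class here is a stand-in, since the exact Renku API and import path vary across versions; the real wrappers are in src/utils.py.

```python
import pandas as pd

# Stand-in for the Renku Input class: constructing it is what registers
# the file as a workflow input in the real API.
class Input:
    def __init__(self, name, path):
        self.name = name
        self.path = path

def read_csv_tracked(name, path, **kwargs):
    # Wrap pd.read_csv so every read goes through an Input object and
    # is therefore visible to the workflow tracker.
    return pd.read_csv(Input(name, path).path, **kwargs)
```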

See Figure 1 for a conceptual overview and Figure 2 for the corresponding Renku workflow.

Figure 1: conceptual overview of the data generation.

Figure 2: corresponding Renku workflow.

Changelog

February 13, 2025: The challenge datasets were updated to be split by canonical SMILES instead of CAS number. This ensures that the training and test splits do not share chemicals with the same canonical SMILES. The previous split by CAS number placed some chemicals with the same canonical SMILES in both the training and the test set. This update did not lead to any relevant change in modeling results.
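The guarantee described above is a group-based split: every canonical SMILES ends up entirely in either the training or the test set. A minimal sketch with made-up records and an assumed 80/20 ratio (not the actual ADORE split procedure):

```python
import random

# Toy (SMILES, entry id) records; benzene appears twice on purpose.
records = [
    ("C1=CC=CC=C1", 1), ("C1=CC=CC=C1", 2),
    ("CCO", 3), ("O=C=O", 4), ("N", 5),
]

# Split the unique SMILES, not the rows, so duplicates stay together.
smiles = sorted({s for s, _ in records})
rng = random.Random(42)
rng.shuffle(smiles)
n_test = max(1, int(0.2 * len(smiles)))
test_smiles = set(smiles[:n_test])

train = [r for r in records if r[0] not in test_smiles]
test = [r for r in records if r[0] in test_smiles]
```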

Licence

This work is licensed under a Creative Commons Attribution 4.0 International License.
