Derek van Tilborg, Luke Rossen, Francesca Grisoni
Corresponding author: f.grisoni@tue.nl
## Abstract
Molecular machine learning models often fail to generalize beyond the chemical space of their training data, limiting their ability to make reliable predictions for structurally novel bioactive molecules. To advance the ability of machine learning models to go beyond the “edge” of their training chemical space, we introduce a joint modelling approach that combines molecular property prediction with molecular reconstruction. This approach allows the introduction of unfamiliarity, a novel reconstruction‑based metric that enables the estimation of model generalizability. Via a systematic analysis spanning more than 30 bioactivity datasets, we demonstrate that unfamiliarity not only effectively identifies out‑of‑distribution molecules but also serves as a reliable predictor of classifier performance. Even when faced with strong distribution shifts on large‑scale molecular libraries, unfamiliarity yields robust and meaningful molecular insights that go unnoticed by traditional methods. Finally, we experimentally validate unfamiliarity‑based molecule screening in the wet lab for two clinically relevant kinases, discovering seven compounds with low micromolar potency and limited similarity to training molecules. This demonstrates that unfamiliarity can extend the reach of machine learning beyond the edge of the charted chemical space, advancing the discovery of diverse and structurally novel molecules.
Figure 1. The architecture of the Joint Molecular Model (JMM) estimates how “unfamiliar” a molecule is to the model through its reconstruction loss.
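The sketch below makes this concrete: it scores a batch of molecules by their mean per-token reconstruction loss, with higher values meaning less familiar chemistry. It is a minimal illustration assuming a generic sequence-to-sequence SMILES model; `model` and the integer-encoded `token_ids` are placeholders, not the exact JMM implementation.

```python
import torch
import torch.nn.functional as F

def unfamiliarity_scores(model: torch.nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    """Score molecules by how poorly the model reconstructs their SMILES.

    token_ids: (batch, seq_len) integer-encoded SMILES strings.
    Returns one scalar per molecule: the mean per-token reconstruction
    loss, so higher values indicate less familiar chemistry.
    """
    model.eval()
    with torch.no_grad():
        logits = model(token_ids)              # (batch, seq_len, vocab_size)
        per_token_loss = F.cross_entropy(
            logits.transpose(1, 2),            # (batch, vocab_size, seq_len)
            token_ids,
            reduction="none",                  # keep one loss per token
        )
    return per_token_loss.mean(dim=1)          # average over the sequence
```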
## Installation

The JMM codebase has been tested on macOS 15.1.1 and recent Linux distributions. It requires Python 3.9 or newer and a handful of scientific libraries. The easiest way to install the correct versions is via the supplied `env.yml` conda environment. The core packages used in our experiments are listed below:
- Python ≥3.9 (the conda environment pins Python 3.12.4)
- PyTorch 2.3.0 (with optional CUDA support for GPU training)
- RDKit 2024.3.3 (for SMILES parsing and descriptor generation)
- Scikit‑learn 1.5.1 (classical ML baselines)
- XGBoost 2.1.0 (tree‑based baselines)
- Pandas 2.2.2 and NumPy 1.26.4 (data handling and numerics)
- Clone the repository:

  ```bash
  git clone https://github.com/molML/JointMolecularModel.git
  cd JointMolecularModel
  ```

- Create and activate the conda environment. The `env.yml` file specifies all dependencies. Creating the environment typically takes 5–10 minutes on a normal desktop machine.

  ```bash
  conda env create -f env.yml
  conda activate jointmolecularmodel
  ```

- (Optional) Verify your installation by importing the main libraries:

  ```bash
  python -c "import torch, pandas, rdkit; print(torch.__version__, pandas.__version__)"
  ```

  If you intend to train on a GPU, ensure that `torch.cuda.is_available()` returns `True`.
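For a more thorough check than the one-liner above, the snippet below (our own addition, not part of the repository) prints the versions of the core packages and confirms that the GPU is visible:

```python
# Quick environment sanity check; run inside the activated conda environment.
import numpy
import pandas
import sklearn
import torch
import xgboost
from rdkit import Chem, rdBase

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("scikit-learn:", sklearn.__version__, "| XGBoost:", xgboost.__version__)
print("pandas:", pandas.__version__, "| NumPy:", numpy.__version__)
print("RDKit:", rdBase.rdkitVersion)

# RDKit smoke test: parse a SMILES string and canonicalise it.
print("RDKit round-trip:", Chem.MolToSmiles(Chem.MolFromSmiles("c1ccccc1O")))
```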
## Repository structure

This repository is organised as follows:

- `data/` – raw and processed datasets used in the study
- `cheminformatics/` – scripts for generating the starting dataset
- `experiments/` – Python scripts that implement each stage of the pipeline
- `jcm/` – deep‑learning model definitions and utilities
- `results/` – folders where trained models and logs are written
- `plots/` – scripts to reproduce figures from the paper
## How to cite

You can cite our pre‑print as follows:
van Tilborg et al. (2025). Molecular deep learning at the edge of chemical space. ChemRxiv. DOI: 10.26434/chemrxiv-2025-qj4k3
## License

This codebase is released under the MIT license. When using pre‑trained models or third‑party libraries, please adhere to their respective licenses.
## System requirements

The Joint Molecular Model is research code rather than a general‑purpose software package. It has been tested on macOS 15.1.1 and recent Linux distributions, and requires Python 3.9 or newer. A CUDA‑enabled GPU with at least 8 GB of memory is recommended for training the deep‑learning models. CPU‑only training is possible but considerably slower.
| Component | Tested/required versions | Notes |
|---|---|---|
| Python | ≥3.9 (conda env pins 3.12.4) | Use the supplied env.yml for reproducibility |
| PyTorch | 2.3.0 | Optional CUDA build for GPU support |
| RDKit | 2024.3.3 | Required for SMILES parsing and molecular descriptors |
| Scikit‑learn | 1.5.1 | For classical ML models (RF/MLP baselines) |
| XGBoost | 2.1.0 | Used for tree‑based baselines |
| Pandas | 2.2.2 | Tabular data handling |
| NumPy | 1.26.4 | Numerics |
| Operating system | macOS 15.1.1; Linux | The code has been tested on macOS and Linux |
For full‑scale training on the ChEMBL dataset, we recommend a CUDA‑enabled GPU (e.g., NVIDIA RTX series with ≥8 GB memory) and at least 32 GB of system RAM. Smaller datasets can be processed on a normal desktop, but training will be slower.
## Getting started

The typical workflow for setting up this codebase is as follows:
- Clone the repository and create the conda environment as described in the Installation section above.
- Prepare data. The repository includes cleaned and split datasets under the `data/` folder. There is no automated interface for loading custom datasets; instead, the pipeline reads fixed file names from `data/` and assumes specific folder structures. If you intend to reproduce the paper’s results, no additional data preparation is required.
- Run the pipeline scripts in sequence. Each script in the `experiments/` directory implements a step of the workflow: data cleaning, splitting, model pre‑training, baseline training, joint training, and inference. See the Demo section below for an overview.
- (Optional) Verify your environment by running a short Python script or by training a small model. Installation typically takes only a few minutes; most of the computation time is spent during training.
## Demo

This repository does not provide a single command‑line entry point or a user‑friendly demo for arbitrary datasets. Instead, the scripts in the `experiments/` directory were written to replicate the experiments in our pre‑print. To reproduce those results, run the scripts in order using the included configuration files and datasets. The key steps are:
- Data preparation – run `experiments/0_clean_data.py`, `experiments/1_filter_chembl.py`, and `experiments/2.0_split_data.py` to clean the raw data, filter ChEMBL, and generate the train/test/OOD splits.
- Model pre‑training and baselines – run `experiments/3.1_ae_pretraining.py` to pre‑train the SMILES auto‑encoder, then run `experiments/4.2_ecfp_mlp.py` and `experiments/4.3_smiles_mlp.py` to train the baseline classifiers. These scripts read hyper‑parameters from YAML files under `experiments/hyperparams/` (see the sketch after this list).
- Joint model training – run `experiments/4.4_jmm.py` to train the joint model using the pre‑trained components.
- Inference – run `experiments/5.2_inference_jmm.py` to compute predictions and unfamiliarity scores on the train/test/OOD splits. The outputs (CSV files and logs) are written to the `results/` directory.
These scripts are designed for our specific datasets and do not accept arbitrary command‑line arguments. Attempting to run them on your own data without modifying the code will likely result in errors. Running the full pipeline on ChEMBL can take several hours on a single GPU; smaller datasets finish more quickly. The final outputs are CSV files containing the original SMILES, reconstructed SMILES and their edit distances, predicted labels, uncertainty measures and the unfamiliarity metric.
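Because the scripts take no command‑line arguments, reproducing the pipeline amounts to running them in the order listed above. A minimal driver, our own sketch built from the script names in this README, could look like:

```python
import subprocess

# Pipeline stages in the order described above.
steps = [
    "experiments/0_clean_data.py",
    "experiments/1_filter_chembl.py",
    "experiments/2.0_split_data.py",
    "experiments/3.1_ae_pretraining.py",
    "experiments/4.2_ecfp_mlp.py",
    "experiments/4.3_smiles_mlp.py",
    "experiments/4.4_jmm.py",
    "experiments/5.2_inference_jmm.py",
]

for script in steps:
    print(f"--- running {script} ---")
    subprocess.run(["python", script], check=True)  # stop at the first failure
```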
## Instructions for use

Because this codebase was developed for a single research project, there is no general API for applying the JMM to arbitrary datasets. All scripts assume that data live in specific folders under specific file names. If you wish to experiment with other datasets or change hyper‑parameters, you will need to modify the source code and configuration files directly; there are no ready‑made command‑line options for specifying a dataset.
## Reproduction instructions

To reproduce the experiments reported in the pre‑print:
- Install dependencies using the provided `env.yml` file.
- Run the data preparation scripts (`0_clean_data.py`, `1_filter_chembl.py`, `2.0_split_data.py`) on the included datasets (Lit‑PCBA, MoleculeACE, Ames mutagenicity, and ChEMBL). These scripts clean the SMILES and create the train/test/OOD splits used in our study.
- Run the training scripts (`3.1_ae_pretraining.py`, `4.2_ecfp_mlp.py`, `4.3_smiles_mlp.py`, `4.4_jmm.py`) in sequence. Each script reads hyper‑parameters from YAML files under `experiments/hyperparams/` and writes its outputs to `results/`.
- Evaluate models using the inference scripts (`5.2_inference_jmm.py` and related files). These generate predictions and unfamiliarity scores and save them as CSV files.
- Analyse results. Use the notebooks and plotting scripts under `results/` and `plots/` to reproduce the figures and analyses in the paper (a small example of inspecting the inference CSVs follows this list).
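As a starting point for such an analysis, the sketch below loads an inference CSV, recomputes the SMILES edit distances, and ranks molecules by unfamiliarity. The file name and column names are assumptions for illustration; check the headers actually written by the inference scripts.

```python
import pandas as pd

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

df = pd.read_csv("results/example_inference.csv")  # hypothetical output file

# Recompute the edit distance between input and reconstructed SMILES.
df["edit_distance_check"] = [
    levenshtein(s, r) for s, r in zip(df["smiles"], df["reconstruction"])
]

# High unfamiliarity = poor reconstruction = treat the prediction with caution.
print(df.sort_values("unfamiliarity", ascending=False).head(10))
```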
By following this workflow and using the same random seeds and hyper‑parameters reported in the configuration files, you should be able to replicate the main findings of our work: unfamiliarity correlates with classifier performance across datasets and helps prioritise structurally novel yet active molecules.