Descriptor-free BiLSTM model for Endocrine Toxicity Prediction

Hello!
Welcome to the official repository of 'Explainable Bidirectional Long Short-Term Memory Networks Learn Chemistry from SMILES for Predicting Toxicity of Androgen and Estrogen Receptor Targeting Chemicals'!
This project provides a descriptor-free deep learning framework based on a Bidirectional LSTM (BiLSTM) architecture to predict endocrine toxicity toward the androgen receptor (AR) and estrogen receptor (ER) directly from SMILES strings.

The model performs binary classification (toxic / non-toxic) and integrates explainable artificial intelligence (XAI) through Captum, enabling token-level and substructure-level interpretation of SMILES.
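
The architecture described above can be sketched in a few lines of PyTorch. This is an illustrative outline only, not the authors' exact implementation: the class name `SmilesBiLSTM`, the embedding layer, the mean-pooling over tokens, and all default values are assumptions; see `src/` and `train.py` for the real model.

```python
import torch
import torch.nn as nn

class SmilesBiLSTM(nn.Module):
    """Hypothetical descriptor-free BiLSTM over SMILES token ids."""

    def __init__(self, vocab_size, embed_dim=64, hidden_size=128,
                 num_layers=2, dense_size=64, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                              batch_first=True, bidirectional=True,
                              dropout=dropout)
        # Dense layer between the BiLSTM output and the final classifier,
        # matching the role of dense_size in the parameter table below.
        self.dense = nn.Linear(2 * hidden_size, dense_size)
        self.classifier = nn.Linear(dense_size, 1)  # binary: toxic / non-toxic
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        out, _ = self.bilstm(x)                # (batch, seq_len, 2 * hidden)
        pooled = out.mean(dim=1)               # pool over the token axis
        h = torch.relu(self.dense(self.dropout(pooled)))
        return self.classifier(h).squeeze(-1)  # one logit per molecule

model = SmilesBiLSTM(vocab_size=40)
logits = model(torch.randint(1, 40, (8, 30)))  # batch of 8 token sequences
```

A model structured like this exposes per-token embeddings, which is what makes token-level attribution with Captum (e.g. integrated gradients on the embedding layer) possible.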

Project Structure

  • data/ – Raw dataset and augmented database.
  • src/ – Main source code of the project.
  • requirements.txt – List of required Python dependencies for setting up the environment.
  • experiments/ – Outputs from model training and prediction. Folder names follow the convention {key}={value}-{key}={value}-...
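
Under the stated `{key}={value}-{key}={value}-...` convention, an experiment folder name can be turned back into a parameter dictionary with a one-liner. This helper is a sketch (not part of the repository) and assumes that values themselves contain no hyphens:

```python
def parse_experiment_dir(name: str) -> dict:
    """Parse an experiment folder name like 'endpoint=andro-aug=5'
    into {'endpoint': 'andro', 'aug': '5'}.

    Assumes the {key}={value}-... convention with hyphen-free values;
    values such as '1e-3' would need a smarter split.
    """
    return dict(pair.split("=", 1) for pair in name.split("-"))

params = parse_experiment_dir("endpoint=andro-aug=5")
```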

Installation and Reproducibility

To reproduce our experiments, you first need to download this codebase. You can either download it as a zip file from the repository page or, if you have git installed, clone it:

1. Clone the repository:

```shell
git clone https://
```

2. Set up the Python environment:

```shell
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Model parameters

| Parameter | Description |
| --- | --- |
| hidden_size | Hidden dimension of each LSTM layer. |
| num_layers | Number of BiLSTM layers in the model architecture. |
| dense_size | Dimension of the dense layer placed between the BiLSTM output and the final classifier. |
| batch_size | Number of training samples processed simultaneously during one training step. |
| lr | Learning rate used by the optimizer to update model weights. |
| dropout | Dropout rate for regularization to prevent overfitting. |
| dir_experiments | Directory where trained models, logs, and metrics are saved. |
| dir_data | Directory containing preprocessed datasets. |
| n_splits | Number of hold-out iterations used for resampling-based evaluation. |
| random_state | Random seed ensuring reproducibility of splits and model initialization. |
| test_size | Fraction of data allocated to the test set during dataset initialization. |
| endpoint | Toxicity endpoint to predict: andro or estro. |
| aug | Number of SMILES augmentations per molecule. |
| max_epochs | Maximum number of training epochs. |

These parameters are used as arguments in the Python scripts for training and prediction.
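
A command-line interface exposing parameters like those above is typically built with `argparse`. The sketch below shows the idea for a subset of the table; the defaults are illustrative assumptions, not the values used in `train.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical mirror of part of the parameter table; defaults are
    # illustrative, not the repository's actual defaults.
    p = argparse.ArgumentParser(description="Train the BiLSTM toxicity model")
    p.add_argument("--hidden_size", type=int, default=128)
    p.add_argument("--num_layers", type=int, default=2)
    p.add_argument("--dense_size", type=int, default=64)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--dropout", type=float, default=0.2)
    p.add_argument("--endpoint", choices=["andro", "estro"], default="andro")
    p.add_argument("--aug", type=int, default=5)
    p.add_argument("--max_epochs", type=int, default=100)
    return p

# Equivalent to: python train.py --endpoint estro --lr 0.0005
args = build_parser().parse_args(["--endpoint", "estro", "--lr", "0.0005"])
```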
Scripts
1. Before training models, datasets for each endpoint must be initialized once. Dataset initialization includes:

- Loading SMILES and labels.
- Tokenization.
- Optional SMILES augmentation.
- Train/test split creation.
- Saving processed tensors.

```shell
python init_dataset.py --endpoint andro --aug 5
python init_dataset.py --endpoint estro --aug 5
```
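
The tokenization step is commonly done with a regex over SMILES syntax. The snippet below is a sketch of that idea, not the tokenizer actually used in `init_dataset.py`; multi-character atoms such as Cl and Br are kept as single tokens, and bracketed atoms like [nH] are matched whole:

```python
import re

# Common regex-based SMILES tokenizer (an assumption, not the repository's
# exact tokenization). Bracket atoms, two-letter atoms, and aromatic
# lowercase atoms are matched before single symbols and digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|@@|[BCNOPSFIbcnops]|[=#$/\\%+\-()0-9])"
)

def tokenize(smiles: str) -> list:
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, f"untokenized characters in {smiles!r}"
    return tokens

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Token-level XAI attributions (e.g. via Captum) are then reported per element of this token list rather than per character, so Cl gets one score instead of two.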

2. To run training with default settings:

```shell
python train.py
```

To list and modify hyperparameters from the terminal:

```shell
python train.py --help
```

Notebooks

Citation

License