Descriptor-free BiLSTM model for Endocrine Toxicity Prediction

Hello!
Welcome to the official repository of 'Explainable Bidirectional Long Short-Term Memory Networks Learn Chemistry from SMILES for Predicting Toxicity of Androgen and Estrogen Receptor Targeting Chemicals'!
This project provides a descriptor-free deep learning framework based on a Bidirectional LSTM (BiLSTM) architecture to predict endocrine toxicity toward the androgen receptor (AR) and estrogen receptor (ER) directly from SMILES strings.

The model performs binary classification (toxic / non-toxic) and integrates explainable artificial intelligence (XAI) through Captum, enabling token-level and substructure-level interpretation of SMILES.
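
The architecture described above can be sketched in a few lines of PyTorch. This is an illustrative outline only, not the authors' exact implementation: the class name `SmilesBiLSTM`, the embedding layer, the mean-pooling over tokens, and all default values are assumptions; see `src/` and `train.py` for the real model.

```python
import torch
import torch.nn as nn

class SmilesBiLSTM(nn.Module):
    """Hypothetical descriptor-free BiLSTM over SMILES token ids."""

    def __init__(self, vocab_size, embed_dim=64, hidden_size=128,
                 num_layers=2, dense_size=64, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                              batch_first=True, bidirectional=True,
                              dropout=dropout)
        # Dense layer between the BiLSTM output and the final classifier,
        # matching the role of dense_size in the parameter table below.
        self.dense = nn.Linear(2 * hidden_size, dense_size)
        self.classifier = nn.Linear(dense_size, 1)  # binary: toxic / non-toxic
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        out, _ = self.bilstm(x)                # (batch, seq_len, 2 * hidden)
        pooled = out.mean(dim=1)               # pool over the token axis
        h = torch.relu(self.dense(self.dropout(pooled)))
        return self.classifier(h).squeeze(-1)  # one logit per molecule

model = SmilesBiLSTM(vocab_size=40)
logits = model(torch.randint(1, 40, (8, 30)))  # batch of 8 token sequences
```

A model structured like this exposes per-token embeddings, which is what makes token-level attribution with Captum (e.g. integrated gradients on the embedding layer) possible.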

Project Structure

  • data/ – Raw dataset and augmented database.
  • src/ – Main source code of the project.
  • requirements.txt – List of required Python dependencies for setting up the environment.
  • experiments/ – Outputs from model training and prediction. Folder names follow the convention {key}={value}-{key}={value}-...
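
Under the stated `{key}={value}-{key}={value}-...` convention, an experiment folder name can be turned back into a parameter dictionary with a one-liner. This helper is a sketch (not part of the repository) and assumes that values themselves contain no hyphens:

```python
def parse_experiment_dir(name: str) -> dict:
    """Parse an experiment folder name like 'endpoint=andro-aug=5'
    into {'endpoint': 'andro', 'aug': '5'}.

    Assumes the {key}={value}-... convention with hyphen-free values;
    values such as '1e-3' would need a smarter split.
    """
    return dict(pair.split("=", 1) for pair in name.split("-"))

params = parse_experiment_dir("endpoint=andro-aug=5")
```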

Installation and Reproducibility

To reproduce our experiments, you first need to download this codebase. You can either download it as a zip file from the repository page or, if you have git installed, clone it:

1. Clone the repository:

```shell
git clone https://
```

2. Set up the Python environment:

```shell
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Model parameters

| Parameter | Description |
| --- | --- |
| hidden_size | Hidden dimension of each LSTM layer. |
| num_layers | Number of BiLSTM layers in the model architecture. |
| dense_size | Dimension of the dense layer placed between the BiLSTM output and the final classifier. |
| batch_size | Number of training samples processed simultaneously during one training step. |
| lr | Learning rate used by the optimizer to update model weights. |
| dropout | Dropout rate for regularization to prevent overfitting. |
| dir_experiments | Directory where trained models, logs, and metrics are saved. |
| dir_data | Directory containing preprocessed datasets. |
| n_splits | Number of hold-out iterations used for resampling-based evaluation. |
| random_state | Random seed ensuring reproducibility of splits and model initialization. |
| test_size | Fraction of data allocated to the test set during dataset initialization. |
| endpoint | Toxicity endpoint to predict: andro or estro. |
| aug | Number of SMILES augmentations per molecule. |
| max_epochs | Maximum number of training epochs. |

These parameters are used as arguments in the Python scripts for training and prediction.
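
A command-line interface exposing parameters like those above is typically built with `argparse`. The sketch below shows the idea for a subset of the table; the defaults are illustrative assumptions, not the values used in `train.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical mirror of part of the parameter table; defaults are
    # illustrative, not the repository's actual defaults.
    p = argparse.ArgumentParser(description="Train the BiLSTM toxicity model")
    p.add_argument("--hidden_size", type=int, default=128)
    p.add_argument("--num_layers", type=int, default=2)
    p.add_argument("--dense_size", type=int, default=64)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--dropout", type=float, default=0.2)
    p.add_argument("--endpoint", choices=["andro", "estro"], default="andro")
    p.add_argument("--aug", type=int, default=5)
    p.add_argument("--max_epochs", type=int, default=100)
    return p

# Equivalent to: python train.py --endpoint estro --lr 0.0005
args = build_parser().parse_args(["--endpoint", "estro", "--lr", "0.0005"])
```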
Scripts
1. Before training models, datasets for each endpoint must be initialized once. Dataset initialization includes:

- Loading SMILES and labels.
- Tokenization.
- Optional SMILES augmentation.
- Train/test split creation.
- Saving processed tensors.

```shell
python init_dataset.py --endpoint andro --aug 5
python init_dataset.py --endpoint estro --aug 5
```
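
The tokenization step is commonly done with a regex over SMILES syntax. The snippet below is a sketch of that idea, not the tokenizer actually used in `init_dataset.py`; multi-character atoms such as Cl and Br are kept as single tokens, and bracketed atoms like [nH] are matched whole:

```python
import re

# Common regex-based SMILES tokenizer (an assumption, not the repository's
# exact tokenization). Bracket atoms, two-letter atoms, and aromatic
# lowercase atoms are matched before single symbols and digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|@@|[BCNOPSFIbcnops]|[=#$/\\%+\-()0-9])"
)

def tokenize(smiles: str) -> list:
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, f"untokenized characters in {smiles!r}"
    return tokens

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Token-level XAI attributions (e.g. via Captum) are then reported per element of this token list rather than per character, so Cl gets one score instead of two.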

2. To run training with default settings:

```shell
python train.py
```

To list and modify hyperparameters from the terminal:

```shell
python train.py --help
```

Notebooks

Citation

License