Hello!
Welcome to the official repository of 'Explainable Bidirectional Long Short-Term Memory Networks Learn Chemistry from SMILES for Predicting Toxicity of Androgen and Estrogen Receptor Targeting Chemicals'!
This project provides a descriptor-free deep learning framework based on a Bidirectional LSTM (BiLSTM) architecture to predict endocrine toxicity toward the androgen receptor (AR) and estrogen receptor (ER) directly from SMILES strings.
The model performs binary classification (toxic / non-toxic) and integrates explainable artificial intelligence (XAI) through Captum, enabling token-level and substructure-level interpretation of SMILES.
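To give a flavor of what token-level interpretation produces, here is a minimal, hypothetical post-processing sketch (not the repository's code): once an XAI method such as Captum has assigned one attribution score per SMILES token, scores can be summed over token groups to obtain substructure-level importances. The tokens, scores, and group names below are invented for illustration.

```python
# Hypothetical post-processing of per-token attribution scores into
# substructure-level importances. All values here are illustrative.

def aggregate_attributions(tokens, scores, groups):
    """Sum per-token attribution scores over named substructure groups.

    `groups` maps a substructure label to the token indices it covers.
    """
    return {label: sum(scores[i] for i in idx) for label, idx in groups.items()}

tokens = ["c", "1", "c", "c", "c", "c", "c", "1", "O"]   # roughly phenol
scores = [0.1, 0.0, 0.1, 0.2, 0.1, 0.1, 0.1, 0.0, 0.4]  # invented attributions

importance = aggregate_attributions(
    tokens, scores,
    {"aromatic ring": list(range(8)), "hydroxyl": [8]},
)
# e.g. importance["hydroxyl"] is the summed score of the oxygen token
```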
To reproduce our experiments, you first need to download this codebase.
You can either click the green button in the top-right corner of this page to download the codebase as a zip file, or, if you have Git installed, clone the repository with the following commands:
1. Clone the repository:

```bash
git clone https://
```

2. Set up the Python environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
| Parameter | Description |
|---|---|
| `hidden_size` | Hidden dimension of each LSTM layer. |
| `num_layers` | Number of BiLSTM layers in the model architecture. |
| `dense_size` | Dimension of the dense layer placed between the BiLSTM output and the final classifier. |
| `batch_size` | Number of training samples processed simultaneously during one training step. |
| `lr` | Learning rate used by the optimizer to update model weights. |
| `dropout` | Dropout rate for regularization to prevent overfitting. |
| `dir_experiments` | Directory where trained models, logs, and metrics are saved. |
| `dir_data` | Directory containing preprocessed datasets. |
| `n_splits` | Number of hold-out iterations used for resampling-based evaluation. |
| `random_state` | Random seed ensuring reproducibility of splits and model initialization. |
| `test_size` | Fraction of data allocated to the test set during dataset initialization. |
| `endpoint` | Toxicity endpoint to predict: `andro` or `estro`. |
| `aug` | Number of SMILES augmentations per molecule. |
| `max_epochs` | Maximum number of training epochs. |
These parameters are used as arguments in the Python scripts for training and prediction.

### Scripts

1. Before training models, datasets for each endpoint must be initialized once. Dataset initialization includes:
   - Loading SMILES and labels.
   - Tokenization.
   - Optional SMILES augmentation.
   - Train/test split creation.
   - Saving processed tensors.
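The tokenization step can be sketched with a common regex-based SMILES tokenizer. This is an assumption for illustration only: the actual tokenizer in `init_dataset.py` may differ in its token set and handling of edge cases.

```python
import re

# A common regex-based SMILES tokenizer (an illustrative assumption; the
# repository's actual tokenization in init_dataset.py may differ).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

tokenize_smiles("CCO")          # ['C', 'C', 'O']
tokenize_smiles("c1ccccc1Br")   # ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
```

Bracket atoms such as `[nH]` are kept as single tokens, and two-letter elements (`Br`, `Cl`) are matched before their one-letter prefixes so they are not split.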
```bash
python init_dataset.py --endpoint andro --aug 5
python init_dataset.py --endpoint estro --aug 5
```

2. Train the models:

```bash
python train.py
```

To modify hyperparameters from the terminal, list the available options first:

```bash
python train.py --help
```
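As a rough sketch of how such a script could expose the hyperparameters from the table above, the following uses `argparse`. The flag names mirror the parameter table, but every default value here is invented for illustration and is not taken from the repository.

```python
# Hypothetical argparse setup mirroring the parameter table; all defaults
# below are invented placeholders, not the repository's actual values.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Train the BiLSTM toxicity model.")
    p.add_argument("--hidden_size", type=int, default=128)
    p.add_argument("--num_layers", type=int, default=2)
    p.add_argument("--dense_size", type=int, default=64)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--dropout", type=float, default=0.2)
    p.add_argument("--endpoint", choices=["andro", "estro"], default="andro")
    p.add_argument("--max_epochs", type=int, default=100)
    return p

# Parsing an explicit argument list, as the command line would supply it:
args = build_parser().parse_args(["--endpoint", "estro", "--lr", "0.0005"])
```

With this pattern, `python train.py --help` prints each flag with its default, which is how the hyperparameters can be inspected and overridden from the terminal.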