CYPstrate
Predicting substrates in the CYP-mediated phase I metabolism of xenobiotics

About CYPstrate

CYPstrate consists of a collection of machine learning classifiers (random forest and support vector machines) for the prediction of substrates and non-substrates of the nine most important human CYP isozymes in the metabolism of xenobiotics (i.e. CYPs 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1 and 3A4). The models are trained on a high-quality data set of 1831 substrates and non-substrates compiled from public sources.

Two distinct prediction modes are available to cover different use cases (see below). Computation is currently limited to 500,000 compounds per query, which takes approximately 14 hours to calculate.

For more details, see [manuscript to be available soon].

Usage of the module:

Input

Molecules

You can provide a single molecule either by entering a SMILES string or by drawing it in the JSME Molecule Editor. Multiple input molecules can be provided by uploading a file (.smi or .sdf) with up to 500,000 molecules.

The .smi file must contain one SMILES string per row. Additional information for a molecule must be separated from the SMILES by at least one whitespace character (including tabs). At the moment, an .sdf file is parsed correctly only if every entry is valid according to the .sdf specification; otherwise, no predictions will be carried out.
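The .smi layout described above can be checked locally before uploading. The following is a minimal sketch, not the parser used by CYPstrate: the function name parse_smi is hypothetical, and skipping empty rows is an assumption made here for convenience.

```python
def parse_smi(text):
    """Parse .smi content: one SMILES per row, optional extra info after whitespace."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: empty rows are ignored
        # split(None, 1) splits on the first run of spaces or tabs only
        parts = line.split(None, 1)
        smiles = parts[0]
        info = parts[1] if len(parts) > 1 else ""
        records.append((smiles, info))
    return records


example = "CCO ethanol\nc1ccccc1\tbenzene ring\n\nCC(=O)O"
records = parse_smi(example)
for smiles, info in records:
    print(smiles, "|", info)
```

Everything after the first whitespace run is kept as one string, so additional information may itself contain spaces.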

Prediction modes

CYPstrate offers two prediction modes:

In best performance mode, several models are combined for each CYP isozyme by a hard voting strategy (majority voting). This approach yields maximum accuracy, but some compounds of interest may not be covered.

In full coverage mode, the tool uses one classification model per CYP and guarantees full coverage of the input space for all molecules that are successfully preprocessed by the tool. These models still achieve high prediction performance, although lower than that of the models used in best performance mode.
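Hard voting in best performance mode can be illustrated with a short sketch. This page does not specify how a compound ends up without a prediction; the code below assumes, purely for illustration, that a vote without a strict majority yields no result. The function name majority_vote and the label strings are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label backed by a strict majority of the votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: without a strict majority the compound is left unpredicted.
    return label if count > len(votes) / 2 else None


clear = majority_vote(["substrate", "substrate", "non-substrate"])
tied = majority_vote(["substrate", "non-substrate"])
print(clear, tied)
```

A full coverage classifier, by contrast, corresponds to a single model whose output is always returned.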

Molecular similarity

To help estimate the reliability of a prediction result, the checkbox "Calculate molecular similarity" can be checked. For every molecule and CYP isozyme, the molecular similarity to the nearest neighbor in the respective training set of each model will be calculated.
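The output table defines molecular similarity as the Tanimoto coefficient on MACCS keys. The sketch below shows how a nearest-neighbor similarity of this kind can be computed. It is an illustration only: fingerprints are represented as plain sets of on-bit indices with made-up values rather than real MACCS keys, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def nearest_neighbor_similarity(query, training_set):
    """Highest Tanimoto coefficient between the query and any training fingerprint."""
    return max(tanimoto(query, fp) for fp in training_set)


# Toy fingerprints (not real MACCS keys)
training = [{1, 5, 9, 42}, {2, 5, 9}, {7, 8}]
query = {5, 9, 42}
similarity = nearest_neighbor_similarity(query, training)
print(similarity)
```

A value close to 1 means the query closely resembles at least one training compound; a low value suggests the prediction should be treated with more caution.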

Starting the calculation

Pressing the submit button starts the calculation if a valid input has been provided; otherwise, an error message is shown. You will then be forwarded to the results page, where a permanent link is shown that you can use to obtain your results once the calculation is done.

Output

The results of the calculation will be displayed as a color-coded table in which results considered negative are marked in red and positive results in green. Users can download the results as a .csv file for later use. Results will be deleted permanently after 60 days or as soon as the user presses the "Delete results" button on the results page.
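A downloaded .csv file can be summarized with a few lines of Python. This is a sketch under stated assumptions: the exact header names and label strings of the real file are not given on this page, so the sample data, the "CYP" column prefix and the function name summarize are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize(csv_text):
    """Count the predicted labels in every column whose name starts with 'CYP'."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column and column.startswith("CYP"):
                counts.setdefault(column, Counter())[value] += 1
    return counts


# Hypothetical excerpt of a results file
sample = (
    "Name,Input SMILES,CYP1A2,CYP3A4\n"
    "0,CCO,non-substrate,substrate\n"
    "1,c1ccccc1,substrate,substrate\n"
)
summary = summarize(sample)
for column, labels in summary.items():
    print(column, dict(labels))
```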

Explanation of the output columns of the results

Table 1. Explanation of the most important output columns.
Column name | Description
Name | Unique integer assigned to each submitted molecule.
Input SMILES | SMILES as submitted by the user.
Filtered SMILES | SMILES after preprocessing. The feature sets are calculated for this representation.
2D structure | 2D structure of the preprocessed molecule.
CYP<X> | Classification as substrate, non-substrate or non-predictable (see best performance mode) for the isozyme <X>, where <X> follows the three-part naming convention indicating family, subfamily and member number of the isozyme (e.g. 1A2).
Mol. sim to training set | Always occurs after a CYP<X> column. The molecular similarity to the nearest neighbor in the training set used to train the respective model. Molecular similarity is defined as the Tanimoto coefficient based on MACCS keys.
Error/Warnings | Code for any errors or warnings thrown during the preprocessing procedure. See Table 2 for detailed explanations.

Explanation of preprocessing error codes

All molecules that are provided as input are preprocessed before the input features for the classification models are calculated. During the preprocessing procedure, several errors can occur that lead to the exclusion of the molecule. In some cases, a warning is produced instead, indicating that predictions can be made but may be imprecise. The returned error and warning codes are explained in the following table:

Table 2. Detailed explanation of errors and warnings.
Code | Error message or warning
!1 | Invalid or empty input. No output was produced. Further information may be provided by additional messages.
E1 | Molecule contains elements that were not considered by the training set; the molecule is not valid and is excluded.
V1 | Molecule cannot be converted from SMILES to an RDKit mol and back; the molecule is not valid and is excluded.
CS | Molecule contains elements or patterns that cannot be standardized by the chembl_structure_pipeline. Molecule was not standardized before prediction; results might be imprecise.
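
When post-processing results, the codes in Table 2 can be mapped to their meaning with a small lookup. The sketch below is illustrative: the dictionary and the function name has_prediction are hypothetical, and it assumes that the Error/Warnings column holds a single code or is empty when nothing went wrong.

```python
# Codes and meanings taken from Table 2; the error/warning split follows
# whether the table says the molecule is excluded or still predicted.
PREPROCESSING_CODES = {
    "!1": ("error", "Invalid or empty input; no output was produced."),
    "E1": ("error", "Elements not considered by the training set; molecule excluded."),
    "V1": ("error", "SMILES could not be converted to an RDKit mol and back; molecule excluded."),
    "CS": ("warning", "Could not be standardized; results might be imprecise."),
}


def has_prediction(code):
    """True if a molecule with this code (or no code) still receives a prediction."""
    if not code:
        return True  # assumption: an empty field means preprocessing succeeded
    kind, _message = PREPROCESSING_CODES[code]
    return kind == "warning"


for code in ("", "CS", "E1"):
    print(repr(code), has_prediction(code))
```
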
Citing CYPstrate

Holmer, H.; de Bruyn Kops, C.; Stork, C.; Kirchmair, J. CYPstrate: A set of machine learning models for the accurate classification of cytochrome P450 enzyme substrates and non-substrates. In revision at Molecules, 2021.
DOI to be announced

Stork, C.; Embruch, G.; Šícho, M.; de Bruyn Kops, C.; Chen, Y.; Svozil, D.; Kirchmair, J. NERDD: a web portal providing access to in silico tools for drug discovery. Bioinformatics 2020.
doi: 10.1093/bioinformatics/btz695

Problems?

To report a problem or give feedback, please go to the Feedback & Support page or contact us directly: