Documentation of Hit Dexter 2.0

Hit Dexter and Hit Dexter 2.0 versions

Because the data preparation workflow was updated and improved during the development of Hit Dexter 2.0, only the data sets produced by the updated workflow are used. This means that Hit Dexter in its original form no longer exists.

Furthermore, the models accessible via the web server are not the models published in the Hit Dexter 2.0 paper. They were built with the procedure described in the paper, but with updated versions of scikit-learn[1] (now: 0.20.1; old: 0.19.1), imbalanced-learn[2] (now: 0.4.3; old: 0.3.1) and RDKit[3] (now: 2018.09.1; old: 2017.09.3).

How to cite Hit Dexter 2.0

Publications that include results obtained with Hit Dexter 2.0 should cite the respective publication (doi.org/10.1021/acs.jcim.8b00677)[4] and the URL of the Hit Dexter 2.0 web service (nerdd.univie.ac.at/hitdexter/).

Introduction

Hit Dexter 2.0 is a freely available web service for assessing the risk of small molecules to cause false-positive readouts in biochemical assays. Predictions obtained with Hit Dexter 2.0 may serve as a valuable tool for decision support and compound deprioritization but are not intended for use as a hard filter for discarding compounds. For a detailed discussion of the scope and limitations of Hit Dexter 2.0, see our recent publication.[4]

Users may request a copy of the web server software package for local installation and use (please contact us).

Hit Dexter 2.0 features a number of different models and approaches for compound vetting:

The definitions of the individual promiscuity classes are summarized in Table 1. For example, compounds are classified as “highly promiscuous” if their active-to-tested ratio (ATR) is greater than 5.4% for PSAs and greater than 10% for CDRAs.

The ATR is the ratio of the number of protein clusters in which a compound was measured as active on at least one protein to the total number of protein clusters on which the compound was tested.
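The ATR definition can be illustrated with a minimal Python sketch. The input layout (one `(cluster_id, is_active)` pair per measurement) and the function name are assumptions made for this example; they are not taken from the Hit Dexter 2.0 code base.

```python
from collections import defaultdict


def active_to_tested_ratio(measurements):
    """Return the ATR for one compound.

    `measurements` is an iterable of (cluster_id, is_active) pairs, one per
    assay outcome. This layout is hypothetical and chosen for illustration.
    """
    active_by_cluster = defaultdict(bool)
    for cluster_id, is_active in measurements:
        # A cluster counts as active once any of its proteins shows activity.
        active_by_cluster[cluster_id] = active_by_cluster[cluster_id] or bool(is_active)
    if not active_by_cluster:
        raise ValueError("compound has no measurements")
    # Active clusters divided by tested clusters.
    return sum(active_by_cluster.values()) / len(active_by_cluster)


# Three clusters tested, one of them with at least one active measurement.
example = [("cluster_1", False), ("cluster_1", True),
           ("cluster_2", False), ("cluster_3", False)]
print(active_to_tested_ratio(example))
```

Note that repeated measurements within one cluster do not change the denominator: the ratio is computed over clusters, not over individual assays.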

Each of the classifiers was trained on approximately 250k compounds (Table 1) which have been measured for activity on at least 50 distinct proteins (i.e. proteins with distinct sequence identity as determined by a protein sequence clustering approach). Morgan2 fingerprints served as descriptors.

Table 1. Composition of the Data Sets and Definition of Thresholds for Class Labeling.

| Assigned promiscuity class | Data set | Unique compounds in PSA50 | Unique compounds in CDRA50 | Threshold definition (a) | Threshold value, PSA50 (b) | Threshold value, CDRA50 (b) |
| --- | --- | --- | --- | --- | --- | --- |
| Non-promiscuous (NP) | Training set | 222 272 | 211 264 | ATR < ATRmean | 0.008 | 0.015 |
| Promiscuous (P) | Training set | 26 117 | 30 478 | ATR > ATRmean + 1σ (c) | 0.024 | 0.043 |
| Highly promiscuous (HP), a subset of compounds labeled P | Training set | 5 956 | 5 609 | ATR > ATRmean + 3σ (c) | 0.054 | 0.100 |

(a) Derived as part of our previous work.[5] Compounds with ATRs between ATRmean and ATRmean + 1σ were not assigned a promiscuity label and were removed from all data sets.

(b) ATR threshold values calculated for the individual data sets according to the ATR threshold definition.

(c) Standard deviation.
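As an illustration of how the thresholds in Table 1 translate into class labels, the following Python sketch assigns a label to a single ATR value. The function and dictionary names are hypothetical, and the treatment of values that fall exactly on a threshold is an assumption, since the table only gives strict inequalities.

```python
# ATR thresholds from Table 1: (ATRmean, ATRmean + 1 sigma, ATRmean + 3 sigma).
THRESHOLDS = {
    "PSA50": (0.008, 0.024, 0.054),
    "CDRA50": (0.015, 0.043, 0.100),
}


def promiscuity_label(atr, data_set):
    """Return 'NP', 'P', 'HP', or None for an ATR value (illustrative only)."""
    mean, mean_plus_1s, mean_plus_3s = THRESHOLDS[data_set]
    if atr < mean:
        return "NP"
    if atr > mean_plus_3s:
        return "HP"  # HP compounds are a subset of the compounds labeled P
    if atr > mean_plus_1s:
        return "P"
    # Between ATRmean and ATRmean + 1 sigma: no label, removed from the data sets.
    return None


print(promiscuity_label(0.06, "PSA50"))
print(promiscuity_label(0.06, "CDRA50"))
```

The same ATR can therefore lead to different labels for PSAs and CDRAs, because the thresholds are derived separately for each data set.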

Similarity-based approaches to measure the distance to known aggregators and dark chemical matter

Hit Dexter 2.0 reports the distance to the closest known aggregate-forming compound (aggregator) collected in a large data set that was published in 2015.[6] Hit Dexter 2.0 retrieves similarity values directly from the Aggregator Advisor web service.[7] The reported similarity is a Tanimoto coefficient based on the standard axonpath fingerprint in JChemBase (ChemAxon, Chemaxon.com). If the Aggregator Advisor web service cannot be reached, Hit Dexter 2.0 does not report a result for this calculation (the value reported will be “-1” or “null”).

Hit Dexter 2.0 also reports the distance of the compound(s) of interest to the closest of 140k compounds known as dark chemical matter.[8] Dark chemical matter describes compounds which have been tested in at least one hundred different biochemical assays and have never shown activity.
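The idea of reporting the similarity to the closest reference compound can be sketched in a few lines of Python. This is a toy example under stated assumptions: fingerprints are represented as plain sets of on-bit indices, and the names `tanimoto` and `closest_reference` are hypothetical. The web service itself relies on the fingerprints and reference sets described above, not on this code.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def closest_reference(query, references):
    """Return (name, similarity) of the reference most similar to `query`."""
    scored = ((name, tanimoto(query, fp)) for name, fp in references.items())
    return max(scored, key=lambda pair: pair[1])


# Toy bit sets standing in for real fingerprints.
references = {"ref_A": {1, 4, 9, 16}, "ref_B": {2, 4, 6, 8, 16}}
query = {1, 4, 16, 25}
name, similarity = closest_reference(query, references)
print(name, similarity)
```

A high similarity to a known aggregator suggests caution, whereas a high similarity to dark chemical matter points to a compound class that has rarely shown activity.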

Rule-based approaches to flag undesirable molecules

As additional information the following rule sets are provided:

Patterns that cannot be parsed by RDKit were corrected according to Table 2.

Table 2: Corrected SMARTS patterns.

Pattern source: BMS

Pattern from the rule: [F,Cl,Br,I,$(O(S(=O)(=O)))]-[CH,CH2;!$(CF2)]-[N,n]

Corrected pattern: [F,Cl,Br,I,$(O(S(=O)(=O)))]-[CH,CH2;!$(C(F)F)]-[N,n]

Alternative correction (not used): [F,Cl,Br,I,$(O(S(=O)(=O)))]-[CH,CH2;!$(CF)]-[N,n]

Pattern source: BMS

Pattern from the rule: [N,n,O,S;!$(S(=O)(=O))]-[CH,CH2;!$(CF2)][F,Cl,Br,I,$(O(S(=O)(=O)))]

Corrected pattern: [N,n,O,S;!$(S(=O)(=O))]-[CH,CH2;!$(C(F)F)][F,Cl,Br,I,$(O(S(=O)(=O)))]

Alternative correction (not used): [N,n,O,S;!$(S(=O)(=O))]-[CH,CH2;!$(CF)][F,Cl,Br,I,$(O(S(=O)(=O)))]

Pattern source: BMS

Pattern from the rule: [CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0]

Corrected pattern: [CH2,$(C(F)F);R0][CH2,$(C(F)F);R0][CH2,$(C(F)F);R0][CH2,$(C(F)F);R0][CH2,$(C(F)F);R0][CH2,$(C(F)F);R0][CH2,$(C(F)F);R0][CH2,$(C(F)F);R0]

Alternative correction (not used): [CH2,$(CF);R0][CH2,$(CF);R0][CH2,$(CF);R0][CH2,$(CF);R0][CH2,$(CF);R0][CH2,$(CF);R0][CH2,$(CF);R0][CH2,$(CF);R0]

Pattern source: BMS

Pattern from the rule: [[CH;!R];!$(C-N)]=C([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])

Corrected pattern: [CH;!R;!$(C-N)]=C([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])

Alternative correction (not used): -

Pattern source: MLSMR

Pattern from the rule: [N,C,S,O]-&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]

Corrected pattern: [N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]

Alternative correction (not used): [N,C,S,O]-&!@[N,C,S,O]!@[N,C,S,O]!@[N,C,S,O]!@[N,C,S,O]!@[N,C,S,O]!@[N,C,S,O]

Pattern source: MLSMR

Pattern from the rule: ac-*=&!@*-&!@C(=O)&!@ca

Corrected pattern: ac-*=&!@*-&!@C(=O)-&!@ca

Alternative correction (not used): -

Pattern source: MLSMR

Pattern from the rule: [#6,#7]&!@[#6](=&!@[CH])&!@C(=O)-&!@[C,N,O,S]

Corrected pattern: [#6,#7]-&!@[#6](=&!@[CH])-&!@C(=O)-&!@[C,N,O,S]

Alternative correction (not used): -

Pattern source: MLSMR

Pattern from the rule: *-C(=O)-&!@[NH]-C&!@C(=O)-&!@[NH]-*

Corrected pattern: *-C(=O)-&!@[NH]-C-&!@C(=O)-&!@[NH]-*

Alternative correction (not used): -

Pattern source: MLSMR

Pattern from the rule: ([#6]OP(=O)(*)O[#6].[#6]OP(=O)(*)O[#6].[#6]OP(=O)(*)O[#6])

Corrected pattern: [#6]OP(=O)(*)O[#6]

Alternative correction (not used): -

Pattern source: MLSMR

Pattern from the rule: c12cccc(C(=O)N(&!@C)C(=O)3)c2c3ccc1

Corrected pattern: c12cccc(C(=O)N(-&!@C)C(=O)3)c2c3ccc1

Alternative correction (not used): -

Usage

Data preparation and upload

Molecular structures are loaded onto the Hit Dexter 2.0 web service either by directly drawing a molecule with the JSME editor,[17] by pasting a SMILES into the field “Enter SMILES”, or by uploading a text file containing a list of SMILES. Hit Dexter 2.0 runs a thorough data preparation protocol to standardize the structural input. Therefore, chemical structures do not need to be preprocessed by the user with respect to hydrogen annotation, aromatization, protonation, tautomerism and stereochemistry. Salts are also recognized, and the minor components removed prior to calculations. For further information, see the original publication of Hit Dexter 2.0 (www.dx.doi.org/10.1021/acs.jcim.8b00677).

Lists of SMILES should be formatted as shown in the following examples:

Example 1: One SMILES per row with no additional data

CCOC(=O)N1CCN(CC1)C2=C(C(=O)C2=O)N3CCN(CC3)C4=CC=C(C=C4)OC

C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)O

C1=CC(=C(C=C1O)O)C(=O)C=CC2=CC(=C(C(=C2)O)O)O

Example 2: One SMILES per row with additional data (only a space is currently allowed as the separator)

CCOC(=O)N1CCN(CC1)C2=C(C(=O)C2=O)N3CCN(CC3)C4=CC=C(C=C4)OC PhantomPAINSexample

C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)O exampleAggegator

C1=CC(=C(C=C1O)O)C(=O)C=CC2=CC(=C(C(=C2)O)O)O exampleReactive
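Before uploading, a list file in either of the two formats above can be checked locally. The Python sketch below splits each row at the first space into a SMILES and an optional name. The function name `parse_smiles_lines` is hypothetical, and the sketch is not the parser used by the web service; it only mirrors the format described here.

```python
def parse_smiles_lines(lines):
    """Yield (smiles, name) pairs; name is None when a row has no extra data."""
    for raw in lines:
        row = raw.strip()
        if not row:
            continue  # ignore blank rows
        # Everything before the first space is the SMILES, the rest is the name.
        smiles, _, name = row.partition(" ")
        yield smiles, (name.strip() or None)


text = """\
CCOC(=O)N1CCN(CC1)C2=C(C(=O)C2=O)N3CCN(CC3)C4=CC=C(C=C4)OC PhantomPAINSexample
C1=CC(=C(C=C1O)O)C(=O)C=CC2=CC(=C(C(=C2)O)O)O
"""
records = list(parse_smiles_lines(text.splitlines()))
for smiles, name in records:
    print(name, smiles)
```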

Running the calculations

Calculations are started by clicking the “Submit” button. A new web page will load which reports on the progress of calculations and displays a web link that allows users to return and inspect the results once all calculations have been completed.

Analyzing the results

The results page mainly consists of a table (heat map) which presents the predictions for the query molecules and models.

Users not familiar with the methods and rule-sets employed by Hit Dexter are advised to closely observe the assessment provided in the “Comments” field. “Comments” summarizes the outcomes of all individual models of Hit Dexter 2.0 and also gives indications of the confidence of individual predictions (based on the similarity of the query compound(s) to the nearest neighbors in the training sets).

The following color gradients are used in the table:

The table rows can be sorted by clicking on the sorter symbols located within the header cells of the respective columns.

If exactly the same SMILES is provided more than once, the duplicates are removed. If names are provided, only the name of the last entry (with respect to the input file) is kept.
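The duplicate handling described above can be mimicked with a short Python sketch. It compares SMILES as exact strings and lets later rows overwrite the name of earlier ones. The function name is hypothetical, and what happens when the last duplicate carries no name is not specified here, so the sketch simply keeps whatever the last entry provides.

```python
def deduplicate(records):
    """Collapse identical SMILES strings, keeping the name of the last entry."""
    unique = {}
    for smiles, name in records:
        # A repeated key keeps its original position but takes the newest name.
        unique[smiles] = name
    return list(unique.items())


rows = [("CCO", "first"), ("c1ccccc1", "benzene"), ("CCO", "second")]
deduplicated = deduplicate(rows)
print(deduplicated)
```

Because the comparison is made on the text of the SMILES, two different SMILES strings that encode the same molecule are not treated as duplicates in this sketch.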

Exporting results

All results can be downloaded as a CSV file by clicking the “Download (.csv)” button on the results page.
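A downloaded file can be post-processed with standard tools. The following Python sketch reads a CSV and converts the placeholder values “-1” and “null”, which are reported when the Aggregator Advisor web service is unreachable, into `None`. The column names `smiles` and `aggregator_similarity` are hypothetical; check the header of the actual file before adapting the code.

```python
import csv
import io

# Placeholder values reported when no similarity could be retrieved.
MISSING = {"-1", "null"}


def load_results(handle, similarity_column):
    """Read CSV rows and turn missing-value markers in one column into None."""
    rows = []
    for row in csv.DictReader(handle):
        value = row[similarity_column].strip()
        row[similarity_column] = None if value in MISSING else float(value)
        rows.append(row)
    return rows


# In-memory stand-in for a downloaded file; the layout is an assumption.
sample = io.StringIO(
    "smiles,aggregator_similarity\n"
    "CCO,0.42\n"
    "c1ccccc1,-1\n"
    "CCN,null\n"
)
results = load_results(sample, "aggregator_similarity")
print(results)
```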

Analyzing error messages and warnings

For any structural input that cannot be successfully processed and standardized by Hit Dexter 2.0, an error code is reported. A description of the individual error codes is provided in Table 3. In case of questions, use the feedback and support page of our web service to get in touch with us (https://nerdd.univie.ac.at/feedback_support/).

If the Aggregator Advisor web service cannot be reached, Hit Dexter 2.0 does not report a result for this calculation (the value reported will be “-1” or “null”).

Table 3: Error Messages and Warnings.

| Code | Error message or warning |
| --- | --- |
| !1 | Invalid or empty input. No output was produced. If reported together with another code, the message of that code gives the reason for the invalidity. |
| A1 | The Aggregator Advisor server is not available at the moment. |
| S1 | The salt filter identified a multi-compound SMILES for which the core component could not be determined. A result was generated from the original input but is probably unreliable. |
| S0 | The salt filter has removed at least one component of the input SMILES. |
| W1 | The molecular weight is not between 200 and 900 Da. The prediction result may be unreliable. |
| E1 | Element types other than those present in the training data were detected. A result was generated but is probably unreliable. |
| C1 | The molecule was broken during the canonicalization procedure. Always reported together with ‘!1’. |
| N1 | The molecule was broken during the neutralization procedure. Always reported together with ‘!1’. |
| NNPH | Similarity of the compound to the training set is below 0.5. The prediction of the PSA model for highly promiscuous compounds may be unreliable. |
| NNPM | Similarity of the compound to the training set is below 0.5. The prediction of the PSA model for moderately or highly promiscuous compounds may be unreliable. |
| NNCH | Similarity of the compound to the training set is below 0.5. The prediction of the CDRA model for highly promiscuous compounds may be unreliable. |
| NNCM | Similarity of the compound to the training set is below 0.5. The prediction of the CDRA model for moderately or highly promiscuous compounds may be unreliable. |
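When many compounds are processed, the codes in Table 3 can be used to sort results into groups that need different levels of attention. The Python sketch below is one possible grouping and not part of Hit Dexter 2.0: the function name, the three return values, and the decision to treat S0 and A1 as informational are assumptions made for illustration.

```python
# Codes from Table 3, grouped by how a result is treated in this sketch.
NO_OUTPUT = {"!1"}  # C1 and N1 always come together with "!1"
UNRELIABLE = {"S1", "W1", "E1", "NNPH", "NNPM", "NNCH", "NNCM"}


def triage(codes):
    """Return 'failed', 'check', or 'ok' for the codes attached to one input."""
    codes = set(codes)
    if codes & NO_OUTPUT:
        return "failed"  # no output was produced for this input
    if codes & UNRELIABLE:
        return "check"  # a result exists but may be unreliable
    return "ok"  # no codes, or informational codes only (S0, A1)


print(triage(["!1", "C1"]))
print(triage(["S0", "W1"]))
print(triage(["S0"]))
```

Results flagged as "check" are not necessarily wrong; the codes only indicate that the input lies outside the range for which the models are expected to be reliable.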

License

The Hit Dexter 2.0 software package is published under the GNU General Public License, Version 3.

Contact, Suggestions and Bug Report

Conrad Stork: stork@zbh.uni-hamburg.de

Johannes Kirchmair: kirchmair@zbh.uni-hamburg.de

Funding

CS and JK are supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - project number KI 2085/1-1. JK is also supported by the Bergen Research Foundation (BFS) - grant no. BFS2017TMT01. YC is supported by the China Scholarship Council (201606010345). MS is supported by the Ministry of Education of the Czech Republic - project numbers NPU I-LO1220 and LM2015063.

References

[1]        F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, J. Mach. Learn. Res. 2011, 12, 2825–2830.

[2]        G. Lemaitre, F. Nogueira, C. K. Aridas, J. Mach. Learn. Res. 2017, 18, 1–5.

[3]        RDKit: Open-Source Cheminformatics Software: http://www.rdkit.org.

[4]        C. Stork, Y. Chen, M. Šícho, J. Kirchmair, J. Chem. Inf. Model. 2019, 59, 1030–1043.

[5]        C. Stork, J. Wagner, N.-O. Friedrich, C. de Bruyn Kops, M. Šícho, J. Kirchmair, ChemMedChem 2018, 13, 564–571.

[6]        J. J. Irwin, D. Duan, H. Torosyan, A. K. Doak, K. T. Ziebart, T. Sterling, G. Tumanian, B. K. Shoichet, J. Med. Chem. 2015, 58, 7076–7087.

[7]        “Aggregator Advisor / Shoichet Laboratory @ UCSF,” can be found under http://advisor.bkslab.org/.

[8]        A. M. Wassermann, E. Lounkine, D. Hoepfner, G. Le Goff, F. J. King, C. Studer, J. M. Peltier, M. L. Grippo, V. Prindle, J. Tao, et al., Nat. Chem. Biol. 2015, 11, 958–966.

[9]        M. Hann, B. Hudson, X. Lewell, R. Lifely, L. Miller, N. Ramsden, J. Chem. Inf. Comput. Sci. 1999, 39, 897–902.

[10]        R. Brenk, A. Schipani, D. James, A. Krasowski, I. H. Gilbert, J. Frearson, P. G. Wyatt, ChemMedChem 2008, 3, 435–444.

[11]        B. C. Pearce, M. J. Sofia, A. C. Good, D. M. Drexler, D. A. Stock, J. Chem. Inf. Model. 2006, 46, 1060–1068.

[12]        I. Sushko, E. Salmina, V. A. Potemkin, G. Poda, I. V. Tetko, J. Chem. Inf. Model. 2012, 52, 2310–2316.

[13]        “NIH Molecular Libraries Small Molecule Repository,” can be found under https://grants.nih.gov/grants/guide/notice-files/not-rm-07-005.html.

[14]        J. Blake, Med. Chem. 2005, 1, 649–655.

[15]        J. B. Baell, G. A. Holloway, J. Med. Chem. 2010, 53, 2719–2740.

[16]        S. J. Chakravorty, J. Chan, M. N. Greenwood, I. Popa-Burke, K. S. Remlinger, S. D. Pickett, D. V. S. Green, M. C. Fillmore, T. W. Dean, J. I. Luengo, et al., SLAS Discov 2018, 23, 532–544.

[17]        B. Bienfait, P. Ertl, J. Cheminform. 2013, 5, 24.