Digging through some Folding@Home data

32 minute read

Published:

Learning cheminformatics from some Folding@Home data

png

Top 10 (based on Hybrid2 docking score) small molecules

2020-05-06 - 2020-05-11

I have no formal training in cheminformatics, so I am going to be stumbling and learning as I wade through this dataset. I welcome any learning lessons from experts.

This will be an ongoing foray

Source: https://github.com/FoldingAtHome/covid-moonshot

Introduction

Folding@Home is a distributed computing project - allowing molecular simulations to be run in parallel across thousands of different computers with minimal communication. This, combined with other molecular modeling methods, has yielded a lot of open data for others to examine. In particular, I’m interested in the docking screens and compounds targeted by the F@H and postera collaborations

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
pd.options.display.max_columns = 999

moonshot_df = pd.read_csv('moonshot-submissions/covid_submissions_all_info.csv')
moonshot_df.head()
SMILESCIDcreatorfragmentslinkreal_spaceSCRBBextended_real_spacein_molport_or_mculein_ultimate_mculein_emoleculescovalent_fragcovalent_warheadacrylamideacrylamide_adductchloroacetamidechloroacetamide_adductvinylsulfonamidevinylsulfonamide_adductnitrilenitrile_adductMWcLogPHBDHBATPSAnum_criterion_violationsBMSDundeeGlaxoInpharmaticaLINTMLSMRPAINSSureChEMBLPostEraORDEREDMADEASSAYED
0CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cn1nnc2ccccc21AAR-POS-8a4e0f60-1Aaron Morris, PostErax0072https://covid.postera.ai/covid/submissions/AAR...Z1260533612FALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse371.4443.54200563.910PASSPASSPASSPASSPASSPASSPASSPASSPASSTrueFalseFalse
1O=C(Cn1nnc2ccccc21)NCc1ccc(Oc2cccnc2)c(F)c1AAR-POS-8a4e0f60-10Aaron Morris, PostErax0072https://covid.postera.ai/covid/submissions/AAR...Z826180044FALSEFALSEs_22____1723102____13206668FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse377.3793.07411681.930PASSPASSPASSPASSPASSPASSPASSPASSPASSTrueFalseFalse
2CN(Cc1nnc2ccccn12)C(=O)N(Cc1cccs1)c1ccc(Br)cc1AAR-POS-8a4e0f60-11Aaron Morris, PostErax0072https://covid.postera.ai/covid/submissions/AAR...FALSEFALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse456.3694.81190553.740PASSPASSPASSFilter9_metalaryl bromidePASSPASSPASSPASSTrueFalseFalse
3CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cc1noc2ccccc12AAR-POS-8a4e0f60-2Aaron Morris, PostErax0072https://covid.postera.ai/covid/submissions/AAR...Z1260535907FALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse371.4404.48100459.230PASSPASSPASSPASSPASSPASSPASSPASSPASSTrueFalseFalse
4O=C(NCc1noc2ccccc12)N(Cc1cccs1)c1ccc(F)cc1AAR-POS-8a4e0f60-3Aaron Morris, PostErax0072https://covid.postera.ai/covid/submissions/AAR...FALSEFALSEFALSEs_272164____9388766____17338746FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse381.4324.94481458.370PASSPASSPASSPASSPASSPASSPASSPASSPASSTrueFalseFalse

The moonshot data has a lot of logging/metadata information, some one-hot-encoding information about functional groups, and some additional columns about Glaxo, Dundee, BMS, Lint, PAINS, SureChEMBL - I’m not sure what those additional coluns mean, but the values are binary values, possibly the results of some other test or availability in another databases.

I’m going to focus on the molecular properties: MW, cLogP, HBD, HBA, TPSA

  • MW: Molecular Weight
  • cLogP: The logarithm of the partition coefficient (ratio of concentrations in octanol vs water, $\log{\frac{c_{octanol}}{c_{water}}}$)
  • HBD: Hydrogen bond donors
  • HBA: Hydrogen bond acceptors
  • TPSA: Topological polar surface area

Some of the correlations make some chemical sense - heavier molecules have more heavy atoms (O, N, F, etc.), but these heavier atoms are also the hydrogen bond acceptors. By that logic, more heavy atoms also coincides with more electronegative atoms, increasing your TPSA. It’s a little convoluted because TPSA looks at the surface, not necessarily the volume of the compound; geometry/shape will influence TPSA. There don’t appear to be any strong correlations with cLogP. Partition coefficients are a complex function of polarity, size/sterics, and shape - a 1:1 correlation with a singular, other variable will be hard to pinpoint

This csv file doesn’t have much other numerical data, but maybe some of those true/false, pass/fail data might be relevant…but I definitely need more context here

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA']
ax.matshow(moonshot_df[cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(moonshot_df[cols].corr().iterrows()):
    for j, (key, val) in enumerate(row.iteritems()):
        ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")

png

Some docking results

Okay here’s a couple other CSVs I found, these include some docking scores

  • Repurposing scores: “The Drug Repurposing Hub is a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and pre-clinical tool compounds with a companion information resource” source here, so a public dataset of some drugs
  • Redock scores: “This directory contains experiments in redocking all screened fragments into the entire ensemble of X-ray structures.” Taking fragments and re-docking them
repurposing_df = pd.read_csv('repurposing-screen/drugset-docked.csv')
redock_df = pd.read_csv('redock-fragments/all-screened-fragments-docked.csv')

SMILES strings, names, docking scores

repurposing_df.head()
SMILESTITLEHybrid2docked_fragmentMpro-_docksite
0C[C@@H](c1ccc-2c(c1)Cc3c2cccc3)C(=O)[O-]CHEMBL2104122-11.519580x07490.509349active-covalent
1C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4...CHEMBL1387-10.580162x07492.706928active-covalent
2CC(C)(C)c1cc(cc(c1O)C(C)(C)C)/C=C\2/C(=O)NC(=[...CHEMBL275835-10.557229x01071.801830active-noncovalent
3C[C@]12CC[C@@H]3[C@H]4CCCCC4=CC[C@H]3[C@@H]1CC...CHEMBL2104104-10.480992x07493.791700active-covalent
4CC(=O)[C@]1(CC[C@@H]2[C@@]1(CCC3=C4CCC(=O)C=C4...CHEMBL2104231-10.430775x07494.230903active-covalent

Hybrid2 looks like a docking method provided via OpenEye. Mpro likely refers to COVID-19 main protease. I’m not entirely sure what the receptor for “Hybrid2” is, but there seem to be multiple “sites” or “fragments” for docking. There are lots of different fragments, but very few sites. For each site-fragment combination, multiple small molecules may have been tested.

repurposing_df['docked_fragment'].value_counts()
x0195    114
x0749     69
x0678     58
x0397     45
x0104     24
x0161     21
x1077     19
x0072     14
x0874     13
x0354     13
x0689     10
x1382      7
x0708      4
x0434      4
x1093      3
x1392      2
x0395      2
x1402      2
x0831      2
x0107      2
x1385      2
x1418      2
x0387      2
x0830      2
x1478      1
x0786      1
x1187      1
x0692      1
x0967      1
x0426      1
x0305      1
x0946      1
x1386      1
x0759      1
Name: docked_fragment, dtype: int64
repurposing_df['site'].value_counts()
active-noncovalent    338
active-covalent       107
dimer-interface         1
Name: site, dtype: int64
repurposing_df.groupby(["docked_fragment", "site"]).count()
SMILESTITLEHybrid2Mpro-_dock
docked_fragmentsite
x0072active-noncovalent14141414
x0104active-noncovalent24242424
x0107active-noncovalent2222
x0161active-noncovalent21212121
x0195active-noncovalent114114114114
x0305active-noncovalent1111
x0354active-noncovalent13131313
x0387active-noncovalent2222
x0395active-noncovalent2222
x0397active-noncovalent45454545
x0426active-noncovalent1111
x0434active-noncovalent4444
x0678active-noncovalent58585858
x0689active-covalent10101010
x0692active-covalent1111
x0708active-covalent4444
x0749active-covalent69696969
x0759active-covalent1111
x0786active-covalent1111
x0830active-covalent2222
x0831active-covalent2222
x0874active-noncovalent13131313
x0946active-noncovalent1111
x0967active-noncovalent1111
x1077active-noncovalent19191919
x1093active-noncovalent3333
x1187dimer-interface1111
x1382active-covalent7777
x1385active-covalent2222
x1386active-covalent1111
x1392active-covalent2222
x1402active-covalent2222
x1418active-covalent2222
x1478active-covalent1111

Some molecules show up multiple times - why? Upon further investigation, this is mainly due to the molecule’s presence in multiple databases

repurposing_df.groupby(['SMILES']).count().sort_values("TITLE")
TITLEHybrid2docked_fragmentMpro-_docksite
SMILES
B(CCCC)(O)O11111
CCCc1ccccc1N11111
CCCc1cc(=O)[nH]c(=S)[nH]111111
CCC[N@@H+]1CCO[C@H]2[C@H]1CCc3c2cc(cc3)O11111
CCC[N@@H+]1CCC[C@H]2[C@H]1Cc3c[nH]nc3C211111
..................
C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4=CC(=O)CC[C@H]3422222
C[C@]12CC[C@H]3[C@H]([C@@H]1CCC2=O)CC(=C)C4=CC(=O)C=C[C@]34C22222
CC(C)C[C@@H](C1(CCC1)c2ccc(cc2)Cl)[NH+](C)C22222
CC[C@](/C=C/Cl)(C#C)O22222
CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]3422222

432 rows × 5 columns

repurposing_df[repurposing_df['SMILES']=="CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34"]
SMILESTITLEHybrid2docked_fragmentMpro-_docksite
82CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...CHEMBL2107797-9.002963x07492.616094active-covalent
105CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...EDRUG178-8.705896x01042.248707active-noncovalent

There doesn’t seem to be a very good correlation between the two docking scores - if these are docking scores to different receptors, that would help explain things. It’s worth noting that we’re not seeing if the two numbers agree for each molecule, but if the trends persist (both scores go up for this molecule, but go down for this other molecule). The weak correlation suggests the trends do not persist between the two docking measures

repurposing_df[['Hybrid2', 'Mpro-_dock']].corr()
Hybrid2Mpro-_dock
Hybrid21.0000000.581966
Mpro-_dock0.5819661.000000

Redocking dataframe: SMILES, names, data collection information, docking scores

redock_df.head()
SMILESTITLEfragmentsCompoundCodeUnnamed: 4covalent_warheadMountingResultDataCollectionOutcomeDataProcessingResolutionHighRefinementOutcomeDeposition_PDB_IDHybrid2docked_fragmentMpro-x0500_docksite
0c1ccc(c(c1)NCc2ccn[nH]2)Fx0500x0500Z1545196403NaNFalseOK: No comment:No commentsuccess2.197 - Analysed & RejectedNaN-11.881923x0678-2.501554active-noncovalent
1Cc1ccccc1OCC(=O)Nc2ncccn2x0415x0415Z53834613NaNFalseOK: No comment:No commentsuccess1.627 - Analysed & RejectedNaN-11.622278x0678NaNactive-noncovalent
2Cc1csc(n1)CNC(=O)c2ccn[nH]2x0356x0356Z466628048NaNFalseOK: No comment:No commentsuccess3.257 - Analysed & RejectedNaN-11.435024x0678NaNactive-noncovalent
3Cc1csc(n1)CNC(=O)c2ccn[nH]2x1113x1113Z466628048NaNFalseOK: No comment:No commentsuccess1.577 - Analysed & RejectedNaN-11.435024x0678NaNactive-noncovalent
4c1cc(cnc1)NC(=O)CC2CCCCC2x0678x0678Z31792168NaNFalseMounted_Clearsuccess1.836 - Deposited5R84-11.355046x0678NaNactive-noncovalent

There don’t seem to be many Mpro docking scores in this dataset (only one molecule has a non-null Mpro docking score)

redock_df[redock_df['Mpro-x0500_dock'].isnull()].count()
SMILES                          1452
TITLE                           1452
fragments                       1452
CompoundCode                    1452
Unnamed: 4                         0
covalent_warhead                1452
MountingResult                  1452
DataCollectionOutcome           1452
DataProcessingResolutionHigh    1357
RefinementOutcome               1306
Deposition_PDB_ID                 78
Hybrid2                         1452
docked_fragment                 1452
Mpro-x0500_dock                    0
site                            1452
dtype: int64
redock_df[~redock_df['Mpro-x0500_dock'].isnull()].count()
SMILES                          1
TITLE                           1
fragments                       1
CompoundCode                    1
Unnamed: 4                      0
covalent_warhead                1
MountingResult                  1
DataCollectionOutcome           1
DataProcessingResolutionHigh    1
RefinementOutcome               1
Deposition_PDB_ID               0
Hybrid2                         1
docked_fragment                 1
Mpro-x0500_dock                 1
site                            1
dtype: int64

Are there overlaps in the molecules in each of these datasets?

repurpose_redock = repurposing_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))
moonshot_redock = moonshot_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))
repurpose_redock
SMILESTITLE_LHybrid2_Ldocked_fragment_LMpro-_docksite_LTITLE_RfragmentsCompoundCodeUnnamed: 4covalent_warheadMountingResultDataCollectionOutcomeDataProcessingResolutionHighRefinementOutcomeDeposition_PDB_IDHybrid2_Rdocked_fragment_RMpro-x0500_docksite_R
0Cc1cc(=O)n([nH]1)c2ccccc2CHEMBL290916-7.889587x0195-2.068452active-noncovalentx0297x0297Z50145861NaNFalseOK: No comment:No commentsuccess1.987 - Analysed & RejectedNaN-7.889587x0195NaNactive-noncovalent
1CC(C)Nc1ncccn1CHEMBL1740513-7.178702x0072-1.248482active-noncovalentx0583x0583Z31190928NaNFalseOK: No comment:No commentsuccess3.087 - Analysed & RejectedNaN-7.293537x1093NaNactive-noncovalent
2CC(C)Nc1ncccn1CHEMBL1740513-7.178702x0072-1.248482active-noncovalentx1102x1102Z31190928NaNFalseOK: No comment:No commentsuccess1.467 - Analysed & RejectedNaN-7.293537x1093NaNactive-noncovalent
3C[C@H](C(=O)[O-])OCHEMBL1200559-5.675188x0397-0.179049active-noncovalentx1035x1035Z1741982441NaNFalseOK: No comment:No commentFailed - no diffractionNaNNaNNaN-6.505556x0397NaNactive-noncovalent
4CC(=O)C(=O)[O-]DB00119-5.448891x0689-0.494791active-covalentx1037x1037Z1741977082NaNFalseOK: No comment:No commentFailed - no diffractionNaNNaNNaN-5.448891x0689NaNactive-covalent
5CCC(=O)[O-]CHEMBL14021-5.374838x0397-0.555688active-noncovalentx1029x1029Z955123616NaNFalseOK: No comment:No commentsuccess1.737 - Analysed & RejectedNaN-5.135675x0689NaNactive-covalent
6C1CNCC[NH2+]1CHEMBL1412-5.079155x03541.716032active-noncovalentx0996x0996Z1245537944NaNFalseOK: No comment:No commentsuccess1.967 - Analysed & RejectedNaN-4.675085x0354NaNactive-noncovalent

We joined on SMILES string, and now we can compare the docking scores between the repurposing and redocking datasets.

Some Hybrid2 scores look quantitatively similar, but for those that don’t, the ranking is still there. Looking at the COVID-19 main protease (Mpro I believe?), the docking scores don’t follow similar rankings - docking scores aren’t transferable to different receptors (this might be a fairly obvious observation)

repurpose_redock[['SMILES', "TITLE_L", "TITLE_R", "Hybrid2_L", "Hybrid2_R", 'Mpro-_dock', 'Mpro-x0500_dock']]
SMILESTITLE_LTITLE_RHybrid2_LHybrid2_RMpro-_dockMpro-x0500_dock
0Cc1cc(=O)n([nH]1)c2ccccc2CHEMBL290916x0297-7.889587-7.889587-2.068452NaN
1CC(C)Nc1ncccn1CHEMBL1740513x0583-7.178702-7.293537-1.248482NaN
2CC(C)Nc1ncccn1CHEMBL1740513x1102-7.178702-7.293537-1.248482NaN
3C[C@H](C(=O)[O-])OCHEMBL1200559x1035-5.675188-6.505556-0.179049NaN
4CC(=O)C(=O)[O-]DB00119x1037-5.448891-5.448891-0.494791NaN
5CCC(=O)[O-]CHEMBL14021x1029-5.374838-5.135675-0.555688NaN
6C1CNCC[NH2+]1CHEMBL1412x0996-5.079155-4.6750851.716032NaN

Joining the moonshot submission and redocking datasets does not yield too many overlapping molecules

moonshot_redock
SMILESCIDcreatorfragments_Llinkreal_spaceSCRBBextended_real_spacein_molport_or_mculein_ultimate_mculein_emoleculescovalent_fragcovalent_warhead_Lacrylamideacrylamide_adductchloroacetamidechloroacetamide_adductvinylsulfonamidevinylsulfonamide_adductnitrilenitrile_adductMWcLogPHBDHBATPSAnum_criterion_violationsBMSDundeeGlaxoInpharmaticaLINTMLSMRPAINSSureChEMBLPostEraORDEREDMADEASSAYEDTITLEfragments_RCompoundCodeUnnamed: 4covalent_warhead_RMountingResultDataCollectionOutcomeDataProcessingResolutionHighRefinementOutcomeDeposition_PDB_IDHybrid2docked_fragmentMpro-x0500_docksite
0CC(C)Nc1cccnc1MAK-UNK-2c1752f0-4Maksym Voznyyx1093https://covid.postera.ai/covid/submissions/MAK...FALSEZ2574930241EN300-56005FALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse136.1981.90191224.920PASSPASSPASSPASSPASSPASSPASSPASSPASSFalseFalseFalsex1098x1098Z1259341037NaNFalseOK: No comment:No commentsuccess1.667 - Analysed & RejectedNaN-7.474369x0678NaNactive-noncovalent
1CC(C)Nc1cccnc1MAK-UNK-2c1752f0-4Maksym Voznyyx1093https://covid.postera.ai/covid/submissions/MAK...FALSEZ2574930241EN300-56005FALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse136.1981.90191224.920PASSPASSPASSPASSPASSPASSPASSPASSPASSFalseFalseFalsex0572x0572Z1259341037NaNFalseOK: No comment:No commentsuccess2.987 - Analysed & RejectedNaN-7.474369x0678NaNactive-noncovalent
2CCS(=O)(=O)Nc1ccccc1FMAK-UNK-2c1752f0-5Maksym Voznyyx1093https://covid.postera.ai/covid/submissions/MAK...FALSEZ53825177EN300-116204FALSEFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse203.2381.58731246.170PASSPASSPASSPASSPASSHetero_heteroPASSPASSPASSFalseFalseFalsex0247x0247Z53825177NaNFalseOK: No comment:No commentsuccess1.837 - Analysed & RejectedNaN-7.413380x0678NaNactive-noncovalent

Comparing other databases

CHEMBL, DrugBank, and “EDrug”(?) look to be the 3 prefixes in the “TITLE” column

from chembl_webresource_client.new_client import new_client
molecule = new_client.molecule
res = molecule.search('CHEMBL1387')
res_df = pd.DataFrame.from_dict(res)
res_df.columns
Index(['atc_classifications', 'availability_type', 'biotherapeutic',
       'black_box_warning', 'chebi_par_id', 'chirality', 'cross_references',
       'dosed_ingredient', 'first_approval', 'first_in_class', 'helm_notation',
       'indication_class', 'inorganic_flag', 'max_phase', 'molecule_chembl_id',
       'molecule_hierarchy', 'molecule_properties', 'molecule_structures',
       'molecule_synonyms', 'molecule_type', 'natural_product', 'oral',
       'parenteral', 'polymer_flag', 'pref_name', 'prodrug', 'score',
       'structure_type', 'therapeutic_flag', 'topical', 'usan_stem',
       'usan_stem_definition', 'usan_substem', 'usan_year', 'withdrawn_class',
       'withdrawn_country', 'withdrawn_flag', 'withdrawn_reason',
       'withdrawn_year'],
      dtype='object')
res_df[['chirality', 'molecule_properties', 'molecule_structures', 'score']]
chiralitymolecule_propertiesmolecule_structuresscore
01{'alogp': '3.64', 'aromatic_rings': 0, 'cx_log...{'canonical_smiles': 'C#C[C@]1(O)CC[C@H]2[C@@H...17.0
res_df[['molecule_properties']].values[0]
array([{'alogp': '3.64', 'aromatic_rings': 0, 'cx_logd': '2.81', 'cx_logp': '2.81', 'cx_most_apka': None, 'cx_most_bpka': None, 'full_molformula': 'C20H26O2', 'full_mwt': '298.43', 'hba': 2, 'hba_lipinski': 2, 'hbd': 1, 'hbd_lipinski': 1, 'heavy_atoms': 22, 'molecular_species': None, 'mw_freebase': '298.43', 'mw_monoisotopic': '298.1933', 'num_lipinski_ro5_violations': 0, 'num_ro5_violations': 0, 'psa': '37.30', 'qed_weighted': '0.55', 'ro3_pass': 'N', 'rtb': 0}],
      dtype=object)
res_df['molecule_properties'].apply(pd.Series)
alogparomatic_ringscx_logdcx_logpcx_most_apkacx_most_bpkafull_molformulafull_mwthbahba_lipinskihbdhbd_lipinskiheavy_atomsmolecular_speciesmw_freebasemw_monoisotopicnum_lipinski_ro5_violationsnum_ro5_violationspsaqed_weightedro3_passrtb
03.6402.812.81NoneNoneC20H26O2298.43221122None298.43298.19330037.300.55N0
all_results = [molecule.search(a) for a in repurposing_df['TITLE']]

Here’s a big Python function tangent.

For each chembl molecule, we’ve searched for it within the chembl, returning us a list (of length 1) containing a dictionary of properties.

All molecules have been compiled into a list, so we have a list of lists of dicionatires.

For sanity, we can use a Python filter to only retain the non-None results.

We can chain that with a Python map function to parse the first item from each molecule’s list. Recall, each molecule was a list with just one element, a dictionary. We can boil this down to only returning the dictionary (eliminating the list wrapper).

For validation, I’ve called next to look at the results

filtered = map(lambda x: x[0], filter(lambda x: x is not None, all_results))
next(filtered)
{'atc_classifications': [],
 'availability_type': -1,
 'biotherapeutic': None,
 'black_box_warning': 0,
 'chebi_par_id': None,
 'chirality': 0,
 'cross_references': [],
 'dosed_ingredient': False,
 'first_approval': None,
 'first_in_class': 0,
 'helm_notation': None,
 'indication_class': 'Anti-Inflammatory',
 'inorganic_flag': 0,
 'max_phase': 0,
 'molecule_chembl_id': 'CHEMBL2104122',
 'molecule_hierarchy': {'molecule_chembl_id': 'CHEMBL2104122',
  'parent_chembl_id': 'CHEMBL2104122'},
 'molecule_properties': {'alogp': '3.45',
  'aromatic_rings': 2,
  'cx_logd': '1.26',
  'cx_logp': '3.92',
  'cx_most_apka': '4.68',
  'cx_most_bpka': None,
  'full_molformula': 'C16H14O2',
  'full_mwt': '238.29',
  'hba': 1,
  'hba_lipinski': 2,
  'hbd': 1,
  'hbd_lipinski': 1,
  'heavy_atoms': 18,
  'molecular_species': 'ACID',
  'mw_freebase': '238.29',
  'mw_monoisotopic': '238.0994',
  'num_lipinski_ro5_violations': 0,
  'num_ro5_violations': 0,
  'psa': '37.30',
  'qed_weighted': '0.74',
  'ro3_pass': 'N',
  'rtb': 2},
 'molecule_structures': {'canonical_smiles': 'CC(C(=O)O)c1ccc2c(c1)Cc1ccccc1-2',
  'molfile': '\n     RDKit          2D\n\n 18 20  0  0  0  0  0  0  0  0999 V2000\n   -0.5375    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.5375    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4458    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4458    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.3625    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4875   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.4125   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.3292    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.4125    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3417   -0.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.3625    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.3500    1.1958    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n    4.2167   -0.6292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.3958    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.3958   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3417   -1.6417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.3458    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.3458    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  2  1  2  0\n  3  2  1  0\n  4  6  1  0\n  5  7  2  0\n  6  1  1  0\n  7  1  1  0\n  8 10  1  0\n  9  2  1  0\n 10  5  1  0\n 11  5  1  0\n 12  8  2  0\n 13  8  1  0\n 14  3  1  0\n 15  4  1  0\n 16 10  1  0\n 17 14  2  0\n 18 15  2  0\n  3  4  2  0\n  9 11  2  0\n 17 18  1  0\nM  END\n\n> <chembl_id>\nCHEMBL2104122\n\n> <chembl_pref_name>\nCICLOPROFEN\n\n',
  'standard_inchi': 'InChI=1S/C16H14O2/c1-10(16(17)18)11-6-7-15-13(8-11)9-12-4-2-3-5-14(12)15/h2-8,10H,9H2,1H3,(H,17,18)',
  'standard_inchi_key': 'LRXFKKPEBXIPMW-UHFFFAOYSA-N'},
 'molecule_synonyms': [{'molecule_synonym': 'Cicloprofen',
   'syn_type': 'BAN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'Cicloprofen',
   'syn_type': 'INN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'Cicloprofen',
   'syn_type': 'USAN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'SQ-20824',
   'syn_type': 'RESEARCH_CODE',
   'synonyms': 'SQ 20824'}],
 'molecule_type': 'Small molecule',
 'natural_product': 0,
 'oral': False,
 'parenteral': False,
 'polymer_flag': False,
 'pref_name': 'CICLOPROFEN',
 'prodrug': 0,
 'score': 16.0,
 'structure_type': 'MOL',
 'therapeutic_flag': False,
 'topical': False,
 'usan_stem': '-profen',
 'usan_stem_definition': 'anti-inflammatory/analgesic agents (ibuprofen type)',
 'usan_substem': '-profen',
 'usan_year': 1974,
 'withdrawn_class': None,
 'withdrawn_country': None,
 'withdrawn_flag': False,
 'withdrawn_reason': None,
 'withdrawn_year': None}

For now, I’m only really interested in the molecule_properties dictionary

filtered = [a[0]['molecule_properties'] for a in all_results if len(a) > 0]
chembl_df = pd.DataFrame(filtered)
chembl_df['TITLE'] = repurposing_df['TITLE']

Molecular properties contained in the chembl database

Here are the definitions I can dig up

  • alogp: (lipophilicity) partition coefficient
  • aromatic_rings: number of aromatic rings
  • cx_logd: distribution coefficient taking into account ionized and non-ionized forms
  • cx_most_apka: acidic pka
  • cx_most_bpka: basic pka
  • full_mwt: molecular weight (and also free base and monoisotopic masses)
  • hba: hydrogen bond acceptors (and hba_lipinski for lipinski definitiosn)
  • hbd: hydrogen bond donors (and hbd_lipinski)
  • heavy_atoms: number of heavy atoms
  • num_lipinski_ro5_violations: how many times this molecule violated Lipinski’s rule of five
  • num_ro5_violations: not sure, seems similar to lipinski rule of 5
  • psa: protein sequence alignment
  • qed_weighted: “quantitative estimate of druglikeness” (ranges between 0 and 1, with 1 being more favorable). This is based on a quantitatve mean of drugability functions
  • ro3_pass: rule of three
  • rtb: number of rotatable bonds
chembl_df.head()
alogparomatic_ringscx_logdcx_logpcx_most_apkacx_most_bpkafull_molformulafull_mwthbahba_lipinskihbdhbd_lipinskiheavy_atomsmolecular_speciesmw_freebasemw_monoisotopicnum_lipinski_ro5_violationsnum_ro5_violationspsaqed_weightedro3_passrtbTITLE
03.452.01.263.924.68NoneC16H14O2238.291.02.01.01.018.0ACID238.29238.09940.00.037.300.74N2.0CHEMBL2104122
13.640.02.812.81NoneNoneC20H26O2298.432.02.01.01.022.0None298.43298.19330.00.037.300.55N0.0CHEMBL1387
23.921.04.254.2510.152.86C18H24N2O2S332.474.04.02.03.023.0NEUTRAL332.47332.15580.00.075.680.76N1.0CHEMBL275835
34.310.04.044.04NoneNoneC20H28O284.441.01.01.01.021.0None284.44284.21400.00.020.230.52N0.0CHEMBL2104104
44.790.03.963.96NoneNoneC21H28O2312.452.02.00.00.023.0None312.45312.20890.00.034.140.70N1.0CHEMBL2104231
chembl_df.columns
Index(['alogp', 'aromatic_rings', 'cx_logd', 'cx_logp', 'cx_most_apka',
       'cx_most_bpka', 'full_molformula', 'full_mwt', 'hba', 'hba_lipinski',
       'hbd', 'hbd_lipinski', 'heavy_atoms', 'molecular_species',
       'mw_freebase', 'mw_monoisotopic', 'num_lipinski_ro5_violations',
       'num_ro5_violations', 'psa', 'qed_weighted', 'ro3_pass', 'rtb',
       'TITLE'],
      dtype='object')
chembl_df.corr()
aromatic_ringshbahba_lipinskihbdhbd_lipinskiheavy_atomsnum_lipinski_ro5_violationsnum_ro5_violationsrtb
aromatic_rings1.0000000.1925690.1785070.0149280.0361060.2490220.0310940.0310940.229124
hba0.1925691.0000000.8688590.0845530.0544090.451560-0.047705-0.047705-0.023690
hba_lipinski0.1785070.8688591.0000000.3486000.2942760.295864-0.070783-0.0707830.021812
hbd0.0149280.0845530.3486001.0000000.935710-0.172866-0.060462-0.0604620.040505
hbd_lipinski0.0361060.0544090.2942760.9357101.000000-0.211899-0.085660-0.0856600.084225
heavy_atoms0.2490220.4515600.295864-0.172866-0.2118991.0000000.3972400.3972400.259011
num_lipinski_ro5_violations0.031094-0.047705-0.070783-0.060462-0.0856600.3972401.0000001.0000000.345308
num_ro5_violations0.031094-0.047705-0.070783-0.060462-0.0856600.3972401.0000001.0000000.345308
rtb0.229124-0.0236900.0218120.0405050.0842250.2590110.3453080.3453081.000000

At a glance, no definite linear correlations among this crowd besides pKas, partition coefficients, mwt/hba

corr_df = chembl_df.corr()
cols = chembl_df.columns

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)

ax.imshow(chembl_df.corr(), cmap='RdBu')

ax.set_xticklabels(['']+cols)
ax.tick_params(axis='x', rotation=90)

ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(corr_df.iterrows()):
    for j, (key, val) in enumerate(row.iteritems()):
        ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")

png

Maybe there are higher-order correlations and relationship more appropriate for clustering and decomposition

cols = ['aromatic_rings', 'cx_logp',  'full_mwt', 'hba']
cleaned = (chembl_df[~chembl_df[cols]
                     .isnull()
                     .all(axis='columns', skipna=False)][cols]
           .astype('float')
           .fillna(0, axis='columns'))
from sklearn import preprocessing

normalized = preprocessing.scale(cleaned)

Appears to be maybe 4 clusters of these compounds examined by the covid-moonshot group

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(1,1)
ax.scatter(output[:,0], output[:,1])
ax.set_title("Aromatic rings, cx_logp, mwt, hba")
Text(0.5, 1.0, 'Aromatic rings, cx_logp, mwt, hba')

png

By taking turns leaving out some features, it looks like leaving out aromatic rings or hydrogen bond acceptors will diminish the cluster distinction.

Aromatic rings are huge and bulky components to small molecules, it makes sense that a chunk of the behavior corresponds to the aromatic rings. Similarly, hydrogen bond acceptors (heavy molecules) also induce van der Waals and electrostatics influences on small molecules. Left with only weight and partition coefficient, there’s mainly a continous behavior

def clean_df(cols):
    cleaned = (chembl_df[~chembl_df[cols]
                     .isnull()
                     .all(axis='columns', skipna=False)][cols]
           .astype('float')
           .fillna(0, axis='columns'))

    normalized = preprocessing.scale(cleaned)
    
    return normalized

cols = ['cx_logp',  'full_mwt', 'hba']
normalized = clean_df(cols)
tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(3,1, figsize=(8,8))
ax[0].scatter(output[:,0], output[:,1])
ax[0].set_title("cx_logp, mwt, hba")

cols = ['cx_logp',  'full_mwt', 'aromatic_rings']
normalized = clean_df(cols)

tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)

ax[1].scatter(output[:,0], output[:,1])
ax[1].set_title("aromatic_rings, cx_logp, mwt")

cols = ['cx_logp',  'full_mwt']
normalized = clean_df(cols)

ax[2].scatter(normalized[:,0], normalized[:,1])
ax[2].set_title("cx_logp, mwt")

fig.tight_layout()

png

DrugBank

I found someone had already downloaded the database. I may double-over these dataframes, but query the drugbank dataset rather than chembl

Some docking data

We have some smiles strings, molecular properties, docking scores, and information about the docking fragments

moonshot = pd.read_csv('moonshot-submissions/covid_submissions_all_info-docked-overlap.csv')
moonshot
SMILESTITLEcreatorfragmentslinkreal_spaceSCRBBextended_real_spacein_molport_or_mculein_ultimate_mculein_emoleculescovalent_fragcovalent_warheadacrylamideacrylamide_adductchloroacetamidechloroacetamide_adductvinylsulfonamidevinylsulfonamide_adductnitrilenitrile_adductMWcLogPHBDHBATPSAnum_criterion_violationsBMSDundeeGlaxoInpharmaticaLINTMLSMRPAINSSureChEMBLPostEraORDEREDMADEASSAYEDHybrid2docked_fragmentMpro-x1418_docksitenumber_of_overlapping_fragmentsoverlapping_fragmentsoverlap_scorevolume
0c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)ClMAK-UNK-9e4a73aa-2Maksym Voznyyx1418https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse366.7794.518900350.270PASSbeta-keto/anhydridePASSPASSPASSKetone, Dye 11PASSPASSPASSFalseFalseFalse-11.881256x14181.206534active-covalent3x0434,x0678,x08303.208124271.986084
1Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])NKIM-UNI-60f168f5-7Kim Tai Tran, University of Copenhagenx0107,x0991https://covid.postera.ai/covid/submissions/KIM...FALSEFALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse297.3621.229492588.000PASSimine, iminePASSPASSacyclic C=N-HImine 3PASSPASSPASSFalseFalseFalse-11.654112x0107NaNactive-noncovalent3x0107,x1412,x13924.753475232.815506
2c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)ClMAK-UNK-9e4a73aa-14Maksym Voznyyx1418https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse368.7552.724100669.780PASSPASSPASSPASSPASSPASSPASSPASSPASSFalseFalseFalse-10.460650x06782.716276active-noncovalent3x0678,x1412,x13925.520980266.688721
3Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...AUS-WAB-916db9c0-1Austin D. Chivington, Wabash Collegex0107,x1077,x1374https://covid.postera.ai/covid/submissions/AUS...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse351.4503.519321557.950non_ring_acetalhet-C-het not in ringPASSFilter10_Terminal_vinylPASSPASSPASSPASSPASSFalseFalseFalse-9.516450x0678NaNactive-noncovalent3x0434,x0831,x06783.446572284.195312
4c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)ODRV-DNY-ae159ed1-12Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...x1249https://covid.postera.ai/covid/submissions/DRV...FALSEFALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse276.2953.231501463.080PASSPASSPASSFilter44_michael_acceptor2PASSKetone, Dye 9, vinyl michael acceptor1PASSPASSPASSFalseFalseFalse-9.243208x0678NaNactive-noncovalent3x0434,x0678,x08302.865147220.275421
...................................................................................................................................................
4630C[C@H]([C@@H](C(=O)N[C@H](Cc1ccccc1)C(=O)N[C@@...PAU-UNI-6d15a9f5-4paul brear, University of cambridgex1086https://covid.postera.ai/covid/submissions/PAU...FALSEFALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse714.821-0.91270811256.104PASSPASSPASSPASSPASSLong aliphatic chain, DipeptidePASSPASSPASSFalseFalseFalse3.175111x0305NaNactive-noncovalent0NaN5.297134548.583191
4631c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...MAK-UNK-e05327b2-2Maksym Voznyyx1402https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueTrueFalseFalseTrueFalseFalseFalseFalseFalse837.9646.631900998.312PASSPASSPASSPASSPASSHetero_heteroPASSPASSPASSFalseFalseFalse3.561681x1392NaNactive-covalent0NaN3.297014591.877563
4632Cc1cccc(c1)C[NH+]2CCN(CC2)C(=O)c3ccc(cc3)C#Cc4...MAK-UNK-e4a48a85-16Maksym Voznyyx0387,x0692https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse574.7946.188920539.682PASStriple bondPASSPASSPASSPASSPASSPASSPASSFalseFalseFalse4.056698x0978NaNactive-covalent0NaN4.360606470.944824
4633c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...MAK-UNK-e05327b2-6Maksym Voznyyx1402https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueTrueFalseFalseTrueFalseFalseFalseFalseFalse990.1835.19160012138.933alpha_halo_heteroatom, secondary_halide_sulfatePASSPASSPASSPASSHetero_heteroPASSDithiomethylene_acetalAlkyl HalideFalseFalseFalse4.242827x0731NaNactive-covalent0NaN4.193186694.333069
4634Cc1cccc(c1)C[NH+]2CCN(CC2)c3cc(c(c(c3)Cl)c4cc5...MAK-UNK-e4a48a85-15Maksym Voznyyx0387,x0692https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse659.6877.363621768.362PASSPASSPASSPASSPASSPASSPASSPASSPASSFalseFalseFalse5.966927x0705NaNactive-covalent0NaN1.473711503.583801

4635 rows × 48 columns

moonshot.head(5)
SMILESTITLEcreatorfragmentslinkreal_spaceSCRBBextended_real_spacein_molport_or_mculein_ultimate_mculein_emoleculescovalent_fragcovalent_warheadacrylamideacrylamide_adductchloroacetamidechloroacetamide_adductvinylsulfonamidevinylsulfonamide_adductnitrilenitrile_adductMWcLogPHBDHBATPSAnum_criterion_violationsBMSDundeeGlaxoInpharmaticaLINTMLSMRPAINSSureChEMBLPostEraORDEREDMADEASSAYEDHybrid2docked_fragmentMpro-x1418_docksitenumber_of_overlapping_fragmentsoverlapping_fragmentsoverlap_scorevolume
0c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)ClMAK-UNK-9e4a73aa-2Maksym Voznyyx1418https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse366.7794.518900350.270PASSbeta-keto/anhydridePASSPASSPASSKetone, Dye 11PASSPASSPASSFalseFalseFalse-11.881256x14181.206534active-covalent3x0434,x0678,x08303.208124271.986084
1Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])NKIM-UNI-60f168f5-7Kim Tai Tran, University of Copenhagenx0107,x0991https://covid.postera.ai/covid/submissions/KIM...FALSEFALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse297.3621.229492588.000PASSimine, iminePASSPASSacyclic C=N-HImine 3PASSPASSPASSFalseFalseFalse-11.654112x0107NaNactive-noncovalent3x0107,x1412,x13924.753475232.815506
2c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)ClMAK-UNK-9e4a73aa-14Maksym Voznyyx1418https://covid.postera.ai/covid/submissions/MAK...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse368.7552.724100669.780PASSPASSPASSPASSPASSPASSPASSPASSPASSFalseFalseFalse-10.460650x06782.716276active-noncovalent3x0678,x1412,x13925.520980266.688721
3Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...AUS-WAB-916db9c0-1Austin D. Chivington, Wabash Collegex0107,x1077,x1374https://covid.postera.ai/covid/submissions/AUS...FALSEFALSEFALSEFALSEFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse351.4503.519321557.950non_ring_acetalhet-C-het not in ringPASSFilter10_Terminal_vinylPASSPASSPASSPASSPASSFalseFalseFalse-9.516450x0678NaNactive-noncovalent3x0434,x0831,x06783.446572284.195312
4c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)ODRV-DNY-ae159ed1-12Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...x1249https://covid.postera.ai/covid/submissions/DRV...FALSEFALSEFALSEFALSEFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse276.2953.231501463.080PASSPASSPASSFilter44_michael_acceptor2PASSKetone, Dye 9, vinyl michael acceptor1PASSPASSPASSFalseFalseFalse-9.243208x0678NaNactive-noncovalent3x0434,x0678,x08302.865147220.275421
moonshot['Mpro-x1418_dock'].isnull().sum() # Lots of missing Mpro dock scores
4586

While there are a lot of different fragments to which the small molecule can bind, there are two “classes”, active-covalent and active-noncovalent (possibly referring to sites that covalently bond?)

This presents a way to logically bisect the data based on some fundamental chemistry of the binding pocket.

moonshot['docked_fragment'].value_counts()
x0678    940
x0749    771
x0104    347
x0831    283
x0830    281
x0195    269
x0161    252
x0107    201
x0072    172
x1077    127
x1392    107
x1093    107
x0434    105
x0874     81
x1385     69
x1418     58
x1334     50
x0967     46
x0397     42
x0946     38
x0692     37
x0759     37
x1386     35
x0395     29
x0305     24
x1311     16
x0708     13
x0774     12
x1380     10
x1412      7
x1374      7
x1348      6
x0770      5
x1249      5
x0387      5
x0736      4
x0705      4
x1358      3
x0426      3
x1375      3
x0734      3
x0540      3
x0354      3
x1382      3
x0755      1
x1458      1
x0689      1
x0769      1
x0981      1
x0978      1
x0731      1
x1493      1
x0771      1
x1478      1
x1384      1
x1351      1
Name: docked_fragment, dtype: int64
moonshot['site'].value_counts()
active-noncovalent    2799
active-covalent       1836
Name: site, dtype: int64

We can examine the same correlations, but now for each type of site, and look at the hybrid docking score correlations.

The biggest trend differences appear with the partition coefficient and number of hydrogen bond donors, but still the correlations are extremely weak

site_type = 'active-noncovalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
ax.matshow(moonshot[moonshot['site']==site_type][cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(moonshot[moonshot['site']==site_type][cols].corr().iterrows()):
    for j, (key, val) in enumerate(row.iteritems()):
        ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")
ax.set_title(f"Docking to {site_type}")
Text(0.5, 1.05, 'Docking to active-noncovalent')

png

site_type = 'active-covalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
ax.matshow(moonshot[moonshot['site']==site_type][cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(moonshot[moonshot['site']==site_type][cols].corr().iterrows()):
    for j, (key, val) in enumerate(row.iteritems()):
        ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")
ax.set_title(f"Docking to {site_type}")
Text(0.5, 1.05, 'Docking to active-covalent')

png

In general, lower docking score seem better, so the noncovalent sites might present more optimal binding locations (see histogram below). This seems non-intuitive because, if active-covalent really means sites that bond covalently, then covalent bonds would seem more energetically favorable than non-covalent interactions. Alternatively, forming covalent bonds might suggest an unstable region of the complex that could be shielded from the surroundings, inhibiting any sort of small molecule from binding the pocket? Expert opinion would be much appreciated here


fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
covalent_mean = moonshot[moonshot['site']=='active-covalent']['Hybrid2'].mean()
noncovalent_mean = moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'].mean()

ax.hist(moonshot[moonshot['site']=='active-covalent']['Hybrid2'], alpha=0.5, 
        label=f'active-covalent (mean={covalent_mean:.3f})')
ax.hist(moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'], alpha=0.5, 
        label=f'active-noncovalent (mean={noncovalent_mean:.3f})')

ax.set_title(f"Hybrid2 histogram")
ax.set_xlabel("Hybrid2 score")
ax.legend()
<matplotlib.legend.Legend at 0x7fac6b459850>

png

from rdkit import Chem
rdkit_smiles = [Chem.MolFromSmiles(a) for a in moonshot.sort_values('Hybrid2', ascending=True)['SMILES'].head(10)]
scores = [f"{a:.3f}" for a in moonshot.sort_values('Hybrid2', ascending=True)['Hybrid2'].head(10)]
img=Chem.Draw.MolsToGridImage(rdkit_smiles,molsPerRow=5,subImgSize=(200,200),
                             legends=scores)
img

png