Digging through some Folding@Home data

32 minute read

Published: May 06, 2020

Learning cheminformatics from some Folding@Home data

Top 10 (based on Hybrid2 docking score) small molecules

2020-05-06 - 2020-05-11

I have no formal training in cheminformatics, so I will be stumbling and learning as I wade through this dataset. I welcome any lessons or corrections from experts.

This will be an ongoing foray.

Source: https://github.com/FoldingAtHome/covid-moonshot

Introduction

Folding@Home is a distributed computing project that lets molecular simulations run in parallel across thousands of computers with minimal communication between them. Combined with other molecular modeling methods, this has produced a lot of open data for others to examine. In particular, I’m interested in the docking screens and compounds targeted by the F@H and PostEra collaboration.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
pd.options.display.max_columns = 999

moonshot_df = pd.read_csv('moonshot-submissions/covid_submissions_all_info.csv')

moonshot_df.head()

	SMILES	CID	creator	fragments	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED
0	CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cn1nnc2ccccc21	AAR-POS-8a4e0f60-1	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	Z1260533612	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	371.444	3.5420	0	5	63.91	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False
1	O=C(Cn1nnc2ccccc21)NCc1ccc(Oc2cccnc2)c(F)c1	AAR-POS-8a4e0f60-10	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	Z826180044	FALSE	FALSE	s_22____1723102____13206668	False	False	False	False	False	False	False	False	False	False	False	False	False	377.379	3.0741	1	6	81.93	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False
2	CN(Cc1nnc2ccccn12)C(=O)N(Cc1cccs1)c1ccc(Br)cc1	AAR-POS-8a4e0f60-11	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	456.369	4.8119	0	5	53.74	PASS	PASS	PASS	Filter9_metal	aryl bromide	PASS	PASS	PASS	PASS	True	False	False
3	CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cc1noc2ccccc12	AAR-POS-8a4e0f60-2	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	Z1260535907	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	371.440	4.4810	0	4	59.23	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False
4	O=C(NCc1noc2ccccc12)N(Cc1cccs1)c1ccc(F)cc1	AAR-POS-8a4e0f60-3	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	FALSE	FALSE	FALSE	s_272164____9388766____17338746	False	False	False	False	False	False	False	False	False	False	False	False	False	381.432	4.9448	1	4	58.37	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False

The moonshot data has a lot of logging/metadata information, some one-hot-encoded information about functional groups, and some additional columns named BMS, Dundee, Glaxo, Inpharmatica, LINT, MLSMR, PAINS, and SureChEMBL. I’m not sure what those additional columns mean. Most values are PASS, with an occasional named flag (e.g. Filter9_metal or aryl bromide), so they look like the results of some set of substructure filters or checks against other databases.
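One way to get a feel for those filter columns is to count how many rows report anything other than PASS. This is a minimal sketch on a hand-copied slice of the rows shown above; the tiny `demo` frame and the three-column selection are for illustration only, and the same two lines would apply to the full `moonshot_df`.

```python
import pandas as pd

# Three rows hand-copied from the head() output above, filter columns only
demo = pd.DataFrame({
    "Inpharmatica": ["PASS", "PASS", "Filter9_metal"],
    "LINT": ["PASS", "PASS", "aryl bromide"],
    "PAINS": ["PASS", "PASS", "PASS"],
})

# True wherever a filter reported something other than PASS
flagged = demo.ne("PASS")

print(flagged.sum())              # flagged rows per filter column
print(flagged.any(axis=1).sum())  # rows flagged by at least one filter
```

Counting flags per column is a cheap first look before trying to work out what each filter actually tests for.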

I’m going to focus on the molecular properties: MW, cLogP, HBD, HBA, TPSA

MW: Molecular Weight
cLogP: The logarithm of the partition coefficient (ratio of concentrations in octanol vs water, $\log{\frac{c_{octanol}}{c_{water}}}$)
HBD: Hydrogen bond donors
HBA: Hydrogen bond acceptors
TPSA: Topological polar surface area
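To make the logP definition concrete, here is a small worked sketch. The concentrations are made-up numbers rather than values from the dataset, and `log_p` is a hypothetical helper name. (The "c" in cLogP indicates a calculated value rather than a measured one.)

```python
import math

def log_p(c_octanol: float, c_water: float) -> float:
    """Base-10 log of the octanol/water concentration ratio."""
    return math.log10(c_octanol / c_water)

# Hypothetical concentrations (same units for both phases)
lipophilic = log_p(c_octanol=100.0, c_water=1.0)   # prefers octanol
hydrophilic = log_p(c_octanol=1.0, c_water=100.0)  # prefers water
print(lipophilic, hydrophilic)
```

A positive value means the compound partitions preferentially into octanol (more lipophilic); a negative value means it prefers water.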

Some of the correlations make chemical sense. Heavier molecules contain more atoms, including more heteroatoms (O, N, F, etc.), and those heteroatoms are also the hydrogen bond acceptors. By the same logic, more electronegative atoms means more polar surface, which increases TPSA. It’s a little convoluted because TPSA only counts the surface contributed by polar atoms (mostly N and O, estimated from the 2D structure), not the size of the whole compound, so a large but mostly hydrocarbon molecule can still have a small TPSA. There don’t appear to be any strong correlations with cLogP. Partition coefficients are a complex function of polarity, size/sterics, and shape, so a simple correlation with any single other variable will be hard to pinpoint.

This csv file doesn’t have much other numerical data. Some of the true/false and pass/fail columns might be relevant, but I definitely need more context there.

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA']
ax.matshow(moonshot_df[cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

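# matshow draws row i on the y-axis and column j on the x-axis, so annotate at (x=j, y=i)
corr = moonshot_df[cols].corr()
for i, (rowname, row) in enumerate(corr.iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")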

[Figure: correlation matrix of MW, cLogP, HBD, HBA, and TPSA]

Some docking results

Okay, here are a couple of other CSVs I found that include docking scores.

Repurposing scores: “The Drug Repurposing Hub is a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and pre-clinical tool compounds with a companion information resource” (quoted from the source), so this is a public dataset of known drugs.
Redock scores: “This directory contains experiments in redocking all screened fragments into the entire ensemble of X-ray structures.” In other words, taking the screened fragments and re-docking them.

repurposing_df = pd.read_csv('repurposing-screen/drugset-docked.csv')
redock_df = pd.read_csv('redock-fragments/all-screened-fragments-docked.csv')

SMILES strings, names, docking scores

repurposing_df.head()

	SMILES	TITLE	Hybrid2	docked_fragment	Mpro-_dock	site
0	C[C@@H](c1ccc-2c(c1)Cc3c2cccc3)C(=O)[O-]	CHEMBL2104122	-11.519580	x0749	0.509349	active-covalent
1	C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4...	CHEMBL1387	-10.580162	x0749	2.706928	active-covalent
2	CC(C)(C)c1cc(cc(c1O)C(C)(C)C)/C=C\2/C(=O)NC(=[...	CHEMBL275835	-10.557229	x0107	1.801830	active-noncovalent
3	C[C@]12CC[C@@H]3[C@H]4CCCCC4=CC[C@H]3[C@@H]1CC...	CHEMBL2104104	-10.480992	x0749	3.791700	active-covalent
4	CC(=O)[C@]1(CC[C@@H]2[C@@]1(CCC3=C4CCC(=O)C=C4...	CHEMBL2104231	-10.430775	x0749	4.230903	active-covalent

Hybrid2 looks like a docking method provided via OpenEye. Mpro likely refers to the SARS-CoV-2 (COVID-19) main protease. I’m not entirely sure what the receptor for “Hybrid2” is, but there seem to be multiple “sites” and “fragments” for docking: lots of different fragments, but very few sites. For each site-fragment combination, multiple small molecules may have been tested.

repurposing_df['docked_fragment'].value_counts()

x0195    114
x0749     69
x0678     58
x0397     45
x0104     24
x0161     21
x1077     19
x0072     14
x0874     13
x0354     13
x0689     10
x1382      7
x0708      4
x0434      4
x1093      3
x1392      2
x0395      2
x1402      2
x0831      2
x0107      2
x1385      2
x1418      2
x0387      2
x0830      2
x1478      1
x0786      1
x1187      1
x0692      1
x0967      1
x0426      1
x0305      1
x0946      1
x1386      1
x0759      1
Name: docked_fragment, dtype: int64

repurposing_df['site'].value_counts()

active-noncovalent    338
active-covalent       107
dimer-interface         1
Name: site, dtype: int64

repurposing_df.groupby(["docked_fragment", "site"]).count()

		SMILES	TITLE	Hybrid2	Mpro-_dock
docked_fragment	site
x0072	active-noncovalent	14	14	14	14
x0104	active-noncovalent	24	24	24	24
x0107	active-noncovalent	2	2	2	2
x0161	active-noncovalent	21	21	21	21
x0195	active-noncovalent	114	114	114	114
x0305	active-noncovalent	1	1	1	1
x0354	active-noncovalent	13	13	13	13
x0387	active-noncovalent	2	2	2	2
x0395	active-noncovalent	2	2	2	2
x0397	active-noncovalent	45	45	45	45
x0426	active-noncovalent	1	1	1	1
x0434	active-noncovalent	4	4	4	4
x0678	active-noncovalent	58	58	58	58
x0689	active-covalent	10	10	10	10
x0692	active-covalent	1	1	1	1
x0708	active-covalent	4	4	4	4
x0749	active-covalent	69	69	69	69
x0759	active-covalent	1	1	1	1
x0786	active-covalent	1	1	1	1
x0830	active-covalent	2	2	2	2
x0831	active-covalent	2	2	2	2
x0874	active-noncovalent	13	13	13	13
x0946	active-noncovalent	1	1	1	1
x0967	active-noncovalent	1	1	1	1
x1077	active-noncovalent	19	19	19	19
x1093	active-noncovalent	3	3	3	3
x1187	dimer-interface	1	1	1	1
x1382	active-covalent	7	7	7	7
x1385	active-covalent	2	2	2	2
x1386	active-covalent	1	1	1	1
x1392	active-covalent	2	2	2	2
x1402	active-covalent	2	2	2	2
x1418	active-covalent	2	2	2	2
x1478	active-covalent	1	1	1	1
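The `count()` call above repeats the same number in every column. A slightly tidier sketch uses `size()`, and a second group-by can check whether any fragment ever appears under more than one site. The `demo` frame below holds a few made-up (fragment, site) rows in the style of the table; it is not the real dataset.

```python
import pandas as pd

# A few (fragment, site) pairs in the style of the table above
demo = pd.DataFrame({
    "docked_fragment": ["x0749", "x0749", "x0107", "x0104", "x1187"],
    "site": ["active-covalent", "active-covalent", "active-noncovalent",
             "active-noncovalent", "dimer-interface"],
})

# One count per (fragment, site) pair
pair_counts = demo.groupby(["docked_fragment", "site"]).size()
print(pair_counts)

# Does any fragment appear under more than one site?
sites_per_fragment = demo.groupby("docked_fragment")["site"].nunique()
print((sites_per_fragment > 1).any())
```

If the second check comes back False on the full data, each docked fragment belongs to exactly one site, which matches how the table above reads.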

Some molecules show up multiple times. Why? On closer inspection, it’s mainly because the same molecule appears under more than one database identifier (for example, a CHEMBL ID and an EDRUG ID).

repurposing_df.groupby(['SMILES']).count().sort_values("TITLE")

	TITLE	Hybrid2	docked_fragment	Mpro-_dock	site
SMILES
B(CCCC)(O)O	1	1	1	1	1
CCCc1ccccc1N	1	1	1	1	1
CCCc1cc(=O)[nH]c(=S)[nH]1	1	1	1	1	1
CCC[N@@H+]1CCO[C@H]2[C@H]1CCc3c2cc(cc3)O	1	1	1	1	1
CCC[N@@H+]1CCC[C@H]2[C@H]1Cc3c[nH]nc3C2	1	1	1	1	1
...	...	...	...	...	...
C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4=CC(=O)CC[C@H]34	2	2	2	2	2
C[C@]12CC[C@H]3[C@H]([C@@H]1CCC2=O)CC(=C)C4=CC(=O)C=C[C@]34C	2	2	2	2	2
CC(C)C[C@@H](C1(CCC1)c2ccc(cc2)Cl)[NH+](C)C	2	2	2	2	2
CC[C@](/C=C/Cl)(C#C)O	2	2	2	2	2
CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34	2	2	2	2	2

432 rows × 5 columns

repurposing_df[repurposing_df['SMILES']=="CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34"]

	SMILES	TITLE	Hybrid2	docked_fragment	Mpro-_dock	site
82	CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...	CHEMBL2107797	-9.002963	x0749	2.616094	active-covalent
105	CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...	EDRUG178	-8.705896	x0104	2.248707	active-noncovalent
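To list every duplicated structure at once rather than looking them up one by one, `duplicated(keep=False)` marks all rows that share a SMILES. This is a sketch on a toy frame: the SMILES are shortened to placeholder strings, and the first two titles mimic the duplicate shown above.

```python
import pandas as pd

# Toy frame: placeholder SMILES, titles in the style of the rows above
demo = pd.DataFrame({
    "SMILES": ["smiles_A", "smiles_A", "smiles_B"],
    "TITLE": ["CHEMBL2107797", "EDRUG178", "CHEMBL1387"],
})

# keep=False flags every member of a duplicated group, not just the repeats
dupes = demo[demo.duplicated("SMILES", keep=False)].sort_values("SMILES")
print(dupes)

# Which identifier prefixes does each duplicated structure come from?
prefixes = dupes["TITLE"].str.extract(r"^([A-Za-z]+)", expand=False)
print(prefixes.groupby(dupes["SMILES"]).unique())
```

Grouping the prefixes by SMILES makes it easy to confirm that the repeats come from different source databases.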

The two docking scores are only moderately correlated (Pearson r ≈ 0.58). If they are docking scores against different receptors, that would help explain it. Note that a correlation doesn’t ask whether the two numbers agree for each molecule, only whether the trends persist (a molecule that scores better on one measure also tends to score better on the other). A moderate r suggests the trends only partly carry over between the two docking measures.

repurposing_df[['Hybrid2', 'Mpro-_dock']].corr()

	Hybrid2	Mpro-_dock
Hybrid2	1.000000	0.581966
Mpro-_dock	0.581966	1.000000

Redocking dataframe: SMILES, names, data collection information, docking scores

redock_df.head()

	SMILES	TITLE	fragments	CompoundCode	Unnamed: 4	covalent_warhead	MountingResult	DataCollectionOutcome	DataProcessingResolutionHigh	RefinementOutcome	Deposition_PDB_ID	Hybrid2	docked_fragment	Mpro-x0500_dock	site
0	c1ccc(c(c1)NCc2ccn[nH]2)F	x0500	x0500	Z1545196403	NaN	False	OK: No comment:No comment	success	2.19	7 - Analysed & Rejected	NaN	-11.881923	x0678	-2.501554	active-noncovalent
1	Cc1ccccc1OCC(=O)Nc2ncccn2	x0415	x0415	Z53834613	NaN	False	OK: No comment:No comment	success	1.62	7 - Analysed & Rejected	NaN	-11.622278	x0678	NaN	active-noncovalent
2	Cc1csc(n1)CNC(=O)c2ccn[nH]2	x0356	x0356	Z466628048	NaN	False	OK: No comment:No comment	success	3.25	7 - Analysed & Rejected	NaN	-11.435024	x0678	NaN	active-noncovalent
3	Cc1csc(n1)CNC(=O)c2ccn[nH]2	x1113	x1113	Z466628048	NaN	False	OK: No comment:No comment	success	1.57	7 - Analysed & Rejected	NaN	-11.435024	x0678	NaN	active-noncovalent
4	c1cc(cnc1)NC(=O)CC2CCCCC2	x0678	x0678	Z31792168	NaN	False	Mounted_Clear	success	1.83	6 - Deposited	5R84	-11.355046	x0678	NaN	active-noncovalent

There don’t seem to be many Mpro docking scores in this dataset (only one molecule has a non-null Mpro docking score)

redock_df[redock_df['Mpro-x0500_dock'].isnull()].count()

SMILES                          1452
TITLE                           1452
fragments                       1452
CompoundCode                    1452
Unnamed: 4                         0
covalent_warhead                1452
MountingResult                  1452
DataCollectionOutcome           1452
DataProcessingResolutionHigh    1357
RefinementOutcome               1306
Deposition_PDB_ID                 78
Hybrid2                         1452
docked_fragment                 1452
Mpro-x0500_dock                    0
site                            1452
dtype: int64

redock_df[~redock_df['Mpro-x0500_dock'].isnull()].count()

SMILES                          1
TITLE                           1
fragments                       1
CompoundCode                    1
Unnamed: 4                      0
covalent_warhead                1
MountingResult                  1
DataCollectionOutcome           1
DataProcessingResolutionHigh    1
RefinementOutcome               1
Deposition_PDB_ID               0
Hybrid2                         1
docked_fragment                 1
Mpro-x0500_dock                 1
site                            1
dtype: int64
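If only the number of scored versus unscored rows is needed, counting on the single column is a little more direct than filtering the whole frame. A sketch, using the five `Mpro-x0500_dock` values visible in the `head()` output above as stand-in data:

```python
import numpy as np
import pandas as pd

# The five values shown in redock_df.head() above
demo = pd.Series([-2.501554, np.nan, np.nan, np.nan, np.nan], name="Mpro-x0500_dock")

n_scored = demo.notna().sum()   # rows with a docking score
n_missing = demo.isna().sum()   # rows without one
print(n_scored, n_missing)
```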

Are there overlaps in the molecules in each of these datasets?

repurpose_redock = repurposing_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))

moonshot_redock = moonshot_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))

repurpose_redock

	SMILES	TITLE_L	Hybrid2_L	docked_fragment_L	Mpro-_dock	site_L	TITLE_R	fragments	CompoundCode	Unnamed: 4	covalent_warhead	MountingResult	DataCollectionOutcome	DataProcessingResolutionHigh	RefinementOutcome	Deposition_PDB_ID	Hybrid2_R	docked_fragment_R	Mpro-x0500_dock	site_R
0	Cc1cc(=O)n([nH]1)c2ccccc2	CHEMBL290916	-7.889587	x0195	-2.068452	active-noncovalent	x0297	x0297	Z50145861	NaN	False	OK: No comment:No comment	success	1.98	7 - Analysed & Rejected	NaN	-7.889587	x0195	NaN	active-noncovalent
1	CC(C)Nc1ncccn1	CHEMBL1740513	-7.178702	x0072	-1.248482	active-noncovalent	x0583	x0583	Z31190928	NaN	False	OK: No comment:No comment	success	3.08	7 - Analysed & Rejected	NaN	-7.293537	x1093	NaN	active-noncovalent
2	CC(C)Nc1ncccn1	CHEMBL1740513	-7.178702	x0072	-1.248482	active-noncovalent	x1102	x1102	Z31190928	NaN	False	OK: No comment:No comment	success	1.46	7 - Analysed & Rejected	NaN	-7.293537	x1093	NaN	active-noncovalent
3	C[C@H](C(=O)[O-])O	CHEMBL1200559	-5.675188	x0397	-0.179049	active-noncovalent	x1035	x1035	Z1741982441	NaN	False	OK: No comment:No comment	Failed - no diffraction	NaN	NaN	NaN	-6.505556	x0397	NaN	active-noncovalent
4	CC(=O)C(=O)[O-]	DB00119	-5.448891	x0689	-0.494791	active-covalent	x1037	x1037	Z1741977082	NaN	False	OK: No comment:No comment	Failed - no diffraction	NaN	NaN	NaN	-5.448891	x0689	NaN	active-covalent
5	CCC(=O)[O-]	CHEMBL14021	-5.374838	x0397	-0.555688	active-noncovalent	x1029	x1029	Z955123616	NaN	False	OK: No comment:No comment	success	1.73	7 - Analysed & Rejected	NaN	-5.135675	x0689	NaN	active-covalent
6	C1CNCC[NH2+]1	CHEMBL1412	-5.079155	x0354	1.716032	active-noncovalent	x0996	x0996	Z1245537944	NaN	False	OK: No comment:No comment	success	1.96	7 - Analysed & Rejected	NaN	-4.675085	x0354	NaN	active-noncovalent

We joined on SMILES string, and now we can compare the docking scores between the repurposing and redocking datasets.

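Some Hybrid2 scores are identical between the two datasets, and where they differ the ranking is still preserved. Looking at the main protease score (Mpro, I believe), the ordering is only partly preserved: rows 3 to 5 come out in the reverse order from Hybrid2. Every Mpro-x0500_dock value in this join is NaN, so the two Mpro columns can’t be compared directly. More generally, docking scores aren’t transferable between different receptors (this might be a fairly obvious observation).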

repurpose_redock[['SMILES', "TITLE_L", "TITLE_R", "Hybrid2_L", "Hybrid2_R", 'Mpro-_dock', 'Mpro-x0500_dock']]

	SMILES	TITLE_L	TITLE_R	Hybrid2_L	Hybrid2_R	Mpro-_dock	Mpro-x0500_dock
0	Cc1cc(=O)n([nH]1)c2ccccc2	CHEMBL290916	x0297	-7.889587	-7.889587	-2.068452	NaN
1	CC(C)Nc1ncccn1	CHEMBL1740513	x0583	-7.178702	-7.293537	-1.248482	NaN
2	CC(C)Nc1ncccn1	CHEMBL1740513	x1102	-7.178702	-7.293537	-1.248482	NaN
3	C[C@H](C(=O)[O-])O	CHEMBL1200559	x1035	-5.675188	-6.505556	-0.179049	NaN
4	CC(=O)C(=O)[O-]	DB00119	x1037	-5.448891	-5.448891	-0.494791	NaN
5	CCC(=O)[O-]	CHEMBL14021	x1029	-5.374838	-5.135675	-0.555688	NaN
6	C1CNCC[NH2+]1	CHEMBL1412	x0996	-5.079155	-4.675085	1.716032	NaN
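Since the question here is about rankings rather than raw values, a rank correlation is a natural check. The sketch below copies the seven rows from the table above and computes a Spearman correlation by ranking each column and then taking the ordinary Pearson correlation of the ranks. With only seven rows (two of them the same molecule), this is illustrative at best.

```python
import pandas as pd

# Values copied from the table above
scores = pd.DataFrame({
    "Hybrid2_L": [-7.889587, -7.178702, -7.178702, -5.675188, -5.448891, -5.374838, -5.079155],
    "Hybrid2_R": [-7.889587, -7.293537, -7.293537, -6.505556, -5.448891, -5.135675, -4.675085],
    "Mpro-_dock": [-2.068452, -1.248482, -1.248482, -0.179049, -0.494791, -0.555688, 1.716032],
})

# Spearman correlation is the Pearson correlation of the ranks
ranks = scores.rank()
rank_corr = ranks.corr()
print(rank_corr)
```

The two Hybrid2 columns have exactly the same rank order, so their rank correlation is 1. The Mpro column agrees with them on the extremes but not in the middle, so its rank correlation with Hybrid2 is positive but below 1.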

Joining the moonshot submission and redocking datasets yields very few overlapping molecules (three rows, covering two unique SMILES).

moonshot_redock

	SMILES	CID	creator	fragments_L	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead_L	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED	TITLE	fragments_R	CompoundCode	Unnamed: 4	covalent_warhead_R	MountingResult	DataCollectionOutcome	DataProcessingResolutionHigh	RefinementOutcome	Deposition_PDB_ID	Hybrid2	docked_fragment	Mpro-x0500_dock	site
0	CC(C)Nc1cccnc1	MAK-UNK-2c1752f0-4	Maksym Voznyy	x1093	https://covid.postera.ai/covid/submissions/MAK...	FALSE	Z2574930241	EN300-56005	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	136.198	1.9019	1	2	24.92	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	x1098	x1098	Z1259341037	NaN	False	OK: No comment:No comment	success	1.66	7 - Analysed & Rejected	NaN	-7.474369	x0678	NaN	active-noncovalent
1	CC(C)Nc1cccnc1	MAK-UNK-2c1752f0-4	Maksym Voznyy	x1093	https://covid.postera.ai/covid/submissions/MAK...	FALSE	Z2574930241	EN300-56005	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	136.198	1.9019	1	2	24.92	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	x0572	x0572	Z1259341037	NaN	False	OK: No comment:No comment	success	2.98	7 - Analysed & Rejected	NaN	-7.474369	x0678	NaN	active-noncovalent
2	CCS(=O)(=O)Nc1ccccc1F	MAK-UNK-2c1752f0-5	Maksym Voznyy	x1093	https://covid.postera.ai/covid/submissions/MAK...	FALSE	Z53825177	EN300-116204	FALSE	False	True	False	False	False	False	False	False	False	False	False	False	False	203.238	1.5873	1	2	46.17	PASS	PASS	PASS	PASS	PASS	Hetero_hetero	PASS	PASS	PASS	False	False	False	x0247	x0247	Z53825177	NaN	False	OK: No comment:No comment	success	1.83	7 - Analysed & Rejected	NaN	-7.413380	x0678	NaN	active-noncovalent
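One caveat with these joins: they match SMILES as raw strings, and the same molecule can be written as more than one valid SMILES string. The toy sketch below uses `indicator=True` to show which rows matched, and includes ethanol written two ways (`CCO` and `OCC`) to show how a string join misses it. The `left` and `right` frames are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"SMILES": ["CC(C)Nc1ncccn1", "CCC(=O)[O-]", "CCO"]})
right = pd.DataFrame({"SMILES": ["CC(C)Nc1ncccn1", "CCC(=O)[O-]", "OCC"]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
overlap = left.merge(right, on="SMILES", how="outer", indicator=True)
print(overlap["_merge"].value_counts())
```

Only two rows land in `both`, even though all three molecules are shared. Converting every SMILES to a canonical form with a cheminformatics toolkit before joining would be one way to catch cases like this (not shown here), so the small overlaps above may be an undercount.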

Comparing other databases

CHEMBL, DrugBank, and “EDrug”(?) look to be the 3 prefixes in the “TITLE” column

from chembl_webresource_client.new_client import new_client
molecule = new_client.molecule
res = molecule.search('CHEMBL1387')

res_df = pd.DataFrame.from_dict(res)

res_df.columns

Index(['atc_classifications', 'availability_type', 'biotherapeutic',
       'black_box_warning', 'chebi_par_id', 'chirality', 'cross_references',
       'dosed_ingredient', 'first_approval', 'first_in_class', 'helm_notation',
       'indication_class', 'inorganic_flag', 'max_phase', 'molecule_chembl_id',
       'molecule_hierarchy', 'molecule_properties', 'molecule_structures',
       'molecule_synonyms', 'molecule_type', 'natural_product', 'oral',
       'parenteral', 'polymer_flag', 'pref_name', 'prodrug', 'score',
       'structure_type', 'therapeutic_flag', 'topical', 'usan_stem',
       'usan_stem_definition', 'usan_substem', 'usan_year', 'withdrawn_class',
       'withdrawn_country', 'withdrawn_flag', 'withdrawn_reason',
       'withdrawn_year'],
      dtype='object')

res_df[['chirality', 'molecule_properties', 'molecule_structures', 'score']]

	chirality	molecule_properties	molecule_structures	score
0	1	{'alogp': '3.64', 'aromatic_rings': 0, 'cx_log...	{'canonical_smiles': 'C#C[C@]1(O)CC[C@H]2[C@@H...	17.0

res_df[['molecule_properties']].values[0]

array([{'alogp': '3.64', 'aromatic_rings': 0, 'cx_logd': '2.81', 'cx_logp': '2.81', 'cx_most_apka': None, 'cx_most_bpka': None, 'full_molformula': 'C20H26O2', 'full_mwt': '298.43', 'hba': 2, 'hba_lipinski': 2, 'hbd': 1, 'hbd_lipinski': 1, 'heavy_atoms': 22, 'molecular_species': None, 'mw_freebase': '298.43', 'mw_monoisotopic': '298.1933', 'num_lipinski_ro5_violations': 0, 'num_ro5_violations': 0, 'psa': '37.30', 'qed_weighted': '0.55', 'ro3_pass': 'N', 'rtb': 0}],
      dtype=object)

res_df['molecule_properties'].apply(pd.Series)

	alogp	aromatic_rings	cx_logd	cx_logp	cx_most_apka	cx_most_bpka	full_molformula	full_mwt	hba	hba_lipinski	hbd	hbd_lipinski	heavy_atoms	molecular_species	mw_freebase	mw_monoisotopic	num_lipinski_ro5_violations	num_ro5_violations	psa	qed_weighted	ro3_pass	rtb
0	3.64	0	2.81	2.81	None	None	C20H26O2	298.43	2	2	1	1	22	None	298.43	298.1933	0	0	37.30	0.55	N	0

all_results = [molecule.search(a) for a in repurposing_df['TITLE']]

Here’s a big Python function tangent.

For each ChEMBL ID, we’ve searched the ChEMBL database, which returns a list (of length 1) containing a dictionary of properties.

All molecules have been compiled into a list, so we have a list of lists of dictionaries.

For sanity, we can use a Python filter to retain only the non-empty results (an unmatched search appears to come back empty rather than as None, so the check is on length).

We can chain that with a Python map function to pull the first item from each molecule’s list. Recall that each molecule’s result is a list with just one element, a dictionary. This boils each result down to just the dictionary (eliminating the list wrapper).

For validation, I’ve called next to look at the first result.

filtered = map(lambda x: x[0], filter(lambda x: len(x) > 0, all_results))

next(filtered)

{'atc_classifications': [],
 'availability_type': -1,
 'biotherapeutic': None,
 'black_box_warning': 0,
 'chebi_par_id': None,
 'chirality': 0,
 'cross_references': [],
 'dosed_ingredient': False,
 'first_approval': None,
 'first_in_class': 0,
 'helm_notation': None,
 'indication_class': 'Anti-Inflammatory',
 'inorganic_flag': 0,
 'max_phase': 0,
 'molecule_chembl_id': 'CHEMBL2104122',
 'molecule_hierarchy': {'molecule_chembl_id': 'CHEMBL2104122',
  'parent_chembl_id': 'CHEMBL2104122'},
 'molecule_properties': {'alogp': '3.45',
  'aromatic_rings': 2,
  'cx_logd': '1.26',
  'cx_logp': '3.92',
  'cx_most_apka': '4.68',
  'cx_most_bpka': None,
  'full_molformula': 'C16H14O2',
  'full_mwt': '238.29',
  'hba': 1,
  'hba_lipinski': 2,
  'hbd': 1,
  'hbd_lipinski': 1,
  'heavy_atoms': 18,
  'molecular_species': 'ACID',
  'mw_freebase': '238.29',
  'mw_monoisotopic': '238.0994',
  'num_lipinski_ro5_violations': 0,
  'num_ro5_violations': 0,
  'psa': '37.30',
  'qed_weighted': '0.74',
  'ro3_pass': 'N',
  'rtb': 2},
 'molecule_structures': {'canonical_smiles': 'CC(C(=O)O)c1ccc2c(c1)Cc1ccccc1-2',
  'molfile': '\n     RDKit          2D\n\n 18 20  0  0  0  0  0  0  0  0999 V2000\n   -0.5375    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.5375    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4458    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4458    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.3625    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4875   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.4125   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.3292    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.4125    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3417   -0.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.3625    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.3500    1.1958    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n    4.2167   -0.6292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.3958    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.3958   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3417   -1.6417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.3458    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.3458    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  2  1  2  0\n  3  2  1  0\n  4  6  1  0\n  5  7  2  0\n  6  1  1  0\n  7  1  1  0\n  8 10  1  0\n  9  2  1  0\n 10  5  1  0\n 11  5  1  0\n 12  8  2  0\n 13  8  1  0\n 14  3  1  0\n 15  4  1  0\n 16 10  1  0\n 17 14  2  0\n 18 15  2  0\n  3  4  2  0\n  9 11  2  0\n 17 18  1  0\nM  END\n\n> <chembl_id>\nCHEMBL2104122\n\n> <chembl_pref_name>\nCICLOPROFEN\n\n',
  'standard_inchi': 'InChI=1S/C16H14O2/c1-10(16(17)18)11-6-7-15-13(8-11)9-12-4-2-3-5-14(12)15/h2-8,10H,9H2,1H3,(H,17,18)',
  'standard_inchi_key': 'LRXFKKPEBXIPMW-UHFFFAOYSA-N'},
 'molecule_synonyms': [{'molecule_synonym': 'Cicloprofen',
   'syn_type': 'BAN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'Cicloprofen',
   'syn_type': 'INN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'Cicloprofen',
   'syn_type': 'USAN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'SQ-20824',
   'syn_type': 'RESEARCH_CODE',
   'synonyms': 'SQ 20824'}],
 'molecule_type': 'Small molecule',
 'natural_product': 0,
 'oral': False,
 'parenteral': False,
 'polymer_flag': False,
 'pref_name': 'CICLOPROFEN',
 'prodrug': 0,
 'score': 16.0,
 'structure_type': 'MOL',
 'therapeutic_flag': False,
 'topical': False,
 'usan_stem': '-profen',
 'usan_stem_definition': 'anti-inflammatory/analgesic agents (ibuprofen type)',
 'usan_substem': '-profen',
 'usan_year': 1974,
 'withdrawn_class': None,
 'withdrawn_country': None,
 'withdrawn_flag': False,
 'withdrawn_reason': None,
 'withdrawn_year': None}
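The filter-then-map chain is easier to see on toy data. In this sketch `toy_results` is a hypothetical stand-in for `all_results`, with one empty entry to represent a search that matched nothing; it also shows the equivalent list comprehension.

```python
# Toy stand-in for all_results: one list per query, some of them empty
toy_results = [
    [{"molecule_chembl_id": "CHEMBL2104122", "molecule_properties": {"hba": 1}}],
    [],  # a query that matched nothing
    [{"molecule_chembl_id": "CHEMBL1387", "molecule_properties": {"hba": 2}}],
]

# Lazy version: drop empty hits, then unwrap the single-element lists
lazy = map(lambda hits: hits[0], filter(lambda hits: len(hits) > 0, toy_results))
first = next(lazy)  # only the first surviving item is computed here

# Equivalent list comprehension, evaluated eagerly
eager = [hits[0] for hits in toy_results if len(hits) > 0]

print(first["molecule_chembl_id"], len(eager))
```

Both `map` and `filter` return lazy iterators, so nothing is unwrapped until `next` (or a loop) asks for an item. The list comprehension does the same work up front and is usually easier to read.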

For now, I’m only really interested in the molecule_properties dictionary

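# Keep each title next to its properties so rows stay aligned when a search returns nothing
hits = [(title, res[0]['molecule_properties'])
        for title, res in zip(repurposing_df['TITLE'], all_results)
        if len(res) > 0]

chembl_df = pd.DataFrame([props for _, props in hits])
chembl_df['TITLE'] = [title for title, _ in hits]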

Molecular properties contained in the chembl database

Here are the definitions I can dig up

alogp: calculated partition coefficient (lipophilicity)
aromatic_rings: number of aromatic rings
cx_logd: distribution coefficient, taking into account ionized and non-ionized forms (cx_logp is the matching partition coefficient; the cx_ prefix appears to mark ChemAxon-calculated values)
cx_most_apka: most acidic pKa
cx_most_bpka: most basic pKa
full_mwt: molecular weight (mw_freebase and mw_monoisotopic give the free-base and monoisotopic masses)
hba: hydrogen bond acceptors (and hba_lipinski for the Lipinski definition)
hbd: hydrogen bond donors (and hbd_lipinski for the Lipinski definition)
heavy_atoms: number of heavy (non-hydrogen) atoms
num_lipinski_ro5_violations: number of violations of Lipinski’s rule of five
num_ro5_violations: not sure how this differs; possibly the same rule-of-five count using the hba/hbd definitions instead of the Lipinski ones (the two columns are perfectly correlated in this dataset)
psa: polar surface area
qed_weighted: “quantitative estimate of druglikeness” (ranges between 0 and 1, with 1 being more favorable). This is based on a weighted geometric mean of desirability functions for several molecular properties
ro3_pass: whether the molecule passes the rule of three
rtb: number of rotatable bonds
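For reference, the rule of five itself is simple enough to sketch by hand. The function below is a hypothetical helper, not ChEMBL’s implementation; which logP and which donor/acceptor counts ChEMBL feeds into each of its two violation columns is an assumption here.

```python
def ro5_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count violations of Lipinski's rule of five."""
    checks = [
        mw > 500,    # molecular weight above 500 Da
        logp > 5,    # partition coefficient above 5
        hbd > 5,     # more than 5 hydrogen bond donors
        hba > 10,    # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)

# Property values for CHEMBL2104122 as returned above (strings converted to float)
props = {"full_mwt": "238.29", "alogp": "3.45", "hbd_lipinski": 1, "hba_lipinski": 2}
n = ro5_violations(float(props["full_mwt"]), float(props["alogp"]),
                   props["hbd_lipinski"], props["hba_lipinski"])
print(n)
```

For this molecule the sketch finds no violations, which agrees with the `num_lipinski_ro5_violations` value of 0 in the ChEMBL record above.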

chembl_df.head()

	alogp	aromatic_rings	cx_logd	cx_logp	cx_most_apka	cx_most_bpka	full_molformula	full_mwt	hba	hba_lipinski	hbd	hbd_lipinski	heavy_atoms	molecular_species	mw_freebase	mw_monoisotopic	psa	qed_weighted	ro3_pass	rtb	TITLE
0	3.45	2.0	1.26	3.92	4.68	None	C16H14O2	238.29	1.0	2.0	1.0	1.0	18.0	ACID	238.29	238.0994	37.30	0.74	N	2.0	CHEMBL2104122
1	3.64	0.0	2.81	2.81	None	None	C20H26O2	298.43	2.0	2.0	1.0	1.0	22.0	None	298.43	298.1933	37.30	0.55	N	0.0	CHEMBL1387
2	3.92	1.0	4.25	4.25	10.15	2.86	C18H24N2O2S	332.47	4.0	4.0	2.0	3.0	23.0	NEUTRAL	332.47	332.1558	75.68	0.76	N	1.0	CHEMBL275835
3	4.31	0.0	4.04	4.04	None	None	C20H28O	284.44	1.0	1.0	1.0	1.0	21.0	None	284.44	284.2140	20.23	0.52	N	0.0	CHEMBL2104104
4	4.79	0.0	3.96	3.96	None	None	C21H28O2	312.45	2.0	2.0	0.0	0.0	23.0	None	312.45	312.2089	34.14	0.70	N	1.0	CHEMBL2104231

chembl_df.columns

Index(['alogp', 'aromatic_rings', 'cx_logd', 'cx_logp', 'cx_most_apka',
       'cx_most_bpka', 'full_molformula', 'full_mwt', 'hba', 'hba_lipinski',
       'hbd', 'hbd_lipinski', 'heavy_atoms', 'molecular_species',
       'mw_freebase', 'mw_monoisotopic', 'num_lipinski_ro5_violations',
       'num_ro5_violations', 'psa', 'qed_weighted', 'ro3_pass', 'rtb',
       'TITLE'],
      dtype='object')

chembl_df.corr(numeric_only=True)  # pandas 2.0+ raises on non-numeric columns unless they are excluded

	aromatic_rings	hba	hba_lipinski	hbd	hbd_lipinski	heavy_atoms	num_lipinski_ro5_violations	num_ro5_violations	rtb
aromatic_rings	1.000000	0.192569	0.178507	0.014928	0.036106	0.249022	0.031094	0.031094	0.229124
hba	0.192569	1.000000	0.868859	0.084553	0.054409	0.451560	-0.047705	-0.047705	-0.023690
hba_lipinski	0.178507	0.868859	1.000000	0.348600	0.294276	0.295864	-0.070783	-0.070783	0.021812
hbd	0.014928	0.084553	0.348600	1.000000	0.935710	-0.172866	-0.060462	-0.060462	0.040505
hbd_lipinski	0.036106	0.054409	0.294276	0.935710	1.000000	-0.211899	-0.085660	-0.085660	0.084225
heavy_atoms	0.249022	0.451560	0.295864	-0.172866	-0.211899	1.000000	0.397240	0.397240	0.259011
num_lipinski_ro5_violations	0.031094	-0.047705	-0.070783	-0.060462	-0.085660	0.397240	1.000000	1.000000	0.345308
num_ro5_violations	0.031094	-0.047705	-0.070783	-0.060462	-0.085660	0.397240	1.000000	1.000000	0.345308
rtb	0.229124	-0.023690	0.021812	0.040505	0.084225	0.259011	0.345308	0.345308	1.000000

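At a glance, the only strong linear correlations are between near-duplicate descriptors (hba vs. hba_lipinski, hbd vs. hbd_lipinski, and the two rule-of-five counts), plus a moderate one between heavy_atoms and hba (0.45). Note that the properties ChEMBL returns as strings (alogp, cx_logp, the pKas, full_mwt, psa, qed_weighted) are missing from this matrix, so it says nothing yet about partition coefficients or molecular weight.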
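Several ChEMBL properties come back as strings (for example `'alogp': '3.45'`), which is why they never make it into a correlation matrix. A minimal sketch of converting them first, using a toy frame with values in the style of the records above; the `numeric_like` column list is an assumption about which columns are worth converting.

```python
import pandas as pd

# Toy frame mimicking the mixed types ChEMBL returns
demo = pd.DataFrame({
    "alogp": ["3.45", "3.64", "3.92", None],
    "full_mwt": ["238.29", "298.43", "332.47", "284.44"],
    "hba": [1, 2, 4, 1],
    "ro3_pass": ["N", "N", "N", "N"],
})

numeric_like = ["alogp", "full_mwt", "hba"]
# errors="coerce" turns anything unparseable into NaN instead of raising
converted = demo[numeric_like].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted.corr())
```

After conversion, the partition coefficient and molecular weight columns take part in `corr()` alongside the integer counts.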

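corr_df = chembl_df.corr(numeric_only=True)
cols = corr_df.columns  # label only the columns that are actually in the matrix

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)

ax.imshow(corr_df, cmap='RdBu')

# Set tick positions explicitly so every label lines up with its row/column
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols)
ax.tick_params(axis='x', rotation=90)

ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(corr_df.iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")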

[Figure: correlation heatmap of the numeric ChEMBL properties]

Maybe there are higher-order correlations and relationships better suited to clustering and decomposition.

cols = ['aromatic_rings', 'cx_logp',  'full_mwt', 'hba']
cleaned = (chembl_df[~chembl_df[cols]
                     .isnull()
                     .all(axis='columns', skipna=False)][cols]
           .astype('float')
           .fillna(0, axis='columns'))

from sklearn import preprocessing

normalized = preprocessing.scale(cleaned)
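As far as I understand it, `preprocessing.scale` standardizes each column to zero mean and unit variance. The NumPy sketch below does the same arithmetic by hand on a made-up matrix, so it is clearer what the t-SNE step actually receives; `X` is illustrative data, not the real `cleaned` frame.

```python
import numpy as np

# Made-up feature matrix: rows are molecules, columns are descriptors
X = np.array([[2.0, 238.29],
              [0.0, 298.43],
              [1.0, 332.47]])

# Subtract each column's mean and divide by its population std (ddof=0)
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(standardized.mean(axis=0))  # close to 0 for every column
print(standardized.std(axis=0))   # close to 1 for every column
```

Without this step, a feature like molecular weight (hundreds) would dominate distances over a feature like aromatic ring count (single digits).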

There appear to be maybe 4 clusters among the compounds examined by the covid-moonshot group (t-SNE is stochastic, so the exact layout changes from run to run).

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(1,1)
ax.scatter(output[:,0], output[:,1])
ax.set_title("Aromatic rings, cx_logp, mwt, hba")

[Figure: t-SNE embedding titled “Aromatic rings, cx_logp, mwt, hba”]

By taking turns leaving out some features, it looks like leaving out either aromatic rings or hydrogen bond acceptors diminishes the cluster distinction.

Aromatic rings are large, rigid components of small molecules, so it makes sense that a chunk of the behavior tracks the number of aromatic rings. Similarly, hydrogen bond acceptors (heteroatoms such as N and O) contribute van der Waals and electrostatic interactions. Both features are also small integer counts, which may be part of why they split the points into discrete groups. Left with only weight and partition coefficient, the behavior is mainly continuous.

def clean_df(cols):
    cleaned = (chembl_df[~chembl_df[cols]
                     .isnull()
                     .all(axis='columns', skipna=False)][cols]
           .astype('float')
           .fillna(0, axis='columns'))

    normalized = preprocessing.scale(cleaned)
    
    return normalized

cols = ['cx_logp',  'full_mwt', 'hba']
normalized = clean_df(cols)
tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(3,1, figsize=(8,8))
ax[0].scatter(output[:,0], output[:,1])
ax[0].set_title("cx_logp, mwt, hba")

cols = ['cx_logp',  'full_mwt', 'aromatic_rings']
normalized = clean_df(cols)

tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)

ax[1].scatter(output[:,0], output[:,1])
ax[1].set_title("aromatic_rings, cx_logp, mwt")

cols = ['cx_logp',  'full_mwt']
normalized = clean_df(cols)

ax[2].scatter(normalized[:,0], normalized[:,1])
ax[2].set_title("cx_logp, mwt")

fig.tight_layout()

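[Figure: three panels titled “cx_logp, mwt, hba” (t-SNE), “aromatic_rings, cx_logp, mwt” (t-SNE), and “cx_logp, mwt” (standardized scatter)]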

DrugBank

I found that someone had already downloaded the database. I may repeat this analysis on these dataframes, querying the DrugBank dataset rather than ChEMBL.

Some docking data

We have some smiles strings, molecular properties, docking scores, and information about the docking fragments

moonshot = pd.read_csv('moonshot-submissions/covid_submissions_all_info-docked-overlap.csv')

moonshot

	SMILES	TITLE	creator	fragments	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	num_criterion_violations	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED	Hybrid2	docked_fragment	Mpro-x1418_dock	site	number_of_overlapping_fragments	overlapping_fragments	overlap_score	volume
0	c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl	MAK-UNK-9e4a73aa-2	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	366.779	4.51890	0	3	50.27	0	PASS	beta-keto/anhydride	PASS	PASS	PASS	Ketone, Dye 11	PASS	PASS	PASS	False	False	False	-11.881256	x1418	1.206534	active-covalent	3	x0434,x0678,x0830	3.208124	271.986084
1	Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N	KIM-UNI-60f168f5-7	Kim Tai Tran, University of Copenhagen	x0107,x0991	https://covid.postera.ai/covid/submissions/KIM...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	297.362	1.22949	2	5	88.00	0	PASS	imine, imine	PASS	PASS	acyclic C=N-H	Imine 3	PASS	PASS	PASS	False	False	False	-11.654112	x0107	NaN	active-noncovalent	3	x0107,x1412,x1392	4.753475	232.815506
2	c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl	MAK-UNK-9e4a73aa-14	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	368.755	2.72410	0	6	69.78	0	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	-10.460650	x0678	2.716276	active-noncovalent	3	x0678,x1412,x1392	5.520980	266.688721
3	Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...	AUS-WAB-916db9c0-1	Austin D. Chivington, Wabash College	x0107,x1077,x1374	https://covid.postera.ai/covid/submissions/AUS...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	351.450	3.51932	1	5	57.95	0	non_ring_acetal	het-C-het not in ring	PASS	Filter10_Terminal_vinyl	PASS	PASS	PASS	PASS	PASS	False	False	False	-9.516450	x0678	NaN	active-noncovalent	3	x0434,x0831,x0678	3.446572	284.195312
4	c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O	DRV-DNY-ae159ed1-12	Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...	x1249	https://covid.postera.ai/covid/submissions/DRV...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	276.295	3.23150	1	4	63.08	0	PASS	PASS	PASS	Filter44_michael_acceptor2	PASS	Ketone, Dye 9, vinyl michael acceptor1	PASS	PASS	PASS	False	False	False	-9.243208	x0678	NaN	active-noncovalent	3	x0434,x0678,x0830	2.865147	220.275421
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4630	C[C@H]([C@@H](C(=O)N[C@H](Cc1ccccc1)C(=O)N[C@@...	PAU-UNI-6d15a9f5-4	paul brear, University of cambridge	x1086	https://covid.postera.ai/covid/submissions/PAU...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	714.821	-0.91270	8	11	256.10	4	PASS	PASS	PASS	PASS	PASS	Long aliphatic chain, Dipeptide	PASS	PASS	PASS	False	False	False	3.175111	x0305	NaN	active-noncovalent	0	NaN	5.297134	548.583191
4631	c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...	MAK-UNK-e05327b2-2	Maksym Voznyy	x1402	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	True	False	False	True	False	False	False	False	False	837.964	6.63190	0	9	98.31	2	PASS	PASS	PASS	PASS	PASS	Hetero_hetero	PASS	PASS	PASS	False	False	False	3.561681	x1392	NaN	active-covalent	0	NaN	3.297014	591.877563
4632	Cc1cccc(c1)C[NH+]2CCN(CC2)C(=O)c3ccc(cc3)C#Cc4...	MAK-UNK-e4a48a85-16	Maksym Voznyy	x0387,x0692	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	574.794	6.18892	0	5	39.68	2	PASS	triple bond	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	4.056698	x0978	NaN	active-covalent	0	NaN	4.360606	470.944824
4633	c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...	MAK-UNK-e05327b2-6	Maksym Voznyy	x1402	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	True	False	False	True	False	False	False	False	False	990.183	5.19160	0	12	138.93	3	alpha_halo_heteroatom, secondary_halide_sulfate	PASS	PASS	PASS	PASS	Hetero_hetero	PASS	Dithiomethylene_acetal	Alkyl Halide	False	False	False	4.242827	x0731	NaN	active-covalent	0	NaN	4.193186	694.333069
4634	Cc1cccc(c1)C[NH+]2CCN(CC2)c3cc(c(c(c3)Cl)c4cc5...	MAK-UNK-e4a48a85-15	Maksym Voznyy	x0387,x0692	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	659.687	7.36362	1	7	68.36	2	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	5.966927	x0705	NaN	active-covalent	0	NaN	1.473711	503.583801

4635 rows × 48 columns

moonshot.head(5)

	SMILES	TITLE	creator	fragments	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED	Hybrid2	docked_fragment	Mpro-x1418_dock	site	number_of_overlapping_fragments	overlapping_fragments	overlap_score	volume
0	c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl	MAK-UNK-9e4a73aa-2	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	366.779	4.51890	0	3	50.27	PASS	beta-keto/anhydride	PASS	PASS	PASS	Ketone, Dye 11	PASS	PASS	PASS	False	False	False	-11.881256	x1418	1.206534	active-covalent	3	x0434,x0678,x0830	3.208124	271.986084
1	Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N	KIM-UNI-60f168f5-7	Kim Tai Tran, University of Copenhagen	x0107,x0991	https://covid.postera.ai/covid/submissions/KIM...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	297.362	1.22949	2	5	88.00	PASS	imine, imine	PASS	PASS	acyclic C=N-H	Imine 3	PASS	PASS	PASS	False	False	False	-11.654112	x0107	NaN	active-noncovalent	3	x0107,x1412,x1392	4.753475	232.815506
2	c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl	MAK-UNK-9e4a73aa-14	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	368.755	2.72410	0	6	69.78	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	-10.460650	x0678	2.716276	active-noncovalent	3	x0678,x1412,x1392	5.520980	266.688721
3	Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...	AUS-WAB-916db9c0-1	Austin D. Chivington, Wabash College	x0107,x1077,x1374	https://covid.postera.ai/covid/submissions/AUS...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	351.450	3.51932	1	5	57.95	non_ring_acetal	het-C-het not in ring	PASS	Filter10_Terminal_vinyl	PASS	PASS	PASS	PASS	PASS	False	False	False	-9.516450	x0678	NaN	active-noncovalent	3	x0434,x0831,x0678	3.446572	284.195312
4	c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O	DRV-DNY-ae159ed1-12	Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...	x1249	https://covid.postera.ai/covid/submissions/DRV...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	276.295	3.23150	1	4	63.08	PASS	PASS	PASS	Filter44_michael_acceptor2	PASS	Ketone, Dye 9, vinyl michael acceptor1	PASS	PASS	PASS	False	False	False	-9.243208	x0678	NaN	active-noncovalent	3	x0434,x0678,x0830	2.865147	220.275421

moonshot['Mpro-x1418_dock'].isnull().sum() # Lots of missing Mpro dock scores

While there are many different fragments against which a small molecule can be docked, the sites fall into two “classes”: active-covalent and active-noncovalent (possibly referring to sites that covalently bond?).

This presents a way to logically bisect the data based on some fundamental chemistry of the binding pocket.
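
As a minimal sketch of that split, the snippet below uses a tiny made-up dataframe with the same column names (docked_fragment, site, Hybrid2). The values are illustrative assumptions, not rows from the real file. It builds one sub-dataframe per site and cross-tabulates fragments against sites.

```python
import pandas as pd

# Toy stand-in for the moonshot dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "docked_fragment": ["x0678", "x0678", "x1418", "x0107", "x1418"],
    "site": ["active-noncovalent", "active-noncovalent", "active-covalent",
             "active-noncovalent", "active-covalent"],
    "Hybrid2": [-10.5, -9.5, -11.9, -11.7, -8.0],
})

# One sub-dataframe per site class
by_site = {site: group for site, group in toy.groupby("site")}

# Count how often each fragment appears under each site class
fragment_by_site = pd.crosstab(toy["docked_fragment"], toy["site"])
print(fragment_by_site)
```

On the real data, the same crosstab would show whether each docked fragment belongs to a single site class or shows up under both.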

moonshot['docked_fragment'].value_counts()

x0678    940
x0749    771
x0104    347
x0831    283
x0830    281
x0195    269
x0161    252
x0107    201
x0072    172
x1077    127
x1392    107
x1093    107
x0434    105
x0874     81
x1385     69
x1418     58
x1334     50
x0967     46
x0397     42
x0946     38
x0692     37
x0759     37
x1386     35
x0395     29
x0305     24
x1311     16
x0708     13
x0774     12
x1380     10
x1412      7
x1374      7
x1348      6
x0770      5
x1249      5
x0387      5
x0736      4
x0705      4
x1358      3
x0426      3
x1375      3
x0734      3
x0540      3
x0354      3
x1382      3
x0755      1
x1458      1
x0689      1
x0769      1
x0981      1
x0978      1
x0731      1
x1493      1
x0771      1
x1478      1
x1384      1
x1351      1
Name: docked_fragment, dtype: int64

moonshot['site'].value_counts()

active-noncovalent    2799
active-covalent       1836
Name: site, dtype: int64

We can examine the same correlations, but now for each type of site, and look at the hybrid docking score correlations.

The biggest differences between the two sites appear in the partition coefficient and the number of hydrogen bond donors, but the correlations are still extremely weak.

site_type = 'active-noncovalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
# Compute the correlation matrix once and reuse it for the heatmap and the labels
corr = moonshot[moonshot['site']==site_type][cols].corr()
ax.matshow(corr, cmap='RdBu')

ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols)

ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)

# matshow draws row i at y=i and column j at x=j
for i, (rowname, row) in enumerate(corr.iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")
ax.set_title(f"Docking to {site_type}")

Text(0.5, 1.05, 'Docking to active-noncovalent')

png

site_type = 'active-covalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
# Compute the correlation matrix once and reuse it for the heatmap and the labels
corr = moonshot[moonshot['site']==site_type][cols].corr()
ax.matshow(corr, cmap='RdBu')

ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols)

ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)

# matshow draws row i at y=i and column j at x=j
for i, (rowname, row) in enumerate(corr.iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")
ax.set_title(f"Docking to {site_type}")

Text(0.5, 1.05, 'Docking to active-covalent')

png
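
The two heatmaps are easier to compare when only the Hybrid2 column is kept and the sites sit side by side. Here is a rough sketch of that on made-up numbers (the toy values and the props list are assumptions for illustration), using DataFrame.corrwith to correlate each property with the docking score within each site.

```python
import pandas as pd

# Toy data with the same column names as the moonshot dataframe (values are made up)
toy = pd.DataFrame({
    "site": ["active-covalent"] * 4 + ["active-noncovalent"] * 4,
    "MW": [300.0, 350.0, 400.0, 450.0, 280.0, 320.0, 360.0, 500.0],
    "cLogP": [1.0, 2.5, 2.0, 4.0, 3.0, 1.5, 2.2, 0.5],
    "Hybrid2": [-9.0, -8.0, -7.5, -6.0, -10.0, -9.5, -8.0, -7.0],
})
props = ["MW", "cLogP"]

# For each site, correlate every property with the docking score
corr_by_site = pd.DataFrame({
    site: group[props].corrwith(group["Hybrid2"])
    for site, group in toy.groupby("site")
})
print(corr_by_site.round(2))
```

The result has one row per property and one column per site, so differences in sign or magnitude between the sites can be read straight across a row.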

In general, lower docking scores seem better, so the noncovalent sites might present more favorable binding locations (see the histogram below). This seems non-intuitive because, if active-covalent really means sites that bond covalently, then covalent bonds would seem more energetically favorable than non-covalent interactions. Alternatively, forming covalent bonds might suggest an unstable region of the complex that could be shielded from the surroundings, inhibiting any sort of small molecule from binding the pocket? Expert opinion would be much appreciated here.

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
covalent_mean = moonshot[moonshot['site']=='active-covalent']['Hybrid2'].mean()
noncovalent_mean = moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'].mean()

ax.hist(moonshot[moonshot['site']=='active-covalent']['Hybrid2'], alpha=0.5, 
        label=f'active-covalent (mean={covalent_mean:.3f})')
ax.hist(moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'], alpha=0.5, 
        label=f'active-noncovalent (mean={noncovalent_mean:.3f})')

ax.set_title(f"Hybrid2 histogram")
ax.set_xlabel("Hybrid2 score")
ax.legend()

<matplotlib.legend.Legend at 0x7fac6b459850>

png
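
Since the Hybrid2 scores have a tail of positive values, the mean alone can be misleading. A small sketch of a per-site summary that also reports the median and the count is below. The toy values are assumptions for illustration, not the real scores.

```python
import pandas as pd

# Toy docking scores per site (made-up values)
toy = pd.DataFrame({
    "site": ["active-covalent", "active-covalent", "active-covalent",
             "active-noncovalent", "active-noncovalent", "active-noncovalent"],
    "Hybrid2": [-6.0, -4.0, 2.0, -9.0, -7.0, -5.0],
})

# Count, mean and median of the docking score within each site class
summary = toy.groupby("site")["Hybrid2"].agg(["count", "mean", "median"])
print(summary)
```

If the mean and median disagree noticeably for one site, a handful of poorly docked molecules is probably pulling the mean.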

from rdkit import Chem
from rdkit.Chem import Draw  # Draw may not be loaded by "from rdkit import Chem" alone

# Ten best (most negative) Hybrid2 docking scores
top10 = moonshot.sort_values('Hybrid2', ascending=True).head(10)
rdkit_mols = [Chem.MolFromSmiles(a) for a in top10['SMILES']]
scores = [f"{a:.3f}" for a in top10['Hybrid2']]

img = Draw.MolsToGridImage(rdkit_mols, molsPerRow=5, subImgSize=(200,200),
                           legends=scores)

img

png


Alex H. Yang
