Digging through some Folding@Home data
Learning cheminformatics from some Folding@Home data
2020-05-06 - 2020-05-11
I have no formal training in cheminformatics, so I am going to be stumbling and learning as I wade through this dataset. I welcome any learning lessons from experts.
This will be an ongoing foray
Source: https://github.com/FoldingAtHome/covid-moonshot
Introduction
Folding@Home is a distributed computing project - allowing molecular simulations to be run in parallel across thousands of different computers with minimal communication. This, combined with other molecular modeling methods, has yielded a lot of open data for others to examine. In particular, I'm interested in the docking screens and compounds targeted by the F@H and postera collaborations
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
pd.options.display.max_columns = 999
moonshot_df = pd.read_csv('moonshot-submissions/covid_submissions_all_info.csv')
moonshot_df.head()
The moonshot data has a lot of logging/metadata information, some one-hot-encoding information about functional groups, and some additional columns about Glaxo, Dundee, BMS, Lint, PAINS, SureChEMBL - I'm not sure what those additional coluns mean, but the values are binary values, possibly the results of some other test or availability in another databases.
I'm going to focus on the molecular properties: MW, cLogP, HBD, HBA, TPSA
- MW: Molecular Weight
- cLogP: The logarithm of the partition coefficient (ratio of concentrations in octanol vs water, $\log{\frac{c_{octanol}}{c_{water}}}$)
- HBD: Hydrogen bond donors
- HBA: Hydrogen bond acceptors
- TPSA: Topological polar surface area
Some of the correlations make some chemical sense - heavier molecules have more heavy atoms (O, N, F, etc.), but these heavier atoms are also the hydrogen bond acceptors. By that logic, more heavy atoms also coincides with more electronegative atoms, increasing your TPSA. It's a little convoluted because TPSA looks at the surface, not necessarily the volume of the compound; geometry/shape will influence TPSA. There don't appear to be any strong correlations with cLogP. Partition coefficients are a complex function of polarity, size/sterics, and shape - a 1:1 correlation with a singular, other variable will be hard to pinpoint
This csv file doesn't have much other numerical data, but maybe some of those true/false, pass/fail data might be relevant...but I definitely need more context here
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA']
ax.matshow(moonshot_df[cols].corr(), cmap='RdBu')
ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)
ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)
for i, (rowname, row) in enumerate(moonshot_df[cols].corr().iterrows()):
for j, (key, val) in enumerate(row.iteritems()):
ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")
Some docking results
Okay here's a couple other CSVs I found, these include some docking scores
- Repurposing scores: "The Drug Repurposing Hub is a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and pre-clinical tool compounds with a companion information resource" source here, so a public dataset of some drugs
- Redock scores: "This directory contains experiments in redocking all screened fragments into the entire ensemble of X-ray structures." Taking fragments and re-docking them
repurposing_df = pd.read_csv('repurposing-screen/drugset-docked.csv')
redock_df = pd.read_csv('redock-fragments/all-screened-fragments-docked.csv')
SMILES strings, names, docking scores
repurposing_df.head()
Hybrid2 looks like a docking method provided via OpenEye. Mpro likely refers to COVID-19 main protease. I'm not entirely sure what the receptor for "Hybrid2" is, but there seem to be multiple "sites" or "fragments" for docking. There are lots of different fragments, but very few sites. For each site-fragment combination, multiple small molecules may have been tested.
repurposing_df['docked_fragment'].value_counts()
repurposing_df['site'].value_counts()
repurposing_df.groupby(["docked_fragment", "site"]).count()
Some molecules show up multiple times - why? Upon further investigation, this is mainly due to the molecule's presence in multiple databases
repurposing_df.groupby(['SMILES']).count().sort_values("TITLE")
repurposing_df[repurposing_df['SMILES']=="CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34"]
There doesn't seem to be a very good correlation between the two docking scores - if these are docking scores to different receptors, that would help explain things. It's worth noting that we're not seeing if the two numbers agree for each molecule, but if the trends persist (both scores go up for this molecule, but go down for this other molecule). The weak correlation suggests the trends do not persist between the two docking measures
repurposing_df[['Hybrid2', 'Mpro-_dock']].corr()
Redocking dataframe: SMILES, names, data collection information, docking scores
redock_df.head()
There don't seem to be many Mpro docking scores in this dataset (only one molecule has a non-null Mpro docking score)
redock_df[redock_df['Mpro-x0500_dock'].isnull()].count()
redock_df[~redock_df['Mpro-x0500_dock'].isnull()].count()
Are there overlaps in the molecules in each of these datasets?
repurpose_redock = repurposing_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))
moonshot_redock = moonshot_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))
repurpose_redock
We joined on SMILES string, and now we can compare the docking scores between the repurposing and redocking datasets.
Some Hybrid2
scores look quantitatively similar, but for those that don't, the ranking is still there.
Looking at the COVID-19 main protease (Mpro I believe?), the docking scores don't follow similar rankings - docking scores aren't transferable to different receptors (this might be a fairly obvious observation)
repurpose_redock[['SMILES', "TITLE_L", "TITLE_R", "Hybrid2_L", "Hybrid2_R", 'Mpro-_dock', 'Mpro-x0500_dock']]
Joining the moonshot submission and redocking datasets does not yield too many overlapping molecules
moonshot_redock
from chembl_webresource_client.new_client import new_client
molecule = new_client.molecule
res = molecule.search('CHEMBL1387')
res_df = pd.DataFrame.from_dict(res)
res_df.columns
res_df[['chirality', 'molecule_properties', 'molecule_structures', 'score']]
res_df[['molecule_properties']].values[0]
res_df['molecule_properties'].apply(pd.Series)
all_results = [molecule.search(a) for a in repurposing_df['TITLE']]
Here's a big Python function tangent.
For each chembl molecule, we've searched for it within the chembl, returning us a list (of length 1) containing a dictionary of properties.
All molecules have been compiled into a list, so we have a list of lists of dicionatires.
For sanity, we can use a Python filter
to only retain the non-None results.
We can chain that with a Python map
function to parse the first item from each molecule's list.
Recall, each molecule was a list with just one element, a dictionary.
We can boil this down to only returning the dictionary (eliminating the list wrapper).
For validation, I've called next
to look at the results
filtered = map(lambda x: x[0], filter(lambda x: x is not None, all_results))
next(filtered)
For now, I'm only really interested in the molecule_properties
dictionary
filtered = [a[0]['molecule_properties'] for a in all_results if len(a) > 0]
chembl_df = pd.DataFrame(filtered)
chembl_df['TITLE'] = repurposing_df['TITLE']
Molecular properties contained in the chembl database
Here are the definitions I can dig up
- alogp: (lipophilicity) partition coefficient
- aromatic_rings: number of aromatic rings
- cx_logd: distribution coefficient taking into account ionized and non-ionized forms
- cx_most_apka: acidic pka
- cx_most_bpka: basic pka
- full_mwt: molecular weight (and also free base and monoisotopic masses)
- hba: hydrogen bond acceptors (and hba_lipinski for lipinski definitiosn)
- hbd: hydrogen bond donors (and hbd_lipinski)
- heavy_atoms: number of heavy atoms
- num_lipinski_ro5_violations: how many times this molecule violated Lipinski's rule of five
- num_ro5_violations: not sure, seems similar to lipinski rule of 5
- psa: protein sequence alignment
- qed_weighted: "quantitative estimate of druglikeness" (ranges between 0 and 1, with 1 being more favorable). This is based on a quantitatve mean of drugability functions
- ro3_pass: rule of three
- rtb: number of rotatable bonds
chembl_df.head()
chembl_df.columns
chembl_df.corr()
At a glance, no definite linear correlations among this crowd besides pKas, partition coefficients, mwt/hba
corr_df = chembl_df.corr()
cols = chembl_df.columns
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
ax.imshow(chembl_df.corr(), cmap='RdBu')
ax.set_xticklabels(['']+cols)
ax.tick_params(axis='x', rotation=90)
ax.set_yticklabels(cols)
for i, (rowname, row) in enumerate(corr_df.iterrows()):
for j, (key, val) in enumerate(row.iteritems()):
ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")
Maybe there are higher-order correlations and relationship more appropriate for clustering and decomposition
cols = ['aromatic_rings', 'cx_logp', 'full_mwt', 'hba']
cleaned = (chembl_df[~chembl_df[cols]
.isnull()
.all(axis='columns', skipna=False)][cols]
.astype('float')
.fillna(0, axis='columns'))
from sklearn import preprocessing
normalized = preprocessing.scale(cleaned)
Appears to be maybe 4 clusters of these compounds examined by the covid-moonshot group
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(1,1)
ax.scatter(output[:,0], output[:,1])
ax.set_title("Aromatic rings, cx_logp, mwt, hba")
By taking turns leaving out some features, it looks like leaving out aromatic rings or hydrogen bond acceptors will diminish the cluster distinction.
Aromatic rings are huge and bulky components to small molecules, it makes sense that a chunk of the behavior corresponds to the aromatic rings. Similarly, hydrogen bond acceptors (heavy molecules) also induce van der Waals and electrostatics influences on small molecules. Left with only weight and partition coefficient, there's mainly a continous behavior
def clean_df(cols):
cleaned = (chembl_df[~chembl_df[cols]
.isnull()
.all(axis='columns', skipna=False)][cols]
.astype('float')
.fillna(0, axis='columns'))
normalized = preprocessing.scale(cleaned)
return normalized
cols = ['cx_logp', 'full_mwt', 'hba']
normalized = clean_df(cols)
tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(3,1, figsize=(8,8))
ax[0].scatter(output[:,0], output[:,1])
ax[0].set_title("cx_logp, mwt, hba")
cols = ['cx_logp', 'full_mwt', 'aromatic_rings']
normalized = clean_df(cols)
tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
ax[1].scatter(output[:,0], output[:,1])
ax[1].set_title("aromatic_rings, cx_logp, mwt")
cols = ['cx_logp', 'full_mwt']
normalized = clean_df(cols)
ax[2].scatter(normalized[:,0], normalized[:,1])
ax[2].set_title("cx_logp, mwt")
fig.tight_layout()
DrugBank
I found someone had already downloaded the database. I may double-over these dataframes, but query the drugbank dataset rather than chembl
moonshot = pd.read_csv('moonshot-submissions/covid_submissions_all_info-docked-overlap.csv')
moonshot
moonshot.head(5)
moonshot['Mpro-x1418_dock'].isnull().sum() # Lots of missing Mpro dock scores
While there are a lot of different fragments to which the small molecule can bind, there are two "classes", active-covalent and active-noncovalent (possibly referring to sites that covalently bond?)
This presents a way to logically bisect the data based on some fundamental chemistry of the binding pocket.
moonshot['docked_fragment'].value_counts()
moonshot['site'].value_counts()
We can examine the same correlations, but now for each type of site, and look at the hybrid docking score correlations.
The biggest trend differences appear with the partition coefficient and number of hydrogen bond donors, but still the correlations are extremely weak
site_type = 'active-noncovalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
ax.matshow(moonshot[moonshot['site']==site_type][cols].corr(), cmap='RdBu')
ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)
ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)
for i, (rowname, row) in enumerate(moonshot[moonshot['site']==site_type][cols].corr().iterrows()):
for j, (key, val) in enumerate(row.iteritems()):
ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")
ax.set_title(f"Docking to {site_type}")
site_type = 'active-covalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
ax.matshow(moonshot[moonshot['site']==site_type][cols].corr(), cmap='RdBu')
ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)
ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)
for i, (rowname, row) in enumerate(moonshot[moonshot['site']==site_type][cols].corr().iterrows()):
for j, (key, val) in enumerate(row.iteritems()):
ax.annotate(f"{val:0.2f}", xy=(i,j), xytext=(-10, -5), textcoords="offset points")
ax.set_title(f"Docking to {site_type}")
In general, lower docking score seem better, so the noncovalent sites might present more optimal binding locations (see histogram below). This seems non-intuitive because, if active-covalent really means sites that bond covalently, then covalent bonds would seem more energetically favorable than non-covalent interactions. Alternatively, forming covalent bonds might suggest an unstable region of the complex that could be shielded from the surroundings, inhibiting any sort of small molecule from binding the pocket? Expert opinion would be much appreciated here
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
covalent_mean = moonshot[moonshot['site']=='active-covalent']['Hybrid2'].mean()
noncovalent_mean = moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'].mean()
ax.hist(moonshot[moonshot['site']=='active-covalent']['Hybrid2'], alpha=0.5,
label=f'active-covalent (mean={covalent_mean:.3f})')
ax.hist(moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'], alpha=0.5,
label=f'active-noncovalent (mean={noncovalent_mean:.3f})')
ax.set_title(f"Hybrid2 histogram")
ax.set_xlabel("Hybrid2 score")
ax.legend()
from rdkit import Chem
rdkit_smiles = [Chem.MolFromSmiles(a) for a in moonshot.sort_values('Hybrid2', ascending=True)['SMILES'].head(10)]
scores = [f"{a:.3f}" for a in moonshot.sort_values('Hybrid2', ascending=True)['Hybrid2'].head(10)]
img=Chem.Draw.MolsToGridImage(rdkit_smiles,molsPerRow=5,subImgSize=(200,200),
legends=scores)
img