Accessing FoldingAtHome data on AWS
Some F@H data is freely accessible on AWS. This will be a relatively short post on accessing and navigating the data on AWS.
If you regularly use AWS, this will be nothing new. If you’re a grad student who has only ever navigated local file directories or used scp/rsync/ssh to interact with remote clusters, this might be your first time interacting with files on AWS S3.
The Python environment is a fairly straightforward analytical environment, plus s3fs, boto3, and botocore to interact with files on S3:
conda create -n fahaws python=3.7 pandas s3fs jupyter ipykernel -c conda-forge -yq
(Activate the environment: conda activate fahaws)
python -m pip install boto3 botocore
The AWS CLI
The AWS CLI navigates S3 buckets much like a shell navigates directories on a Unix-like system (see the AWS CLI installation instructions). Run
aws s3 ls s3://fah-public-data-covid19-absolute-free-energy/ --no-sign-request
to list the files in this particular S3 bucket. The --no-sign-request flag at the end skips request signing, so no credentials are needed.
You can read from stdout or pipe the output to a text file. Either way, this command will be your bread and butter for wading through terabytes and terabytes of F@H data.
As of this post (Dec 2020), it looks like the files in free_energy_data/ were last updated at the end of Sept 2020.
Summary of free energy results data
Fortunately, loading remote files via pandas is a common task, so there are convenient functions for it. Loading a dataframe from S3 is just like loading one locally (note the s3:// string syntax).
The column febkT looks like the binding free energies in units of $k_B T$ (multiply by $k_B T$, or by $RT$ on a per-mole basis, to get energies in kJ/mol or kcal/mol). It’s worth mentioning that a single binding free energy on its own is less helpful than the relative binding free energies for finding the best binder of the bunch (how do these free energies compare against each other?).
import pandas as pd
df = pd.read_pickle("s3://fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl")
df.head()
index | dataset | fah | identity | receptor | score | febkT | error | ns_RL | ns_L | wl_RL | L_error | RL_error |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1155 | MS0323_v3 | PROJ14822/RUN258 | DAR-DIA-43a-5 | protein-0387.pdb | -5.201610 | -25.546943 | 3.773523 | [131, 89, 74, 113, 80] | [450, 490, 540, 410, 620] | [0.18446, 0.14757, 0.18446, 0.18446, 0.18446] | 0.116912 | 3.280887 |
609 | MS0326_v3 | PROJ14823/RUN1202 | MUS-SCH-c2f-13 | Mpro-x0107-protein.pdb | -9.550890 | -25.259420 | 22.776358 | [121, 138, 96, 16, 5] | [200, 200, 200, 200, 200] | [0.18446, 0.18446, 0.23058, 0.23058, 0.23058] | 16.216396 | 0.109175 |
759 | MS0331_v3 | PROJ14825/RUN685 | MAK-UNK-129-18 | Mpro-x0107_0.pdb | -8.425830 | -24.789359 | 18.021078 | [58, 68, 5, 7] | [200] | [0.37782, 0.30226, 0.9224, 0.59034] | 0.000000 | 9.238496 |
615 | MS0326_v3 | PROJ14823/RUN2911 | ÁLV-UNI-7ff-30 | Mpro-x0540-protein.pdb | -2.774634 | -24.447756 | 6.605737 | [174, 124, 70] | [200, 200, 200, 200, 200] | [0.14757, 0.14757, 0.18446] | 0.042010 | 5.184169 |
1086 | MS0326_v3 | PROJ14823/RUN2580 | SEL-UNI-842-3 | Mpro-x0397-protein.pdb | -4.474095 | -23.705301 | 1.248983 | [166, 134, 45] | [200, 200, 200, 200, 200] | [0.18015, 0.22519, 0.35183] | 0.212546 | 2.529874 |
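As a rough sketch of the unit conversion described above: multiplying a value in $k_B T$ by $RT$ gives kcal/mol. The temperature of 300 K is an assumption (the simulation temperature is not stated here), and the helper names are hypothetical. The five febkT values are copied from the rows shown above.

```python
# Gas constant in kcal/(mol*K)
R_KCAL_PER_MOL_K = 1.987204e-3


def kt_to_kcal_per_mol(value_kt, temperature_k=300.0):
    """Convert an energy in units of k_B*T to kcal/mol (300 K is an assumed default)."""
    return value_kt * R_KCAL_PER_MOL_K * temperature_k


def relative_to_best(values):
    """Shift values so the most negative one (the best binder) sits at zero."""
    best = min(values)
    return [v - best for v in values]


# febkT values from the five rows of df.head() shown above
febkt = [-25.546943, -25.259420, -24.789359, -24.447756, -23.705301]

kcal = [kt_to_kcal_per_mol(v) for v in febkt]
relative = relative_to_best(kcal)

for raw, rel in zip(kcal, relative):
    print(f"{raw:8.2f} kcal/mol   (+{rel:.2f} relative to best)")
```

With the dataframe loaded, the same conversion could be applied to the whole column as df["febkT"] * R_KCAL_PER_MOL_K * 300.0, again assuming 300 K.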
Some code to iterate through these buckets
In Python, we can use boto3 to list each object in this S3 bucket. The UNSIGNED signature setting plays the same role as --no-sign-request in the CLI.
import boto3
from botocore import UNSIGNED
from botocore.client import Config
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = "fah-public-data-covid19-absolute-free-energy"
bucket = s3.Bucket(bucket_name)
This S3 bucket is very large – all the simulation inputs, trajectories, and outputs are in here, so it will take a while to enumerate every object. Instead, we’ll just make a generator and pull out a single item for proof-of-concept.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name)
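def page_iterator(pages):
    # Lazily yield every object key, one page of results at a time
    for page in pages:
        # 'Contents' is absent from a page that holds no objects
        for item in page.get('Contents', []):
            yield item['Key']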
all_objects = page_iterator(pages)
next(all_objects)
'PROJ14377/RUN0/CLONE0/frame0.tpr'
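Because the generator is lazy, pulling a handful of keys does not require walking the whole bucket. Here is a minimal sketch using itertools.islice. It runs against hand-written stand-in pages so it works without network access: only the first key is real (it is the one shown above), and the placeholder keys are made up for illustration. With the real paginator output the pattern would be the same.

```python
from itertools import islice


def iter_keys(pages):
    """Yield object keys from an iterable of list_objects_v2-style pages."""
    for page in pages:
        # A page with no objects has no 'Contents' entry
        for item in page.get("Contents", []):
            yield item["Key"]


# Stand-in pages for illustration only; real pages come from paginator.paginate(...)
fake_pages = [
    {"Contents": [{"Key": "PROJ14377/RUN0/CLONE0/frame0.tpr"},
                  {"Key": "placeholder/key-1"}]},
    {},  # an empty page
    {"Contents": [{"Key": "placeholder/key-2"}]},
]

# Take only the first two keys; later pages are never touched
first_two = list(islice(iter_keys(fake_pages), 2))
print(first_two)
```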
And if you wanted to, you could layer a filter over the generator to impose some logic, such as keeping only the keys that sit directly inside a top-level directory:
first_level_dirs = filter(lambda x: x.count('/')==1, all_objects)
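One caveat with filtering the full generator is that it still has to walk every key in the bucket. As an alternative sketch (not what the notebook does), list_objects_v2 accepts a Delimiter argument, and S3 then reports the top-level "directories" as CommonPrefixes without enumerating their contents. The function names below are hypothetical, and the stand-in page exists only so the parsing can be checked offline.

```python
def top_level_prefixes(pages):
    """Yield the 'directory' prefixes from pages listed with Delimiter='/'."""
    for page in pages:
        # 'CommonPrefixes' is only present when the page contains grouped prefixes
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


def list_top_level(bucket_name):
    """List top-level prefixes of a public bucket using anonymous requests."""
    # Imported here so the parsing helper above can be used without boto3 installed
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Delimiter="/")
    return list(top_level_prefixes(pages))


# Stand-in page shaped like a list_objects_v2 response, for illustration only
fake_pages = [
    {
        "CommonPrefixes": [{"Prefix": "PROJ14377/"}, {"Prefix": "free_energy_data/"}],
        "Contents": [{"Key": "receptor_structures.tar.gz"}],
    },
    {},  # an empty page
]
print(list(top_level_prefixes(fake_pages)))
```

Calling list_top_level("fah-public-data-covid19-absolute-free-energy") would be expected to return prefixes such as 'PROJ14377/', though this sketch has not been run against the live bucket.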
Unix-like Python filesystem libraries
S3FS, built on botocore and fsspec, offers a very Unix-like syntax for navigating and opening files.
import s3fs
fs = s3fs.S3FileSystem(anon=True)
fs.ls(bucket_name)
['fah-public-data-covid19-absolute-free-energy/PROJ14377',
'fah-public-data-covid19-absolute-free-energy/PROJ14378',
'fah-public-data-covid19-absolute-free-energy/PROJ14379',
'fah-public-data-covid19-absolute-free-energy/PROJ14380',
'fah-public-data-covid19-absolute-free-energy/PROJ14383',
'fah-public-data-covid19-absolute-free-energy/PROJ14384',
'fah-public-data-covid19-absolute-free-energy/PROJ14630',
'fah-public-data-covid19-absolute-free-energy/PROJ14631',
'fah-public-data-covid19-absolute-free-energy/PROJ14650',
'fah-public-data-covid19-absolute-free-energy/PROJ14651',
'fah-public-data-covid19-absolute-free-energy/PROJ14652',
'fah-public-data-covid19-absolute-free-energy/PROJ14653',
'fah-public-data-covid19-absolute-free-energy/PROJ14654',
'fah-public-data-covid19-absolute-free-energy/PROJ14655',
'fah-public-data-covid19-absolute-free-energy/PROJ14656',
'fah-public-data-covid19-absolute-free-energy/PROJ14665',
'fah-public-data-covid19-absolute-free-energy/PROJ14666',
'fah-public-data-covid19-absolute-free-energy/PROJ14667',
'fah-public-data-covid19-absolute-free-energy/PROJ14668',
'fah-public-data-covid19-absolute-free-energy/PROJ14669',
'fah-public-data-covid19-absolute-free-energy/PROJ14670',
'fah-public-data-covid19-absolute-free-energy/PROJ14671',
'fah-public-data-covid19-absolute-free-energy/PROJ14702',
'fah-public-data-covid19-absolute-free-energy/PROJ14703',
'fah-public-data-covid19-absolute-free-energy/PROJ14704',
'fah-public-data-covid19-absolute-free-energy/PROJ14705',
'fah-public-data-covid19-absolute-free-energy/PROJ14723',
'fah-public-data-covid19-absolute-free-energy/PROJ14724',
'fah-public-data-covid19-absolute-free-energy/PROJ14726',
'fah-public-data-covid19-absolute-free-energy/PROJ14802',
'fah-public-data-covid19-absolute-free-energy/PROJ14803',
'fah-public-data-covid19-absolute-free-energy/PROJ14804',
'fah-public-data-covid19-absolute-free-energy/PROJ14805',
'fah-public-data-covid19-absolute-free-energy/PROJ14806',
'fah-public-data-covid19-absolute-free-energy/PROJ14807',
'fah-public-data-covid19-absolute-free-energy/PROJ14808',
'fah-public-data-covid19-absolute-free-energy/PROJ14809',
'fah-public-data-covid19-absolute-free-energy/PROJ14810',
'fah-public-data-covid19-absolute-free-energy/PROJ14811',
'fah-public-data-covid19-absolute-free-energy/PROJ14812',
'fah-public-data-covid19-absolute-free-energy/PROJ14813',
'fah-public-data-covid19-absolute-free-energy/PROJ14823',
'fah-public-data-covid19-absolute-free-energy/PROJ14824',
'fah-public-data-covid19-absolute-free-energy/PROJ14826',
'fah-public-data-covid19-absolute-free-energy/PROJ14833',
'fah-public-data-covid19-absolute-free-energy/SVR51748107',
'fah-public-data-covid19-absolute-free-energy/free_energy_data',
'fah-public-data-covid19-absolute-free-energy/receptor_structures.tar.gz',
'fah-public-data-covid19-absolute-free-energy/setup_files']
fs.ls(bucket_name + "/free_energy_data")
['fah-public-data-covid19-absolute-free-energy/free_energy_data/',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_L_14382.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14717.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14718.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14719.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14720.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14817.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14818.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14819.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14820.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_L_14676.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14730.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14830.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_L_14374.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14721.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14821.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_L_14364.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14722.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14822.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_L_14369_14372_14370_14371.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14723.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14724.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14823.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14824.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_L_14376.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14725.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14825.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_L_14380.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14727.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14728.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14827.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14828.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_L_14378.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14752.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14852.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl',
'fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl']
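The file names in this listing look like they follow a dataset_kind_projects pattern, where kind is L or RL and the trailing numbers are project IDs. That reading is a guess from the names alone (L and RL presumably meaning ligand and receptor-ligand), not something documented here. Under that assumption, a small hypothetical helper can split the names into parts:

```python
import re
from posixpath import basename

# Assumed pattern: <dataset>_<L|RL>_<project>[_<project>...].pkl
NAME_PATTERN = re.compile(
    r"^(?P<dataset>.+?)_(?P<kind>RL|L)_(?P<projects>\d+(?:_\d+)*)\.pkl$"
)


def parse_free_energy_name(path):
    """Return (dataset, kind, [project numbers]), or None if the name does not fit."""
    match = NAME_PATTERN.match(basename(path))
    if match is None:
        return None
    projects = [int(p) for p in match.group("projects").split("_")]
    return match.group("dataset"), match.group("kind"), projects


# Paths copied from the listing above
prefix = "fah-public-data-covid19-absolute-free-energy/free_energy_data/"
examples = [
    prefix + "BRO_RL_14717.pkl",
    prefix + "MS0326_L_14369_14372_14370_14371.pkl",
    prefix + "MS0406-2_RL_14727.pkl",
    prefix + "results.pkl",  # does not follow the pattern
]
for path in examples:
    print(parse_free_energy_name(path))
```

Combined with s3fs, the same helper could be mapped over fs.ls(bucket_name + "/free_energy_data") to group the pickles by dataset.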
with fs.open('fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt', 'r') as f:
    print(f.read())
hello aws!
with fs.open("fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl", 'rb') as f:
    organization_df = pd.read_pickle(f)
organization_df.head()
index | dataset | identity | receptor | score | v1_project | v1_run | v2_project | v2_run | v3_project | v3_run | project | run |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 72_RL | CCNCC(COC)Oc1ccccc1 | receptor-270-343.pdb | 0.999790 | 14600 | 0 | 14700 | 0 | 14800 | 0 | NaN | NaN |
1 | 72_RL | O=C(Cc1cccnc1)c1ccccc1 | receptor-343.pdb | 0.999652 | 14600 | 1 | 14700 | 1 | 14800 | 1 | NaN | NaN |
2 | 72_RL | CCCCC(N)c1cc(C)ccn1 | receptor-343.pdb | 0.999256 | 14600 | 2 | 14700 | 2 | 14800 | 2 | NaN | NaN |
3 | 72_RL | COCC(C)Nc1ccncn1 | receptor-343.pdb | 0.999096 | 14600 | 3 | 14700 | 3 | 14800 | 3 | NaN | NaN |
4 | 72_RL | CCN(CC)CCNc1ccc(C#N)cn1 | receptor-270-343.pdb | 0.998980 | 14600 | 4 | 14700 | 4 | 14800 | 4 | NaN | NaN |
Notebook itself can be found here