Accessing FoldingAtHome data on AWS

Some F@H data is freely accessible on AWS. This will be a relatively short post on accessing and navigating the data on AWS.

If you regularly use AWS, this will be nothing new. If you’re a grad student who has only ever navigated local file directories or used scp/rsync/ssh to interact with remote clusters, this might be your first time interacting with files on AWS S3.

The Python environment is a fairly straightforward analysis environment, with s3fs, boto3, and botocore added to interact with files on S3:

conda create -n fahaws python=3.7 pandas s3fs jupyter ipykernel -c conda-forge -yq

(Activate the environment, e.g. with conda activate fahaws)

python -m pip install boto3 botocore

The AWS CLI

The AWS CLI commands for navigating S3 buckets mirror those of Unix-like systems. If you don't have it yet, follow the AWS CLI installation instructions first.

To list the files within this particular S3 bucket, run:

aws s3 ls s3://fah-public-data-covid19-absolute-free-energy/ --no-sign-request

The --no-sign-request flag at the end lets us bypass the need for any credentials.

You can read the listing from stdout or redirect it to a text file; either way, this command will be your bread and butter for wading through terabytes and terabytes of F@H data.
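
If you do redirect the listing to a text file, a few lines of Python can turn it back into something structured. This is a minimal sketch that assumes the usual non-recursive aws s3 ls layout ("PRE name/" for prefixes, "date time size name" for objects). The helper name parse_s3_ls is made up for illustration, and the date and size in the sample text are placeholders, not real values from the bucket.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    is_prefix: bool
    size: Optional[int]  # size in bytes; None for prefixes ("directories")


def parse_s3_ls(text: str) -> List[Entry]:
    """Parse the text printed by a non-recursive `aws s3 ls`."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if not parts:
            continue  # skip blank lines
        if parts[0] == "PRE":
            # Prefix lines look like: "PRE PROJ14377/"
            entries.append(Entry(line.strip()[3:].strip(), True, None))
        elif len(parts) == 4:
            # Object lines look like: "<date> <time> <size> <name>"
            entries.append(Entry(parts[3], False, int(parts[2])))
    return entries


# Placeholder listing: the names appear in the bucket, the date and size are invented.
SAMPLE_LISTING = """\
                           PRE PROJ14377/
                           PRE free_energy_data/
2020-09-30 00:00:00       1024 receptor_structures.tar.gz
"""

for entry in parse_s3_ls(SAMPLE_LISTING):
    print(entry)
```

Keeping prefixes and objects in one list makes it easy to filter on is_prefix afterwards.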

As of this post (Dec 2020), the files in free_energy_data/ appear to have last been updated at the end of Sept 2020.

Summary of free energy results data

Fortunately, loading remote files via pandas is a common task, so there are convenient functions. Loading a dataframe from S3 is just like loading a dataframe locally (note the s3:// string syntax).

The column febkT looks like the binding free energies in units of $k_B T$ (multiply by Boltzmann’s constant and temperature to get energies, or by $RT$ to get molar energies in kJ/mol or kcal/mol). It’s worth mentioning that the value of any single binding free energy is not as helpful as the relative binding free energies for finding the best binder of the bunch (how do these free energies compare against each other?).

import pandas as pd
df = pd.read_pickle("s3://fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl")
df.head()
| | dataset | fah | identity | receptor | score | febkT | error | ns_RL | ns_L | wl_RL | L_error | RL_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1155 | MS0323_v3 | PROJ14822/RUN258 | DAR-DIA-43a-5 | protein-0387.pdb | -5.201610 | -25.546943 | 3.773523 | [131, 89, 74, 113, 80] | [450, 490, 540, 410, 620] | [0.18446, 0.14757, 0.18446, 0.18446, 0.18446] | 0.116912 | 3.280887 |
| 609 | MS0326_v3 | PROJ14823/RUN1202 | MUS-SCH-c2f-13 | Mpro-x0107-protein.pdb | -9.550890 | -25.259420 | 22.776358 | [121, 138, 96, 16, 5] | [200, 200, 200, 200, 200] | [0.18446, 0.18446, 0.23058, 0.23058, 0.23058] | 16.216396 | 0.109175 |
| 759 | MS0331_v3 | PROJ14825/RUN685 | MAK-UNK-129-18 | Mpro-x0107_0.pdb | -8.425830 | -24.789359 | 18.021078 | [58, 68, 5, 7] | [200] | [0.37782, 0.30226, 0.9224, 0.59034] | 0.000000 | 9.238496 |
| 615 | MS0326_v3 | PROJ14823/RUN2911 | ÁLV-UNI-7ff-30 | Mpro-x0540-protein.pdb | -2.774634 | -24.447756 | 6.605737 | [174, 124, 70] | [200, 200, 200, 200, 200] | [0.14757, 0.14757, 0.18446] | 0.042010 | 5.184169 |
| 1086 | MS0326_v3 | PROJ14823/RUN2580 | SEL-UNI-842-3 | Mpro-x0397-protein.pdb | -4.474095 | -23.705301 | 1.248983 | [166, 134, 45] | [200, 200, 200, 200, 200] | [0.18015, 0.22519, 0.35183] | 0.212546 | 2.529874 |
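
The unit conversion and the "relative to the best binder" comparison mentioned above can be sketched in a few lines. This assumes febkT really is in units of $k_B T$ and that the simulations ran near 298.15 K; the temperature is my assumption, not something stated in the data, so adjust it if the setup files say otherwise. The function names are illustrative only.

```python
# Gas constant in kcal/(mol*K); R*T is the molar energy equivalent of 1 kT.
R_KCAL_PER_MOL_K = 0.0019872041


def kt_to_kcal_per_mol(fe_kt, temperature_k=298.15):
    """Convert a free energy in units of kT to kcal/mol (temperature is assumed)."""
    return fe_kt * R_KCAL_PER_MOL_K * temperature_k


def relative_to_best(free_energies):
    """Shift free energies so the most negative (best) binder sits at zero."""
    best = min(free_energies)
    return [fe - best for fe in free_energies]


# febkT values copied from the five rows of df.head() shown above
febkt = [-25.546943, -25.259420, -24.789359, -24.447756, -23.705301]

kcal = [kt_to_kcal_per_mol(x) for x in febkt]
ddg = relative_to_best(kcal)

for fe, rel in zip(kcal, ddg):
    print(f"{fe:8.2f} kcal/mol   ({rel:+.2f} relative to best)")
```

With a dataframe in hand, the same conversion is just a column multiplication, e.g. df["febkT"] * R_KCAL_PER_MOL_K * 298.15.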

Some code to iterate through these buckets

In Python, we can use boto3 to list each object in this S3 bucket.

import boto3
from botocore import UNSIGNED
from botocore.client import Config

s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = "fah-public-data-covid19-absolute-free-energy"
bucket = s3.Bucket(bucket_name)

This S3 bucket is very large – all the simulation inputs, trajectories, and outputs are in here, so it will take a while to enumerate every object. Instead, we’ll just make a generator and pull out a single item for proof-of-concept.

paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name)
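def page_iterator(pages):
    # Yield every object key, one page of results at a time
    for page in pages:
        # A page with no objects has no 'Contents' entry
        for item in page.get('Contents', []):
            yield item['Key']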
all_objects = page_iterator(pages)
next(all_objects)
'PROJ14377/RUN0/CLONE0/frame0.tpr'
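
If you want more than one key but still far fewer than the whole bucket, itertools.islice can cap how much of the generator gets consumed. The sketch below uses hand-written stand-in pages shaped like list_objects_v2 responses so it runs without network access. The second key is hypothetical, and the helper names iter_keys and first_keys are mine.

```python
from itertools import islice


def iter_keys(pages):
    """Lazily yield object keys from an iterable of list_objects_v2 pages."""
    for page in pages:
        # Pages without any objects have no 'Contents' entry
        for item in page.get("Contents", []):
            yield item["Key"]


def first_keys(pages, n):
    """Return at most n keys without walking the rest of the listing."""
    return list(islice(iter_keys(pages), n))


# Stand-in pages; with boto3 you would pass paginator.paginate(Bucket=bucket_name)
fake_pages = [
    {"Contents": [{"Key": "PROJ14377/RUN0/CLONE0/frame0.tpr"},
                  {"Key": "PROJ14377/RUN0/CLONE1/frame0.tpr"}]},  # hypothetical key
    {},  # an empty page
    {"Contents": [{"Key": "free_energy_data/results.pkl"}]},
]

print(first_keys(fake_pages, 2))
```

The boto3 paginator also accepts PaginationConfig={'MaxItems': n} in paginate(), which limits the listing on the request side instead.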

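And if you wanted to, you could layer a filter over the generator to impose some logic, like keeping only the keys that sit exactly one level below the bucket root. The filter is lazy too, but it still has to walk every key in the bucket to find matches.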

first_level_dirs = filter(lambda x: x.count('/')==1, all_objects)
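
Counting slashes matches keys one level deep rather than the directories themselves, and it scans the whole bucket. An alternative sketch: ask S3 to group keys with Delimiter='/', in which case each page carries a CommonPrefixes list holding the top-level "directories". The function names here are illustrative; list_top_level needs boto3 and network access and is only defined, not called.

```python
def top_level_prefixes(pages):
    """Yield the prefixes S3 reports when a listing uses Delimiter='/'."""
    for page in pages:
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


def list_top_level(bucket_name):
    """Fetch top-level prefixes anonymously (requires boto3 and network access)."""
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Delimiter="/")
    return list(top_level_prefixes(pages))


# Offline stand-in for one response page, to show the expected shape
fake_pages = [
    {
        "CommonPrefixes": [{"Prefix": "PROJ14377/"}, {"Prefix": "free_energy_data/"}],
        "Contents": [{"Key": "receptor_structures.tar.gz"}],
    }
]

print(list(top_level_prefixes(fake_pages)))
```

Calling list_top_level(bucket_name) should then return the project directories in a handful of requests rather than one pass over every object.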

Unix-like Python filesystem libraries

S3FS, built on botocore and fsspec, offers a very Unix-like syntax for navigating and opening files.

import s3fs
fs = s3fs.S3FileSystem(anon=True)
fs.ls(bucket_name)
['fah-public-data-covid19-absolute-free-energy/PROJ14377',
 'fah-public-data-covid19-absolute-free-energy/PROJ14378',
 'fah-public-data-covid19-absolute-free-energy/PROJ14379',
 'fah-public-data-covid19-absolute-free-energy/PROJ14380',
 'fah-public-data-covid19-absolute-free-energy/PROJ14383',
 'fah-public-data-covid19-absolute-free-energy/PROJ14384',
 'fah-public-data-covid19-absolute-free-energy/PROJ14630',
 'fah-public-data-covid19-absolute-free-energy/PROJ14631',
 'fah-public-data-covid19-absolute-free-energy/PROJ14650',
 'fah-public-data-covid19-absolute-free-energy/PROJ14651',
 'fah-public-data-covid19-absolute-free-energy/PROJ14652',
 'fah-public-data-covid19-absolute-free-energy/PROJ14653',
 'fah-public-data-covid19-absolute-free-energy/PROJ14654',
 'fah-public-data-covid19-absolute-free-energy/PROJ14655',
 'fah-public-data-covid19-absolute-free-energy/PROJ14656',
 'fah-public-data-covid19-absolute-free-energy/PROJ14665',
 'fah-public-data-covid19-absolute-free-energy/PROJ14666',
 'fah-public-data-covid19-absolute-free-energy/PROJ14667',
 'fah-public-data-covid19-absolute-free-energy/PROJ14668',
 'fah-public-data-covid19-absolute-free-energy/PROJ14669',
 'fah-public-data-covid19-absolute-free-energy/PROJ14670',
 'fah-public-data-covid19-absolute-free-energy/PROJ14671',
 'fah-public-data-covid19-absolute-free-energy/PROJ14702',
 'fah-public-data-covid19-absolute-free-energy/PROJ14703',
 'fah-public-data-covid19-absolute-free-energy/PROJ14704',
 'fah-public-data-covid19-absolute-free-energy/PROJ14705',
 'fah-public-data-covid19-absolute-free-energy/PROJ14723',
 'fah-public-data-covid19-absolute-free-energy/PROJ14724',
 'fah-public-data-covid19-absolute-free-energy/PROJ14726',
 'fah-public-data-covid19-absolute-free-energy/PROJ14802',
 'fah-public-data-covid19-absolute-free-energy/PROJ14803',
 'fah-public-data-covid19-absolute-free-energy/PROJ14804',
 'fah-public-data-covid19-absolute-free-energy/PROJ14805',
 'fah-public-data-covid19-absolute-free-energy/PROJ14806',
 'fah-public-data-covid19-absolute-free-energy/PROJ14807',
 'fah-public-data-covid19-absolute-free-energy/PROJ14808',
 'fah-public-data-covid19-absolute-free-energy/PROJ14809',
 'fah-public-data-covid19-absolute-free-energy/PROJ14810',
 'fah-public-data-covid19-absolute-free-energy/PROJ14811',
 'fah-public-data-covid19-absolute-free-energy/PROJ14812',
 'fah-public-data-covid19-absolute-free-energy/PROJ14813',
 'fah-public-data-covid19-absolute-free-energy/PROJ14823',
 'fah-public-data-covid19-absolute-free-energy/PROJ14824',
 'fah-public-data-covid19-absolute-free-energy/PROJ14826',
 'fah-public-data-covid19-absolute-free-energy/PROJ14833',
 'fah-public-data-covid19-absolute-free-energy/SVR51748107',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data',
 'fah-public-data-covid19-absolute-free-energy/receptor_structures.tar.gz',
 'fah-public-data-covid19-absolute-free-energy/setup_files']
fs.ls(bucket_name + "/free_energy_data")
['fah-public-data-covid19-absolute-free-energy/free_energy_data/',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_L_14382.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14717.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14718.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14719.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14720.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14817.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14818.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14819.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14820.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_L_14676.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14730.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14830.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_L_14374.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14721.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14821.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_L_14364.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14722.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14822.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_L_14369_14372_14370_14371.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14723.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14724.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14823.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14824.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_L_14376.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14725.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14825.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_L_14380.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14727.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14728.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14827.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14828.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_L_14378.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14752.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14852.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl']
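
The pickle names in this listing seem to follow a dataset_kind_projects pattern, which makes them easy to group before opening anything. The sketch below is a guess at that convention: I am assuming L and RL mark ligand-only versus receptor-ligand results and that the trailing numbers are project IDs. Neither is confirmed by the bucket itself, and parse_result_name is a made-up helper.

```python
def parse_result_name(path):
    """Split e.g. 'MS0326_RL_14823.pkl' into (dataset, kind, [project ids]).

    Returns None for files that do not fit the assumed naming pattern.
    """
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".pkl"):
        return None
    parts = name[: -len(".pkl")].split("_")
    for i, token in enumerate(parts):
        if token in ("L", "RL"):
            projects = parts[i + 1:]
            if not projects or not all(p.isdigit() for p in projects):
                return None
            return "_".join(parts[:i]), token, [int(p) for p in projects]
    return None


# A few names taken from the listing above
names = [
    "fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_L_14382.pkl",
    "fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_L_14369_14372_14370_14371.pkl",
    "fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14727.pkl",
    "fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt",
    "fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl",
]

for n in names:
    print(parse_result_name(n))
```

In practice you would feed it the output of fs.ls(bucket_name + "/free_energy_data") directly.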
with fs.open('fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt', 'r') as f:
    print(f.read())
hello aws!
with fs.open("fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl", 'rb') as f:
    organization_df = pd.read_pickle(f)
organization_df.head()
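| | dataset | identity | receptor | score | v1_project | v1_run | v2_project | v2_run | v3_project | v3_run | project | run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72_RL | CCNCC(COC)Oc1ccccc1 | receptor-270-343.pdb | 0.999790 | 14600 | 0 | 14700 | 0 | 14800 | 0 | NaN | NaN |
| 1 | 72_RL | O=C(Cc1cccnc1)c1ccccc1 | receptor-343.pdb | 0.999652 | 14600 | 1 | 14700 | 1 | 14800 | 1 | NaN | NaN |
| 2 | 72_RL | CCCCC(N)c1cc(C)ccn1 | receptor-343.pdb | 0.999256 | 14600 | 2 | 14700 | 2 | 14800 | 2 | NaN | NaN |
| 3 | 72_RL | COCC(C)Nc1ccncn1 | receptor-343.pdb | 0.999096 | 14600 | 3 | 14700 | 3 | 14800 | 3 | NaN | NaN |
| 4 | 72_RL | CCN(CC)CCNc1ccc(C#N)cn1 | receptor-270-343.pdb | 0.998980 | 14600 | 4 | 14700 | 4 | 14800 | 4 | NaN | NaN |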
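
The project and run columns here look like they could be stitched into the PROJ…/RUN… strings seen in the fah column of results.pkl and in the bucket's key layout. A small helper for that, with NaN handling, might look like the sketch below. That the columns map onto bucket prefixes this way is my assumption, and run_prefix is a hypothetical name.

```python
import math


def _missing(value):
    """True for None or a float NaN (how pandas marks missing numbers)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def run_prefix(project, run):
    """Build a 'PROJ<project>/RUN<run>' string, or None if either part is missing."""
    if _missing(project) or _missing(run):
        return None
    return f"PROJ{int(project)}/RUN{int(run)}"


print(run_prefix(14800, 0))                    # built from v3_project, v3_run of row 0
print(run_prefix(float("nan"), float("nan")))  # the NaN project/run columns
```

Applied row-wise (for example over zip(organization_df.v3_project, organization_df.v3_run)), this would give a column you could try matching against the fah column of the results dataframe.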

Notebook itself can be found here