Some F@H data is freely accessible on AWS. This will be a relatively short post on accessing and navigating the data on AWS.

If you regularly use AWS, this will be nothing new. If you're a grad student who has only ever navigated local file directories or used scp/rsync/ssh to interact with remote clusters, this might be your first time interacting with files on AWS S3.

The Python environment is a fairly straightforward analytical environment, plus s3fs, boto3, and botocore to interact with files on S3.

conda create -n fahaws python=3.7 pandas s3fs jupyter ipykernel -c conda-forge -yq

(Activate the environment with conda activate fahaws)

python -m pip install boto3 botocore

The AWS CLI

The tools for navigating files within S3 buckets follow those of Unix-like systems. See the AWS CLI installation instructions to get set up.

Use aws s3 ls s3://fah-public-data-covid19-absolute-free-energy/ --no-sign-request to list files within this particular S3 bucket. The --no-sign-request flag at the end lets us bypass the need for any credentials.

You can read from stdout or pipe the output to a text file, but this command will be your bread and butter for wading through terabytes and terabytes of F@H data.
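If you do pipe the listing to a text file, a small parser makes it easier to work with in Python. The sketch below assumes the usual aws s3 ls layout (a PRE marker for prefixes; date, time, size, and name for objects). The dates and sizes in SAMPLE are made up for illustration, and ListingEntry and parse_s3_ls are hypothetical helper names, not part of any AWS library.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class ListingEntry(NamedTuple):
    name: str
    is_prefix: bool
    size: Optional[int]            # bytes; None for prefixes
    modified: Optional[datetime]   # None for prefixes


def parse_s3_ls(text: str) -> List[ListingEntry]:
    """Parse the text printed by `aws s3 ls` into structured entries."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "PRE":
            # Prefix ("directory") lines look like: "      PRE name/"
            name = line.split("PRE", 1)[1].strip()
            entries.append(ListingEntry(name, True, None, None))
            continue
        # Object lines look like: "YYYY-MM-DD HH:MM:SS  <size> <name>"
        date, time, size, name = line.split(None, 3)
        modified = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        entries.append(ListingEntry(name, False, int(size), modified))
    return entries


# Hypothetical listing: the names come from the bucket, the date and size do not.
SAMPLE = """\
                           PRE PROJ14377/
                           PRE free_energy_data/
2020-09-30 12:00:00       1024 receptor_structures.tar.gz
"""

entries = parse_s3_ls(SAMPLE)
prefixes = [e.name for e in entries if e.is_prefix]
```

Here prefixes holds only the directory-like entries, which is a convenient starting point for deciding which project folder to list next.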

As of this post (Dec 2020), it looks like the files in free_energy_data/ were last updated at the end of Sept 2020.

Summary of free energy results data

Fortunately, loading remote files via pandas is a common task, so there are convenient functions. Loading a dataframe over S3 is just like loading a dataframe locally (note the s3:// prefix in the path string).

The column febkT looks like the binding free energies in units of $k_B T$ (multiply by $k_B T$, or by $RT$ for molar quantities, to get energies in kJ/mol or kcal/mol). It's worth mentioning that the absolute value of a single binding free energy is less helpful than the relative binding free energies when looking for the best binder of the bunch (how do these free energies compare against each other?).

import pandas as pd
df = pd.read_pickle("s3://fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl")
df.head()
| | dataset | fah | identity | receptor | score | febkT | error | ns_RL | ns_L | wl_RL | L_error | RL_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1155 | MS0323_v3 | PROJ14822/RUN258 | DAR-DIA-43a-5 | protein-0387.pdb | -5.201610 | -25.546943 | 3.773523 | [131, 89, 74, 113, 80] | [450, 490, 540, 410, 620] | [0.18446, 0.14757, 0.18446, 0.18446, 0.18446] | 0.116912 | 3.280887 |
| 609 | MS0326_v3 | PROJ14823/RUN1202 | MUS-SCH-c2f-13 | Mpro-x0107-protein.pdb | -9.550890 | -25.259420 | 22.776358 | [121, 138, 96, 16, 5] | [200, 200, 200, 200, 200] | [0.18446, 0.18446, 0.23058, 0.23058, 0.23058] | 16.216396 | 0.109175 |
| 759 | MS0331_v3 | PROJ14825/RUN685 | MAK-UNK-129-18 | Mpro-x0107_0.pdb | -8.425830 | -24.789359 | 18.021078 | [58, 68, 5, 7] | [200] | [0.37782, 0.30226, 0.9224, 0.59034] | 0.000000 | 9.238496 |
| 615 | MS0326_v3 | PROJ14823/RUN2911 | ÁLV-UNI-7ff-30 | Mpro-x0540-protein.pdb | -2.774634 | -24.447756 | 6.605737 | [174, 124, 70] | [200, 200, 200, 200, 200] | [0.14757, 0.14757, 0.18446] | 0.042010 | 5.184169 |
| 1086 | MS0326_v3 | PROJ14823/RUN2580 | SEL-UNI-842-3 | Mpro-x0397-protein.pdb | -4.474095 | -23.705301 | 1.248983 | [166, 134, 45] | [200, 200, 200, 200, 200] | [0.18015, 0.22519, 0.35183] | 0.212546 | 2.529874 |
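To make the unit conversion and the "relative" comparison concrete, here is a minimal sketch. It assumes a temperature of 300 K, which is my placeholder and not something stated in the dataset; kt_to_molar_energy and relative_to_best are hypothetical helper names. The three input values are the first three febkT entries from the table above.

```python
# Molar gas constant R = N_A * k_B in two common unit systems
R_KCAL_PER_MOL_K = 1.987204259e-3
R_KJ_PER_MOL_K = 8.314462618e-3


def kt_to_molar_energy(fe_kt, temperature_k=300.0, unit="kcal/mol"):
    """Convert a free energy in units of k_B*T to a molar energy.

    temperature_k defaults to 300 K, an assumed value; use the actual
    simulation temperature if you know it.
    """
    if unit == "kcal/mol":
        gas_constant = R_KCAL_PER_MOL_K
    elif unit == "kJ/mol":
        gas_constant = R_KJ_PER_MOL_K
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return fe_kt * gas_constant * temperature_k


def relative_to_best(values):
    """Shift free energies so the most favorable (lowest) value is zero."""
    best = min(values)
    return [v - best for v in values]


febkT = [-25.546943, -25.259420, -24.789359]  # from the table above
in_kcal = [kt_to_molar_energy(v) for v in febkT]
relative_kt = relative_to_best(febkT)
```

With a dataframe, the same arithmetic applies column-wise, for example df["febkT"] - df["febkT"].min() for the relative values.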

Some code to iterate through these buckets

In Python, we can use boto3 to list each object in this S3 bucket.

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Unsigned requests let us read the public bucket without AWS credentials
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = "fah-public-data-covid19-absolute-free-energy"
bucket = s3.Bucket(bucket_name)

This S3 bucket is very large -- all the simulation inputs, trajectories, and outputs are in here, so it will take a while to enumerate every object. Instead, we'll just make a generator and pull out a single item for proof-of-concept.

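# Paginate because list_objects_v2 returns at most 1000 keys per call
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name)

def page_iterator(pages):
    """Lazily yield every object key, one page at a time."""
    for page in pages:
        # 'Contents' is absent when a page holds no objects
        for item in page.get('Contents', []):
            yield item['Key']

all_objects = page_iterator(pages)
next(all_objects)  # only the first page is fetched here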
'PROJ14377/RUN0/CLONE0/frame0.tpr'
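To peek at more than one key without walking the whole bucket, itertools.islice works on any generator. This is a small sketch; take is a hypothetical helper, and fake_keys is a stand-in generator with made-up keys so the example runs without a network connection.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items as a list, consuming no more than n."""
    return list(islice(iterable, n))


def fake_keys():
    """Stand-in for the S3 key generator; these keys are hypothetical."""
    for clone in range(1000):
        yield f"PROJ14377/RUN0/CLONE{clone}/frame0.tpr"


keys = fake_keys()
first_three = take(keys, 3)
```

Because the generator is only advanced three times, calling take on the real key generator should request only as many pages as it needs.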

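And if you wanted to, you could layer a filter over the generator to impose some logic, like keeping only the keys that contain exactly one slash (objects sitting directly inside a top-level directory). The filter is lazy as well, so nothing more is listed until you iterate over it.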

first_level_dirs = filter(lambda x: x.count('/')==1, all_objects)
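If the goal is just the names of the top-level directories, filtering every key means enumerating the entire bucket. One alternative is to pass Delimiter="/" to list_objects_v2, so that S3 groups keys server-side and returns them under CommonPrefixes. The sketch below is one way to do that; top_level_prefixes and list_top_level are hypothetical helper names, and the fake pages only mimic the response fields used here.

```python
def top_level_prefixes(pages):
    """Yield each 'directory' prefix from list_objects_v2 response pages."""
    for page in pages:
        # 'CommonPrefixes' only appears when a Delimiter was given
        # and the page contains at least one grouped prefix
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


def list_top_level(bucket_name):
    """List top-level prefixes of a public bucket (needs network access)."""
    # Imported here so the helper above can be used without boto3 installed
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Delimiter="/")
    return list(top_level_prefixes(pages))


# Offline illustration with hand-written pages shaped like S3 responses
fake_pages = [
    {
        "CommonPrefixes": [{"Prefix": "PROJ14377/"}, {"Prefix": "PROJ14378/"}],
        "Contents": [{"Key": "receptor_structures.tar.gz"}],
    },
    {"KeyCount": 0},  # a page with nothing in it
]
top_level = list(top_level_prefixes(fake_pages))
```

Against the real bucket, list_top_level("fah-public-data-covid19-absolute-free-energy") should return the project folders without touching the objects inside them.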

Unix-like Python filesystem libraries

S3FS, built on botocore and fsspec, has a very Unix-like syntax for navigating and opening files.

import s3fs
fs = s3fs.S3FileSystem(anon=True)
fs.ls(bucket_name)
['fah-public-data-covid19-absolute-free-energy/PROJ14377',
 'fah-public-data-covid19-absolute-free-energy/PROJ14378',
 'fah-public-data-covid19-absolute-free-energy/PROJ14379',
 'fah-public-data-covid19-absolute-free-energy/PROJ14380',
 'fah-public-data-covid19-absolute-free-energy/PROJ14383',
 'fah-public-data-covid19-absolute-free-energy/PROJ14384',
 'fah-public-data-covid19-absolute-free-energy/PROJ14630',
 'fah-public-data-covid19-absolute-free-energy/PROJ14631',
 'fah-public-data-covid19-absolute-free-energy/PROJ14650',
 'fah-public-data-covid19-absolute-free-energy/PROJ14651',
 'fah-public-data-covid19-absolute-free-energy/PROJ14652',
 'fah-public-data-covid19-absolute-free-energy/PROJ14653',
 'fah-public-data-covid19-absolute-free-energy/PROJ14654',
 'fah-public-data-covid19-absolute-free-energy/PROJ14655',
 'fah-public-data-covid19-absolute-free-energy/PROJ14656',
 'fah-public-data-covid19-absolute-free-energy/PROJ14665',
 'fah-public-data-covid19-absolute-free-energy/PROJ14666',
 'fah-public-data-covid19-absolute-free-energy/PROJ14667',
 'fah-public-data-covid19-absolute-free-energy/PROJ14668',
 'fah-public-data-covid19-absolute-free-energy/PROJ14669',
 'fah-public-data-covid19-absolute-free-energy/PROJ14670',
 'fah-public-data-covid19-absolute-free-energy/PROJ14671',
 'fah-public-data-covid19-absolute-free-energy/PROJ14702',
 'fah-public-data-covid19-absolute-free-energy/PROJ14703',
 'fah-public-data-covid19-absolute-free-energy/PROJ14704',
 'fah-public-data-covid19-absolute-free-energy/PROJ14705',
 'fah-public-data-covid19-absolute-free-energy/PROJ14723',
 'fah-public-data-covid19-absolute-free-energy/PROJ14724',
 'fah-public-data-covid19-absolute-free-energy/PROJ14726',
 'fah-public-data-covid19-absolute-free-energy/PROJ14802',
 'fah-public-data-covid19-absolute-free-energy/PROJ14803',
 'fah-public-data-covid19-absolute-free-energy/PROJ14804',
 'fah-public-data-covid19-absolute-free-energy/PROJ14805',
 'fah-public-data-covid19-absolute-free-energy/PROJ14806',
 'fah-public-data-covid19-absolute-free-energy/PROJ14807',
 'fah-public-data-covid19-absolute-free-energy/PROJ14808',
 'fah-public-data-covid19-absolute-free-energy/PROJ14809',
 'fah-public-data-covid19-absolute-free-energy/PROJ14810',
 'fah-public-data-covid19-absolute-free-energy/PROJ14811',
 'fah-public-data-covid19-absolute-free-energy/PROJ14812',
 'fah-public-data-covid19-absolute-free-energy/PROJ14813',
 'fah-public-data-covid19-absolute-free-energy/PROJ14823',
 'fah-public-data-covid19-absolute-free-energy/PROJ14824',
 'fah-public-data-covid19-absolute-free-energy/PROJ14826',
 'fah-public-data-covid19-absolute-free-energy/PROJ14833',
 'fah-public-data-covid19-absolute-free-energy/SVR51748107',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data',
 'fah-public-data-covid19-absolute-free-energy/receptor_structures.tar.gz',
 'fah-public-data-covid19-absolute-free-energy/setup_files']
fs.ls(bucket_name + "/free_energy_data")
['fah-public-data-covid19-absolute-free-energy/free_energy_data/',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_L_14382.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14717.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14718.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14719.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14720.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14817.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14818.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14819.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14820.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_L_14676.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14730.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14830.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_L_14374.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14721.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14821.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_L_14364.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14722.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14822.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_L_14369_14372_14370_14371.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14723.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14724.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14823.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14824.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_L_14376.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14725.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14825.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_L_14380.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14727.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14728.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14827.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14828.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_L_14378.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14752.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14852.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl']
with fs.open('fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt', 'r') as f:
    print(f.read())
hello aws!

with fs.open("fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl", 'rb') as f:
    organization_df = pd.read_pickle(f)
organization_df.head()
| | dataset | identity | receptor | score | v1_project | v1_run | v2_project | v2_run | v3_project | v3_run | project | run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72_RL | CCNCC(COC)Oc1ccccc1 | receptor-270-343.pdb | 0.999790 | 14600 | 0 | 14700 | 0 | 14800 | 0 | NaN | NaN |
| 1 | 72_RL | O=C(Cc1cccnc1)c1ccccc1 | receptor-343.pdb | 0.999652 | 14600 | 1 | 14700 | 1 | 14800 | 1 | NaN | NaN |
| 2 | 72_RL | CCCCC(N)c1cc(C)ccn1 | receptor-343.pdb | 0.999256 | 14600 | 2 | 14700 | 2 | 14800 | 2 | NaN | NaN |
| 3 | 72_RL | COCC(C)Nc1ccncn1 | receptor-343.pdb | 0.999096 | 14600 | 3 | 14700 | 3 | 14800 | 3 | NaN | NaN |
| 4 | 72_RL | CCN(CC)CCNc1ccc(C#N)cn1 | receptor-270-343.pdb | 0.998980 | 14600 | 4 | 14700 | 4 | 14800 | 4 | NaN | NaN |
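The file names in the free_energy_data/ listing above appear to follow a dataset_kind_projects.pkl pattern. My guess, based on column names like ns_L and ns_RL, is that L means ligand-only and RL means receptor-ligand, and that the trailing numbers are project IDs, but that interpretation is an assumption and not documented in the bucket as far as I can tell. A minimal sketch for splitting the names apart (parse_result_name is a hypothetical helper):

```python
import re

# dataset, then "L" or "RL", then one or more underscore-separated numbers
_NAME_PATTERN = re.compile(
    r"^(?P<dataset>.+)_(?P<kind>RL|L)_(?P<projects>\d+(?:_\d+)*)\.pkl$"
)


def parse_result_name(path):
    """Split a free_energy_data file name into (dataset, kind, projects).

    Returns None for names that do not fit the assumed pattern,
    such as results.pkl or hello.txt.
    """
    filename = path.rsplit("/", 1)[-1]  # drop any bucket/directory part
    match = _NAME_PATTERN.match(filename)
    if match is None:
        return None
    projects = tuple(int(p) for p in match.group("projects").split("_"))
    return match.group("dataset"), match.group("kind"), projects


prefix = "fah-public-data-covid19-absolute-free-energy/free_energy_data/"
parsed = parse_result_name(prefix + "MS0326_L_14369_14372_14370_14371.pkl")
```

Applied to the output of fs.ls, this makes it easy to group the pickles by dataset before loading any of them.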