import requests
import pandas as pd
API for Programmatic Access
APIs (Application Programming Interfaces) allow programmatic access to data resources. If you need to fetch and analyse metadata for many Samples, you can write a script to fetch data in this way.
The HoloFood Data Portal API gives programmatic access to the HoloFood Samples and their metadata, as well as URLs for where datasets are stored in public archives.
Canonical URLs
Throughout the API, canonical_url values are returned, which point to the canonical database entry, i.e. the authoritative source, for each data object.
These are:
- BioSamples for Samples (both host-animal level and derived-extraction level).
- The European Nucleotide Archive Browser for nucleotide sequencing data.
- MGnify for metagenomic-derived analyses and MAGs (metagenome-assembled genomes).
- MetaboLights for metabolomic analyses.
- The websites of various partner institutions and registries where an IRI has been supplied with a metadata entry.
- The HoloFood Data Portal itself for “Analysis Summaries”, which are documents hosted only by this website.
- The HoloFood Data Portal itself for viral annotations.
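When working with many records it can help to label each canonical_url by the archive it points to. The sketch below is one hypothetical way to do that: the helper name archive_for and the prefix table are assumptions for illustration, built only from the URL patterns that appear in the example responses in this guide, and are not part of the portal API.

```python
from urllib.parse import urlparse

# Path prefixes on www.ebi.ac.uk seen in this guide's example responses.
# The mapping is an illustrative assumption, not an official registry.
EBI_PREFIXES = {
    "/biosamples/": "BioSamples",
    "/ena/browser/": "European Nucleotide Archive",
    "/metagenomics/": "MGnify",
    "/metabolights/": "MetaboLights",
}


def archive_for(canonical_url):
    """Guess which archive a canonical_url points to (hypothetical helper)."""
    if not canonical_url:
        return None  # some entries have no canonical URL at all
    parsed = urlparse(canonical_url)
    if parsed.netloc.endswith("holofooddata.org"):
        return "HoloFood Data Portal"
    if parsed.netloc == "www.ebi.ac.uk":
        for prefix, name in EBI_PREFIXES.items():
            if parsed.path.startswith(prefix):
                return name
    return "Other (partner institution or registry)"


print(archive_for("https://www.ebi.ac.uk/biosamples/SAMEA112369857"))
```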
API Endpoints and Playground
The “Browsable API Playground” is the best place to discover the API. Find this under the API navigation item on the data portal.
The browsable API lets you see the endpoints and their response schemas. Under an endpoint, press Try it out to see the output for a specific query.
In brief, the top-level endpoints are:
/api/samples
List samples, or fetch details about a specific sample (like its metadata). List all possible metadata markers (i.e. keys).
/api/animals
List animals (hosts), or fetch details about a specific animal (like its metadata and derived samples).
/api/analysis-summaries
List summary analyses published on the data portal.
/api/genome-catalogues
List MAG catalogues, or fetch detail about a catalogue, or list the MAGs within a catalogue.
Using the API
From the command line
Use a command line tool like cURL to query the API. Responses are in JSON format.
For example, to list all samples:
curl https://www.holofooddata.org/api/samples
To find the Stearic acid 18:0 data associated with the Fatty Acids sample SAMEA112949944, we could run the following (using jq to handle JSON data on the command line):
curl https://www.holofooddata.org/api/samples/SAMEA112949944 | jq '.structured_metadata | .[] | select(.marker.name == "Stearic acid 18:0")'
and get
{
"marker": {
"name": "Stearic acid 18:0",
"type": "FATTY ACIDS MG",
"canonical_url": null
},
"measurement": "1,47",
"units": "mg/g"
}
{
"marker": {
"name": "Stearic acid 18:0",
"type": "FATTY ACIDS PERCENTAGE",
"canonical_url": null
},
"measurement": "1,47",
"units": "%"
}
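The same selection can be done in Python once the JSON has been loaded. The sketch below works on a trimmed copy of the response shown above, so it runs without network access; the helper names select_marker and to_float are made up for illustration. Note that the measurement is a string with a decimal comma, so it needs converting before any arithmetic. The conversion assumes commas are only ever decimal separators.

```python
import json

# A trimmed copy of the response shape shown above; a real response from
# /api/samples/<accession> contains more fields and more markers.
sample_json = json.loads("""
{
  "structured_metadata": [
    {"marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS MG", "canonical_url": null},
     "measurement": "1,47", "units": "mg/g"},
    {"marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS PERCENTAGE", "canonical_url": null},
     "measurement": "1,47", "units": "%"}
  ]
}
""")


def select_marker(sample, marker_name):
    """Return the structured_metadata entries whose marker has the given name."""
    return [
        entry
        for entry in sample.get("structured_metadata", [])
        if entry["marker"]["name"] == marker_name
    ]


def to_float(measurement):
    """Parse a measurement string, accepting a decimal comma.

    Assumption: a comma is a decimal separator, never a thousands separator.
    """
    return float(measurement.replace(",", "."))


for entry in select_marker(sample_json, "Stearic acid 18:0"):
    print(entry["marker"]["type"], to_float(entry["measurement"]), entry["units"])
```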
From Python
More realistically, to fetch a list of samples as a Pandas dataframe:
pip install requests pandas
samples = requests.get('https://www.holofooddata.org/api/samples')
samples_df = pd.json_normalize(samples.json()['items'])
samples_df.head(3)
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
---|---|---|---|---|---|---|---|
0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
However… we haven’t got all of the data. We only have one page.
The ?page= URL query parameter lets us retrieve subsequent pages.
len(samples_df)
100
The API response does tell us how many items there are in total:
samples.json()['count']
9889
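The total count also tells us roughly how many requests a full download would take. This small calculation assumes every page except the last holds 100 items, which matches the first page above but is not something the response guarantees.

```python
import math

total_items = 9889  # the 'count' value returned above
page_size = 100     # the number of rows in the first page above (assumed constant)

# Round up, because a final partial page still needs its own request.
pages_needed = math.ceil(total_items / page_size)
print(pages_needed)
```

With these numbers, a full listing of all samples would take 99 requests.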
samples_endpoint_base = 'https://www.holofooddata.org/api/samples'
page = 1
# We will only fetch the first few pages for now: the loop stops
# before reaching max_pages, so pages 1 to 9 are fetched.
max_pages = 10
while page and page < max_pages:
    print(f'Fetching {page=}')
    samples_page = requests.get(f'{samples_endpoint_base}?{page=}').json()
    samples_page_df = pd.json_normalize(samples_page['items'])
    if page == 1:
        samples_df = samples_page_df
    else:
        samples_df = pd.concat([samples_df, samples_page_df])
    page += 1
    # Stop early once every item reported by the API has been collected
    if len(samples_df) >= samples_page['count']:
        page = False
Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
len(samples_df)
900
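If you page through several endpoints, the loop can be wrapped in a reusable generator. The sketch below is one possible shape, not the portal's own client: the names fetch_json, iter_items and fake_fetch are hypothetical, it relies only on the 'items' and 'count' keys seen in the responses above, and it uses the standard library so that the example can be run offline against a stand-in fetcher.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_items(base_url, params=None, max_pages=None, fetch=fetch_json):
    """Yield items page by page until 'count' items have been seen.

    Assumes each response has 'items' and 'count' keys. max_pages is
    inclusive. `fetch` is injectable so the loop can be tested offline.
    """
    seen = 0
    page = 1
    while max_pages is None or page <= max_pages:
        query = urlencode({**(params or {}), "page": page})
        data = fetch(f"{base_url}?{query}")
        items = data["items"]
        if not items:
            break  # guard against looping forever on an empty page
        yield from items
        seen += len(items)
        if seen >= data["count"]:
            break
        page += 1


# Offline demonstration with a stand-in fetcher: 25 fake items, 10 per page.
FAKE_ITEMS = [{"accession": f"FAKE{i}"} for i in range(25)]


def fake_fetch(url):
    page = int(parse_qs(urlsplit(url).query)["page"][0])
    start = (page - 1) * 10
    return {"count": len(FAKE_ITEMS), "items": FAKE_ITEMS[start:start + 10]}


all_items = list(iter_items("https://example.org/api/samples", fetch=fake_fetch))
print(len(all_items))
```

Against the real API, the collected items could be passed to pd.json_normalize in one call, which also avoids the repeated 0 to 99 index that concatenating page dataframes produces.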
Find all metagenomic assembly samples:
samples_df
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
---|---|---|---|---|---|---|---|
0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
3 | SAMEA10104912 | CA03.10F1a | metagenomic_assembly | SAMEA112904752 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
4 | SAMEA10104914 | CA04.10F1a | metagenomic_assembly | SAMEA112904915 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
... | ... | ... | ... | ... | ... | ... | ... |
95 | SAMEA112369857 | SB11.04B1a | transcriptomic | SAMEA112949107 | https://www.ebi.ac.uk/biosamples/SAMEA112369857 | None | None |
96 | SAMEA112369858 | SB11.05B1a | transcriptomic | SAMEA112949088 | https://www.ebi.ac.uk/biosamples/SAMEA112369858 | None | None |
97 | SAMEA112369859 | SB11.11B1a | transcriptomic | SAMEA112949878 | https://www.ebi.ac.uk/biosamples/SAMEA112369859 | None | None |
98 | SAMEA112369860 | SB11.12B1a | transcriptomic | SAMEA112948858 | https://www.ebi.ac.uk/biosamples/SAMEA112369860 | None | None |
99 | SAMEA112369861 | SB11.13B1a | transcriptomic | SAMEA112948701 | https://www.ebi.ac.uk/biosamples/SAMEA112369861 | None | None |
900 rows × 7 columns
metagenome_assembly_samples = samples_df[samples_df.sample_type == 'metagenomic_assembly']
print(len(metagenome_assembly_samples))
metagenome_assembly_samples.head(3)
458
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
---|---|---|---|---|---|---|---|
0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
Filters using query parameters
We didn’t need to fetch all the samples to then select only the metagenomic ones. We can instead fetch only the metagenomic ones using a query parameter filter. (If you’re familiar with SQL, this eventually maps to a WHERE clause on the database.)
page = 1
while page:
    print(f'Fetching {page=}')
    mg_samples_page = requests.get(f'{samples_endpoint_base}?{page=}&sample_type=metagenomic_assembly').json()
    mg_samples_page_df = pd.json_normalize(mg_samples_page['items'])
    if page == 1:
        mg_samples_df = mg_samples_page_df
    else:
        mg_samples_df = pd.concat([mg_samples_df, mg_samples_page_df])
    page += 1
    # Stop once every item matching the filter has been collected
    if len(mg_samples_df) >= mg_samples_page['count']:
        page = False
Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
Fetching page=10
Fetching page=11
Fetching page=12
Fetching page=13
Fetching page=14
mg_samples_df
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
---|---|---|---|---|---|---|---|
0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
3 | SAMEA10104912 | CA03.10F1a | metagenomic_assembly | SAMEA112904752 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
4 | SAMEA10104914 | CA04.10F1a | metagenomic_assembly | SAMEA112904915 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
... | ... | ... | ... | ... | ... | ... | ... |
78 | SAMEA9449961 | CC17.06F1a | metagenomic_assembly | SAMEA112905387 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
79 | SAMEA9449962 | CC17.10F1a | metagenomic_assembly | SAMEA112905371 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
80 | SAMEA9449963 | CC16.16F1a | metagenomic_assembly | SAMEA112905164 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
81 | SAMEA9449964 | CC18.03F1a | metagenomic_assembly | SAMEA112904811 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
82 | SAMEA9449965 | CC18.05F1a | metagenomic_assembly | SAMEA112905140 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
1383 rows × 7 columns
And so, we’ve fetched the metagenomic assemblies more efficiently, by not pulling pages and pages of irrelevant data.
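When combining a page number with one or more filters, building the query string with urllib.parse.urlencode avoids hand-written & and = characters and takes care of escaping. The helper samples_url below is a hypothetical convenience, not part of the API. Only the sample_type filter is confirmed by this guide; check the API playground for any other filter names.

```python
from urllib.parse import urlencode

samples_endpoint_base = "https://www.holofooddata.org/api/samples"


def samples_url(page=1, **filters):
    """Build a samples URL with a page number and optional filters."""
    # Drop filters set to None so callers can pass optional values through.
    query = {key: value for key, value in filters.items() if value is not None}
    query["page"] = page
    return f"{samples_endpoint_base}?{urlencode(query)}"


print(samples_url(page=2, sample_type="metagenomic_assembly"))
```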
MGnify
You can obtain information about the analyses present on MGnify for a certain sample, using the metagenomics_url returned by the data portal API:
response = requests.get('https://www.holofooddata.org/api/samples/SAMEA13604493')
mgnify_url = response.json().get('metagenomics_url')
mgnify_url
'https://www.ebi.ac.uk/metagenomics/api/v1/samples/SAMEA13604493'
You can then use that MGnify API endpoint to fetch information about that sample and follow related links through MGnify:
mgnify_response = requests.get(mgnify_url)
mgnify_data = mgnify_response.json().get('data')
# Follow the "runs" relationship link of the sample
runs_url = mgnify_data['relationships']['runs']['links']['related']

mgnify_runs_response = requests.get(runs_url)
mgnify_runs_data = mgnify_runs_response.json().get('data')
mgnify_runs_data
[{'type': 'runs',
'id': 'ERR9358532',
'attributes': {'experiment-type': 'metagenomic',
'is-private': False,
'accession': 'ERR9358532',
'secondary-accession': 'ERR9358532',
'ena-study-accession': None,
'instrument-platform': 'ILLUMINA',
'instrument-model': 'Illumina NovaSeq 6000'},
'relationships': {'analyses': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses'}},
'assemblies': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies'},
'data': [{'type': 'assemblies',
'id': 'ERZ13633665',
'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}}]},
'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
'pipelines': {'data': []}},
'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532'}}]
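The nested response can be flattened into just the fields you need. The sketch below uses a trimmed copy of the runs response shown above so that it runs offline; summarise_runs is a hypothetical helper, and it assumes the relationships follow the same shape as in that example.

```python
# A trimmed copy of the runs response shown above.
mgnify_runs_data = [
    {
        "type": "runs",
        "id": "ERR9358532",
        "attributes": {
            "experiment-type": "metagenomic",
            "instrument-model": "Illumina NovaSeq 6000",
        },
        "relationships": {
            "assemblies": {
                "links": {"related": "https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies"},
                "data": [{"type": "assemblies", "id": "ERZ13633665"}],
            },
        },
    }
]


def summarise_runs(runs):
    """Flatten each run into a small dict of commonly used fields."""
    rows = []
    for run in runs:
        # 'data' may be missing or None when a run has no assemblies
        assemblies = run.get("relationships", {}).get("assemblies", {}).get("data") or []
        rows.append({
            "run": run["id"],
            "experiment_type": run["attributes"].get("experiment-type"),
            "assemblies": [assembly["id"] for assembly in assemblies],
        })
    return rows


print(summarise_runs(mgnify_runs_data))
```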
In most cases, you can also use the base endpoint you’re interested in (e.g. /analyses) with a filter based on the sample accession. For example, to find the MGnify analyses of a sample:
mgnify_response = requests.get('https://www.ebi.ac.uk/metagenomics/api/v1/analyses?sample_accession=SAMEA13604493')
mgnify_response.json().get('data')
[{'type': 'analysis-jobs',
'id': 'MGYA00617383',
'attributes': {'analysis-status': 'completed',
'accession': 'MGYA00617383',
'experiment-type': 'assembly',
'analysis-summary': [{'key': 'Submitted nucleotide sequences',
'value': '344'},
{'key': 'Nucleotide sequences after format-specific filtering',
'value': '344'},
{'key': 'Nucleotide sequences after length filtering', 'value': '344'},
{'key': 'Nucleotide sequences after undetermined bases filtering',
'value': '344'},
{'key': 'Reads with predicted CDS', 'value': '330'},
{'key': 'Reads with predicted RNA', 'value': '13'},
{'key': 'Reads with InterProScan match', 'value': '108'},
{'key': 'Predicted CDS', 'value': '531'},
{'key': 'Predicted CDS with InterProScan match', 'value': '127'},
{'key': 'Total InterProScan matches', 'value': '299'},
{'key': 'Predicted SSU sequences', 'value': '0'},
{'key': 'Predicted LSU sequences', 'value': '2'}],
'pipeline-version': '5.0',
'is-private': False,
'complete-time': '2022-12-01T19:35:07',
'instrument-platform': 'ILLUMINA',
'instrument-model': 'Illumina NovaSeq 6000'},
'relationships': {'taxonomy-lsu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/lsu'}},
'go-slim': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-slim'}},
'antismash-gene-clusters': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/antismash-gene-clusters'}},
'genome-properties': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/genome-properties'}},
'run': {'data': None},
'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
'go-terms': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-terms'}},
'taxonomy-itsunite': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/unite'}},
'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
'taxonomy-itsonedb': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/itsonedb'}},
'taxonomy': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy'}},
'assembly': {'data': {'type': 'assemblies', 'id': 'ERZ13633665'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}},
'downloads': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/downloads'}},
'interpro-identifiers': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/interpro-identifiers'}},
'taxonomy-ssu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/ssu'}}},
'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383'}}]
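The analysis-summary attribute is a list of key/value pairs with string values, which is awkward to compute with. A minimal sketch, using a few rows copied from the output above and assuming every value is an integer string, turns it into a dictionary of numbers:

```python
# A few rows copied from the 'analysis-summary' attribute shown above.
analysis_summary = [
    {"key": "Submitted nucleotide sequences", "value": "344"},
    {"key": "Reads with predicted CDS", "value": "330"},
    {"key": "Predicted CDS", "value": "531"},
    {"key": "Predicted LSU sequences", "value": "2"},
]

# Assumption: every value is an integer written as a string.
summary = {row["key"]: int(row["value"]) for row in analysis_summary}

fraction_with_cds = (
    summary["Reads with predicted CDS"] / summary["Submitted nucleotide sequences"]
)
print(f"{fraction_with_cds:.1%}")
```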
MetaboLights
You can find the MetaboLights URL for a study from the portal API:
mtbls_response = requests.get('https://www.holofooddata.org/api/samples/SAMEA112907471')
mtbls_url = mtbls_response.json().get('metabolomics_url')
mtbls_url
'https://www.ebi.ac.uk/metabolights/ws/studies/MTBLS6988'
This URL is for the entire study on MetaboLights, rather than a specific sample. The data portal website (but not the API, for performance reasons) navigates the MetaboLights API in order to find the files relevant to a particular sample.
The code that does this in the data portal shows how this can be achieved. In short: fetch the files list from the MetaboLights API, and find the metadata_sample file (it is an ISA-TAB file). Look through that file for the row where Comment[BioSamples accession] is your sample of interest. This field may have slightly different names in different studies. This lets you find the Sample Name corresponding to the BioSample accession.
Then, find the metadata_assay file(s) from the file list. Concatenate all of these metadata_assay tab files, and find all the rows from all of the metadata_assay sheets where the Sample Name matches.
These assay sheets will have columns of interest about the assays used, the raw and derived files, and the metabolites association files. These are filenames which you can then download.
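The matching steps above can be sketched with the standard csv module. This is not the data portal's own code: the sheets below are tiny made-up stand-ins for downloaded ISA-TAB files, and the accessions, file names, helper names and the "Raw Spectral Data File" column are all illustrative assumptions. Only Sample Name and Comment[BioSamples accession] come from the description above.

```python
import csv
import io

# Made-up stand-ins for the ISA-TAB sample sheet and assay sheets.
SAMPLE_SHEET = (
    "Sample Name\tComment[BioSamples accession]\n"
    "S1\tSAMEA0000001\n"
    "S2\tSAMEA0000002\n"
)
ASSAY_SHEETS = [
    "Sample Name\tRaw Spectral Data File\nS1\trun_a.raw\nS2\trun_b.raw\n",
    "Sample Name\tRaw Spectral Data File\nS1\trun_c.raw\n",
]


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def sample_names_for(sample_rows, accession):
    """Find Sample Name values whose BioSamples column matches the accession.

    The column is located by a case-insensitive substring match because
    its exact name may differ between studies.
    """
    names = set()
    for row in sample_rows:
        for column, value in row.items():
            if "biosamples accession" in column.lower() and value == accession:
                names.add(row["Sample Name"])
    return names


def assay_rows_for(assay_texts, names):
    """Concatenate all assay sheets and keep rows for the given sample names."""
    return [
        row
        for text in assay_texts
        for row in read_tsv(text)
        if row["Sample Name"] in names
    ]


names = sample_names_for(read_tsv(SAMPLE_SHEET), "SAMEA0000001")
for row in assay_rows_for(ASSAY_SHEETS, names):
    print(row["Raw Spectral Data File"])
```

With real studies, the sheet texts would come from the files list of the MetaboLights study, and the file columns to read would depend on the assay type.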