import requests
import pandas as pd
API for Programmatic Access
APIs (Application Programming Interfaces) allow programmatic access to data resources. If you need to fetch and analyse metadata for many Samples, you can write a script to fetch data in this way.
The HoloFood Data Portal API gives programmatic access to the HoloFood Samples and their metadata, as well as URLs for where datasets are stored in public archives.
Canonical URLs
Throughout the API, canonical_urls are returned which point to the canonical database entry, i.e. the authoritative source, for each data object.
These are:
- BioSamples for Samples (both host-animal level and derived-extraction level).
- The European Nucleotide Archive Browser for nucleotide sequencing data.
- MGnify for metagenomic-derived analyses and MAGs (metagenome-assembled genomes).
- MetaboLights for metabolomic analyses.
- The websites of various partner institutions and registries where an IRI has been supplied with a metadata entry.
- The HoloFood Data Portal itself for “Analysis Summaries”, which are documents hosted only by this website.
- The HoloFood Data Portal itself for viral annotations.
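As a rough illustration of the list above, a script can guess which archive a returned URL belongs to by inspecting its host and path. This is only a sketch: `archive_for` is a hypothetical helper, and the path prefixes are assumptions inferred from the example URLs shown on this page, not an official mapping.

```python
from urllib.parse import urlparse

# Path prefixes on www.ebi.ac.uk, inferred from example URLs on this page (assumption).
EBI_PATH_PREFIXES = {
    "/biosamples": "BioSamples",
    "/ena/browser": "European Nucleotide Archive",
    "/metagenomics": "MGnify",
    "/metabolights": "MetaboLights",
}


def archive_for(url):
    """Guess which archive a URL points to (hypothetical helper)."""
    if not url:
        return None  # the API returns null when no URL is available
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("holofooddata.org"):
        return "HoloFood Data Portal"
    if host == "www.ebi.ac.uk":
        for prefix, name in EBI_PATH_PREFIXES.items():
            if parsed.path.startswith(prefix):
                return name
    # Partner institutions and registries fall through to here
    return "other"


print(archive_for("https://www.ebi.ac.uk/biosamples/SAMEA112369857"))
```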
API Endpoints and Playground
The “Browsable API Playground” is the best place to discover the API. Find this under the API navigation item on the data portal.
The browsable API lets you see the endpoints and their response schemas. Under an endpoint, press Try it out to see the output for a specific query.
In brief, the top-level endpoints are:
/api/samples
List samples, or fetch details about a specific sample (like its metadata). List all possible metadata markers (i.e. keys).
/api/animals
List animals (hosts), or fetch details about a specific animal (like its metadata and derived samples).
/api/analysis-summaries
List summary analyses published on the data portal.
/api/genome-catalogues
List MAG catalogues, or fetch detail about a catalogue, or list the MAGs within a catalogue.
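When scripting against these endpoints, it can help to build URLs in one place. The sketch below assumes that every top-level endpoint follows the same list/detail pattern that this page shows for /api/samples; `endpoint_url` is a hypothetical helper, so check the browsable API for the exact paths and parameters of each endpoint.

```python
from urllib.parse import urlencode

API_BASE = "https://www.holofooddata.org/api"


def endpoint_url(resource, accession=None, **params):
    """Build a list URL, a detail URL, or either with query parameters."""
    url = f"{API_BASE}/{resource}"
    if accession:
        # Detail pattern as shown for samples; assumed for other resources
        url += f"/{accession}"
    if params:
        url += "?" + urlencode(params)
    return url


print(endpoint_url("samples"))
print(endpoint_url("samples", "SAMEA112949944"))
print(endpoint_url("samples", page=2, sample_type="metagenomic_assembly"))
```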
Using the API
From the command line
Use a command line tool like cURL to query the API. Responses are in JSON format.
For example, to list all samples:
curl https://www.holofooddata.org/api/samples
To find the Stearic acid 18:0 data associated with Fatty Acids sample SAMEA112949944, we could run the following (using jq to handle JSON data on the command line):
curl https://www.holofooddata.org/api/samples/SAMEA112949944 | jq '.structured_metadata | .[] | select(.marker.name == "Stearic acid 18:0")'
and get
{
  "marker": {
    "name": "Stearic acid 18:0",
    "type": "FATTY ACIDS MG",
    "canonical_url": null
  },
  "measurement": "1,47",
  "units": "mg/g"
}
{
  "marker": {
    "name": "Stearic acid 18:0",
    "type": "FATTY ACIDS PERCENTAGE",
    "canonical_url": null
  },
  "measurement": "1,47",
  "units": "%"
}
From Python
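The selection that the jq filter performs can also be written in plain Python. The sketch below works on a trimmed-down stand-in for the sample JSON rather than a live response. `select_marker` and `to_float` are hypothetical helpers, the third metadata entry is invented purely to show the filtering, and treating the comma in "1,47" as a decimal separator is an assumption about how measurements are recorded.

```python
def select_marker(sample, marker_name):
    """Return every structured_metadata entry whose marker has the given name."""
    return [
        entry
        for entry in sample.get("structured_metadata", [])
        if entry["marker"]["name"] == marker_name
    ]


def to_float(measurement):
    """Parse a measurement string, treating a comma as the decimal separator (assumption)."""
    return float(measurement.replace(",", "."))


# Stand-in for the JSON returned by /api/samples/SAMEA112949944.
sample = {
    "structured_metadata": [
        {"marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS MG", "canonical_url": None},
         "measurement": "1,47", "units": "mg/g"},
        {"marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS PERCENTAGE", "canonical_url": None},
         "measurement": "1,47", "units": "%"},
        # Invented entry, present only so that the filter has something to exclude
        {"marker": {"name": "Hypothetical marker", "type": "EXAMPLE", "canonical_url": None},
         "measurement": "0,5", "units": "mg/g"},
    ]
}

for entry in select_marker(sample, "Stearic acid 18:0"):
    print(entry["marker"]["type"], to_float(entry["measurement"]), entry["units"])
```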
More realistically, to fetch a list of samples as a pandas DataFrame, first install the two libraries:
pip install requests pandas
Then request the samples endpoint and normalise the returned items into a DataFrame:
samples = requests.get('https://www.holofooddata.org/api/samples')
samples_df = pd.json_normalize(samples.json()['items'])
samples_df.head(3)
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
|---|---|---|---|---|---|---|---|
| 0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
However, we haven’t got all of the data: we only have one page.
len(samples_df)
100
The API response does tell us how many items there are in total:
samples.json()['count']
9889
The ?page= URL query parameter lets us retrieve subsequent pages.
samples_endpoint_base = 'https://www.holofooddata.org/api/samples'
page = 1
# Only fetch the first few pages for now: the loop stops before max_pages, i.e. after page 9
max_pages = 10
while page and page < max_pages:
    print(f'Fetching {page=}')
    # {page=} expands to "page=<number>", which doubles as the query string
    samples_page = requests.get(f'{samples_endpoint_base}?{page=}').json()
    samples_page_df = pd.json_normalize(samples_page['items'])
    if page == 1:
        samples_df = samples_page_df
    else:
        samples_df = pd.concat([samples_df, samples_page_df])
    page += 1
    if len(samples_df) >= samples_page['count']:
        # Everything has been fetched, so end the loop
        page = False
Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
len(samples_df)
900
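The pagination pattern above can be wrapped in a reusable function. The following is a sketch rather than the portal's own client: it uses only the standard library so that it runs without extra installs, `fetch_json` and `fetch_all_items` are hypothetical names, and it relies on the `items` and `count` keys and the `page` parameter shown above. The `fetch` argument is injectable so the loop can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SAMPLES_ENDPOINT = "https://www.holofooddata.org/api/samples"


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_all_items(endpoint, params=None, max_pages=None, fetch=fetch_json):
    """Collect `items` from successive pages until `count` items are held.

    Stops early after `max_pages` pages, or if a page comes back empty.
    """
    items = []
    page = 1
    while max_pages is None or page <= max_pages:
        query = urlencode({**(params or {}), "page": page})
        payload = fetch(f"{endpoint}?{query}")
        if not payload["items"]:
            break
        items.extend(payload["items"])
        if len(items) >= payload["count"]:
            break
        page += 1
    return items
```

Passing the result to `pd.json_normalize(...)` once, after the loop, gives a single DataFrame with a fresh index, instead of the repeated 0 to 99 index that page-by-page concatenation produces.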
The combined DataFrame now holds nine pages of samples:
samples_df
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
|---|---|---|---|---|---|---|---|
| 0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 3 | SAMEA10104912 | CA03.10F1a | metagenomic_assembly | SAMEA112904752 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 4 | SAMEA10104914 | CA04.10F1a | metagenomic_assembly | SAMEA112904915 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | SAMEA112369857 | SB11.04B1a | transcriptomic | SAMEA112949107 | https://www.ebi.ac.uk/biosamples/SAMEA112369857 | None | None |
| 96 | SAMEA112369858 | SB11.05B1a | transcriptomic | SAMEA112949088 | https://www.ebi.ac.uk/biosamples/SAMEA112369858 | None | None |
| 97 | SAMEA112369859 | SB11.11B1a | transcriptomic | SAMEA112949878 | https://www.ebi.ac.uk/biosamples/SAMEA112369859 | None | None |
| 98 | SAMEA112369860 | SB11.12B1a | transcriptomic | SAMEA112948858 | https://www.ebi.ac.uk/biosamples/SAMEA112369860 | None | None |
| 99 | SAMEA112369861 | SB11.13B1a | transcriptomic | SAMEA112948701 | https://www.ebi.ac.uk/biosamples/SAMEA112369861 | None | None |
900 rows × 7 columns
To find all metagenomic assembly samples, filter on the sample_type column:
metagenome_assembly_samples = samples_df[samples_df.sample_type == 'metagenomic_assembly']
print(len(metagenome_assembly_samples))
metagenome_assembly_samples.head(3)
458
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
|---|---|---|---|---|---|---|---|
| 0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
Filters using query parameters
We didn’t need to fetch all the samples to then select only the metagenomic ones. We can instead fetch only the metagenomic ones using a query parameter filter. (If you’re familiar with SQL, this eventually maps to a WHERE clause on the database.)
page = 1
while page:
    print(f'Fetching {page=}')
    mg_samples_page = requests.get(f'{samples_endpoint_base}?{page=}&sample_type=metagenomic_assembly').json()
    mg_samples_page_df = pd.json_normalize(mg_samples_page['items'])
    if page == 1:
        mg_samples_df = mg_samples_page_df
    else:
        mg_samples_df = pd.concat([mg_samples_df, mg_samples_page_df])
    page += 1
    if len(mg_samples_df) >= mg_samples_page['count']:
        page = False
Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
Fetching page=10
Fetching page=11
Fetching page=12
Fetching page=13
Fetching page=14
mg_samples_df
| | accession | title | sample_type | animal | canonical_url | metagenomics_url | metabolomics_url |
|---|---|---|---|---|---|---|---|
| 0 | SAMEA10104908 | CA01.07F1a | metagenomic_assembly | SAMEA112905066 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 1 | SAMEA10104910 | CA02.12F1a | metagenomic_assembly | SAMEA112904813 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 2 | SAMEA10104911 | CA02.18F1a | metagenomic_assembly | SAMEA112904777 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 3 | SAMEA10104912 | CA03.10F1a | metagenomic_assembly | SAMEA112904752 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 4 | SAMEA10104914 | CA04.10F1a | metagenomic_assembly | SAMEA112904915 | https://www.ebi.ac.uk/ena/browser/view/SAMEA10... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 78 | SAMEA9449961 | CC17.06F1a | metagenomic_assembly | SAMEA112905387 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 79 | SAMEA9449962 | CC17.10F1a | metagenomic_assembly | SAMEA112905371 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 80 | SAMEA9449963 | CC16.16F1a | metagenomic_assembly | SAMEA112905164 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 81 | SAMEA9449964 | CC18.03F1a | metagenomic_assembly | SAMEA112904811 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
| 82 | SAMEA9449965 | CC18.05F1a | metagenomic_assembly | SAMEA112905140 | https://www.ebi.ac.uk/ena/browser/view/SAMEA94... | https://www.ebi.ac.uk/metagenomics/api/v1/samp... | None |
1383 rows × 7 columns
And so, we’ve fetched the metagenomic assemblies more efficiently, without pulling pages and pages of irrelevant data.
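To put a number on the saving, the request count can be estimated from the `count` values reported above. This sketch assumes a page size of 100, which is what the first page returned earlier; `pages_needed` is a hypothetical helper.

```python
import math


def pages_needed(count, page_size=100):
    """Number of page requests needed to retrieve `count` items."""
    return math.ceil(count / page_size)


print(pages_needed(9889))  # all samples: 99 requests
print(pages_needed(1383))  # metagenomic_assembly samples only: 14 requests
```

The second figure matches the 14 pages fetched by the filtered loop.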
MGnify
You can obtain information about the analyses present on MGnify for a certain sample, using the metagenomics_url returned by the data portal API:
response = requests.get('https://www.holofooddata.org/api/samples/SAMEA13604493')
mgnify_url = response.json().get('metagenomics_url')
mgnify_url
'https://www.ebi.ac.uk/metagenomics/api/v1/samples/SAMEA13604493'
You can then use that MGnify API endpoint to fetch information about that sample and follow related links through MGnify:
mgnify_response = requests.get(mgnify_url)
mgnify_data = mgnify_response.json().get('data')
runs_url = mgnify_data['relationships']['runs']['links']['related']
mgnify_runs_response = requests.get(runs_url)
mgnify_runs_data = mgnify_runs_response.json().get('data')
mgnify_runs_data
[{'type': 'runs',
'id': 'ERR9358532',
'attributes': {'experiment-type': 'metagenomic',
'is-private': False,
'accession': 'ERR9358532',
'secondary-accession': 'ERR9358532',
'ena-study-accession': None,
'instrument-platform': 'ILLUMINA',
'instrument-model': 'Illumina NovaSeq 6000'},
'relationships': {'analyses': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses'}},
'assemblies': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies'},
'data': [{'type': 'assemblies',
'id': 'ERZ13633665',
'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}}]},
'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
'pipelines': {'data': []}},
'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532'}}]
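Following related links by hand means indexing several levels deep, and not every relationship carries a link (in the record above, `pipelines` only has `data`). A small accessor can make this safer. This is a sketch: `related_link` is a hypothetical helper, and `run` is an excerpt of the record shown above rather than a live response.

```python
def related_link(resource, relationship):
    """Return the `related` URL of a relationship, or None if it has no link."""
    rel = resource.get("relationships", {}).get(relationship) or {}
    return rel.get("links", {}).get("related")


# Excerpt of the run record shown above.
run = {
    "type": "runs",
    "id": "ERR9358532",
    "relationships": {
        "analyses": {"links": {"related": "https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses"}},
        "study": {"data": {"type": "studies", "id": "MGYS00006086"},
                  "links": {"related": "https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086"}},
        "pipelines": {"data": []},
    },
}

print(related_link(run, "analyses"))
print(related_link(run, "pipelines"))  # None: this relationship carries no link
```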
In most cases, you can also use the base endpoint you’re interested in (e.g. /analyses) with a filter based on the sample accession. For example, to find the MGnify analyses of a sample:
mgnify_response = requests.get('https://www.ebi.ac.uk/metagenomics/api/v1/analyses?sample_accession=SAMEA13604493')
mgnify_response.json().get('data')
[{'type': 'analysis-jobs',
'id': 'MGYA00617383',
'attributes': {'analysis-status': 'completed',
'accession': 'MGYA00617383',
'experiment-type': 'assembly',
'analysis-summary': [{'key': 'Submitted nucleotide sequences',
'value': '344'},
{'key': 'Nucleotide sequences after format-specific filtering',
'value': '344'},
{'key': 'Nucleotide sequences after length filtering', 'value': '344'},
{'key': 'Nucleotide sequences after undetermined bases filtering',
'value': '344'},
{'key': 'Reads with predicted CDS', 'value': '330'},
{'key': 'Reads with predicted RNA', 'value': '13'},
{'key': 'Reads with InterProScan match', 'value': '108'},
{'key': 'Predicted CDS', 'value': '531'},
{'key': 'Predicted CDS with InterProScan match', 'value': '127'},
{'key': 'Total InterProScan matches', 'value': '299'},
{'key': 'Predicted SSU sequences', 'value': '0'},
{'key': 'Predicted LSU sequences', 'value': '2'}],
'pipeline-version': '5.0',
'is-private': False,
'complete-time': '2022-12-01T19:35:07',
'instrument-platform': 'ILLUMINA',
'instrument-model': 'Illumina NovaSeq 6000'},
'relationships': {'taxonomy-lsu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/lsu'}},
'go-slim': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-slim'}},
'antismash-gene-clusters': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/antismash-gene-clusters'}},
'genome-properties': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/genome-properties'}},
'run': {'data': None},
'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
'go-terms': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-terms'}},
'taxonomy-itsunite': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/unite'}},
'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
'taxonomy-itsonedb': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/itsonedb'}},
'taxonomy': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy'}},
'assembly': {'data': {'type': 'assemblies', 'id': 'ERZ13633665'},
'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}},
'downloads': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/downloads'}},
'interpro-identifiers': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/interpro-identifiers'}},
'taxonomy-ssu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/ssu'}}},
'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383'}}]
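The `analysis-summary` attribute arrives as a list of key/value pairs with string values, which is awkward to compute with. The sketch below converts it into a dictionary of integers. `summary_as_dict` is a hypothetical helper, `analysis` is an excerpt of the record shown above, and the conversion assumes every value is an integer string, as it is in this example.

```python
def summary_as_dict(analysis):
    """Convert the `analysis-summary` key/value list into a dict of integers."""
    return {
        row["key"]: int(row["value"])
        for row in analysis["attributes"]["analysis-summary"]
    }


# Excerpt of the analysis record shown above.
analysis = {
    "id": "MGYA00617383",
    "attributes": {
        "analysis-summary": [
            {"key": "Submitted nucleotide sequences", "value": "344"},
            {"key": "Predicted CDS", "value": "531"},
            {"key": "Predicted CDS with InterProScan match", "value": "127"},
        ]
    },
}

summary = summary_as_dict(analysis)
fraction = summary["Predicted CDS with InterProScan match"] / summary["Predicted CDS"]
print(f"{fraction:.1%} of predicted CDS have an InterProScan match")
```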
MetaboLights
You can find the MetaboLights URL for a study from the portal API:
mtbls_response = requests.get('https://www.holofooddata.org/api/samples/SAMEA112907471')
mtbls_url = mtbls_response.json().get('metabolomics_url')
mtbls_url
'https://www.ebi.ac.uk/metabolights/ws/studies/MTBLS6988'
This URL is for the entire study on MetaboLights, rather than a specific sample. The data portal website (but not the API, for performance reasons) navigates the MetaboLights API in order to find the files relevant to a particular sample.
The code that does this in the data portal shows how this can be achieved. In short: fetch the files list from the MetaboLights API, and find the metadata_sample file (it is an ISA-TAB file). Look through that file for the row or rows where Comment[BioSamples accession] matches your sample(s) of interest. This field may have slightly different names in different studies. This lets you find the Sample Name corresponding to the BioSample accession.
Then, find the metadata_assay file(s) from the file list. Concatenate all of these metadata_assay tab files, and find all the rows from all of the metadata_assay sheets where the Sample Name matches.
These assay sheets will have columns of interest about the assays used, the raw and derived files, and the metabolites association files. These are filenames which you can then download.
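The matching steps described above can be sketched with the standard csv module once the sheet contents have been downloaded. This is not the data portal's own code. The miniature sheets, accessions and file names below are invented for illustration, and the assay column name used here is an assumption about what a given study provides; only Comment[BioSamples accession] and Sample Name come from the description above.

```python
import csv
import io

ACCESSION_COLUMN = "Comment[BioSamples accession]"  # may differ between studies


def read_tab(text):
    """Parse the text of a tab-separated ISA-TAB sheet into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def sample_names_for(sample_rows, accession, column=ACCESSION_COLUMN):
    """Sample Names whose BioSamples accession column matches `accession`."""
    return {row["Sample Name"] for row in sample_rows if row.get(column) == accession}


def assay_rows_for(assay_sheets, names):
    """Rows from all assay sheets, taken together, whose Sample Name is in `names`."""
    return [row for sheet in assay_sheets for row in sheet if row["Sample Name"] in names]


# Invented miniature sheets, for illustration only (not real MetaboLights data).
sample_sheet = read_tab(
    "Sample Name\tComment[BioSamples accession]\n"
    "sample_A\tSAMEA0000001\n"
    "sample_B\tSAMEA0000002\n"
)
assay_sheet_1 = read_tab(
    "Sample Name\tRaw Spectral Data File\n"
    "sample_A\tA_pos.raw\n"
    "sample_B\tB_pos.raw\n"
)
assay_sheet_2 = read_tab(
    "Sample Name\tRaw Spectral Data File\n"
    "sample_A\tA_neg.raw\n"
)

names = sample_names_for(sample_sheet, "SAMEA0000001")
matching = assay_rows_for([assay_sheet_1, assay_sheet_2], names)
files = [row["Raw Spectral Data File"] for row in matching]
print(files)  # ['A_pos.raw', 'A_neg.raw']
```

Keeping the accession column name as a parameter reflects the note above that this field may be named slightly differently from study to study.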
From R
Users of the R language may wish to use the HoloFoodR package to access the API.