API for Programmatic Access

APIs (Application Programming Interfaces) allow programmatic access to data resources. If you need to fetch and analyse metadata for many Samples, you can write a script to fetch data in this way.

The HoloFood Data Portal API gives programmatic access to the HoloFood Samples and their metadata, as well as URLs for where datasets are stored in public archives.

Canonical URLs

Throughout the API, canonical_urls are returned which point to the canonical database entry, i.e. the authoritative source, for each data object.

These are:

  • BioSamples for Samples (both host-animal level and derived-extraction level).
  • The European Nucleotide Archive Browser for nucleotide sequencing data.
  • MGnify for metagenomic-derived analyses and MAGs (metagenome-assembled genomes).
  • MetaboLights for metabolomic analyses.
  • The websites of various partner institutions and registries where an IRI has been supplied with a metadata entry.
  • The HoloFood Data Portal itself for “Analysis Summaries”, which are documents hosted only by this website.
  • The HoloFood Data Portal itself for viral annotations.
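
As a rough illustration of how these canonical URLs might be used in a script, the sketch below guesses which archive a canonical_url points to from its host and path. The path prefixes are assumptions taken from the example URLs that appear elsewhere on this page; they are not an official mapping, and the function name guess_archive is hypothetical.

```python
from urllib.parse import urlparse

# Path prefixes on www.ebi.ac.uk, taken from example URLs on this page.
# This mapping is an illustrative assumption, not part of the API.
EBI_PATH_PREFIXES = {
    "/biosamples": "BioSamples",
    "/ena/browser": "European Nucleotide Archive",
    "/metagenomics": "MGnify",
    "/metabolights": "MetaboLights",
}


def guess_archive(canonical_url):
    """Return a best-guess archive name for a canonical_url, or None."""
    if not canonical_url:
        return None
    parsed = urlparse(canonical_url)
    if parsed.netloc == "www.holofooddata.org":
        return "HoloFood Data Portal"
    if parsed.netloc == "www.ebi.ac.uk":
        for prefix, archive in EBI_PATH_PREFIXES.items():
            if parsed.path.startswith(prefix):
                return archive
    # Partner institution or registry IRIs end up here.
    return "other"


print(guess_archive("https://www.ebi.ac.uk/biosamples/SAMEA112369857"))
```

A canonical_url of null in the JSON (None in Python) simply yields None here.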

API Endpoints and Playground

The “Browsable API Playground” is the best place to discover the API. Find this under the API navigation item on the data portal.

The browsable API lets you see the endpoints and their response schemas. Under an endpoint, press Try it out to see the output for a specific query.

In brief, the top-level endpoints are:

/api/samples

List samples, or fetch details about a specific sample (like its metadata). List all possible metadata markers (i.e. keys).

/api/animals

List animals (hosts), or fetch details about a specific animal (like its metadata and derived samples).

/api/analysis-summaries

List summary analyses published on the data portal.

/api/genome-catalogues

List MAG catalogues, or fetch details about a catalogue, or list the MAGs within a catalogue.

/api/viral-catalogues

List viral catalogues, or fetch details about a catalogue, or list the fragments within a catalogue.
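
To make the endpoint list concrete, here is a minimal sketch of a helper that builds list and detail URLs. The detail pattern /api/samples/<accession> matches the sample examples on this page; applying the same pattern to the other endpoints is an assumption, so check the Browsable API Playground for the exact paths. The name endpoint_url is hypothetical.

```python
BASE_URL = "https://www.holofooddata.org/api"

# Top-level endpoints listed above.
ENDPOINTS = (
    "samples",
    "animals",
    "analysis-summaries",
    "genome-catalogues",
    "viral-catalogues",
)


def endpoint_url(resource, identifier=None):
    """Build a list URL, or a detail URL when an identifier is given."""
    if resource not in ENDPOINTS:
        raise ValueError(f"Unknown endpoint: {resource!r}")
    url = f"{BASE_URL}/{resource}"
    if identifier is not None:
        # Assumed pattern: /api/<resource>/<identifier>
        url = f"{url}/{identifier}"
    return url


print(endpoint_url("samples"))
print(endpoint_url("samples", "SAMEA112949944"))
```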

Using the API

From the command line

Use a command line tool like cURL to query the API. Responses are in JSON format.

For example, to list all samples:

curl https://www.holofooddata.org/api/samples

To find the Stearic acid 18:0 data associated with the Fatty Acids sample SAMEA112949944, we could run the following (using jq to handle JSON data on the command line):

curl https://www.holofooddata.org/api/samples/SAMEA112949944 | jq '.structured_metadata | .[] | select(.marker.name == "Stearic acid 18:0")'

which returns:

{
  "marker": {
    "name": "Stearic acid 18:0",
    "type": "FATTY ACIDS MG",
    "canonical_url": null
  },
  "measurement": "1,47",
  "units": "mg/g"
}
{
  "marker": {
    "name": "Stearic acid 18:0",
    "type": "FATTY ACIDS PERCENTAGE",
    "canonical_url": null
  },
  "measurement": "1,47",
  "units": "%"
}
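
The same selection can be sketched in Python. To keep the example runnable offline, it works on a trimmed, hand-copied stand-in for the sample JSON rather than calling the API; the third metadata entry is hypothetical and only there to show that the filter excludes it. The function name select_marker is also hypothetical.

```python
# Trimmed stand-in for the JSON returned by /api/samples/SAMEA112949944
# (only the structured_metadata field is reproduced here).
sample = {
    "structured_metadata": [
        {
            "marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS MG",
                       "canonical_url": None},
            "measurement": "1,47",
            "units": "mg/g",
        },
        {
            "marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS PERCENTAGE",
                       "canonical_url": None},
            "measurement": "1,47",
            "units": "%",
        },
        # Hypothetical extra entry, to show that the filter excludes it.
        {
            "marker": {"name": "Some other marker", "type": "EXAMPLE",
                       "canonical_url": None},
            "measurement": "0",
            "units": None,
        },
    ],
}


def select_marker(sample_json, marker_name):
    """Return the structured_metadata entries whose marker has this name."""
    return [
        entry
        for entry in sample_json.get("structured_metadata", [])
        if entry["marker"]["name"] == marker_name
    ]


for entry in select_marker(sample, "Stearic acid 18:0"):
    print(entry["marker"]["type"], entry["measurement"], entry["units"])
```

With the real API, the dictionary would instead come from the JSON body of the sample detail request.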

From Python

More realistically, you might fetch a list of samples as a pandas DataFrame:

Packages required
pip install requests pandas
import requests
import pandas as pd
samples = requests.get('https://www.holofooddata.org/api/samples')
samples_df = pd.json_normalize(samples.json()['items'])
samples_df.head(3)
accession title sample_type animal canonical_url metagenomics_url metabolomics_url
0 SAMEA10104908 CA01.07F1a metagenomic_assembly SAMEA112905066 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
1 SAMEA10104910 CA02.12F1a metagenomic_assembly SAMEA112904813 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
2 SAMEA10104911 CA02.18F1a metagenomic_assembly SAMEA112904777 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
Paginated data

However, this is not all of the data: we only have one page.

The ?page= URL query parameter lets us retrieve subsequent pages.

len(samples_df)
100

The API response does tell us how many items there are in total:

samples.json()['count']
9889
samples_endpoint_base = 'https://www.holofooddata.org/api/samples'
page = 1

# We will only fetch the first 9 pages for now (the loop stops before page reaches max_pages)...

max_pages = 10

while page and page < max_pages:
    print(f'Fetching {page=}')
    
    samples_page = requests.get(f'{samples_endpoint_base}?{page=}').json()
    samples_page_df = pd.json_normalize(samples_page['items'])
    
    if page == 1:
        samples_df = samples_page_df
    else:
        samples_df = pd.concat([samples_df, samples_page_df])
    page += 1
    if len(samples_df) >= samples_page['count']:
        page = False
Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
len(samples_df)
900
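
The loop above can also be wrapped in a reusable generator. The following is a sketch, not the portal's own client code: it assumes only that each response has 'items' and 'count' keys, as shown above, and it takes the page-fetching function as an argument so that it can be tried offline. The names iter_items and fake_fetch are hypothetical.

```python
def iter_items(fetch_page, max_pages=None):
    """Yield items page by page until 'count' items have been seen.

    fetch_page(page_number) must return a dict with 'items' and 'count',
    like the JSON responses shown above.
    """
    seen = 0
    page = 1
    while max_pages is None or page <= max_pages:
        payload = fetch_page(page)
        items = payload["items"]
        if not items:
            break  # Defensive stop: an empty page means nothing more to read.
        yield from items
        seen += len(items)
        if seen >= payload["count"]:
            break
        page += 1


# Offline stand-in for the API: 250 fake items served 100 per page.
FAKE_ITEMS = [{"accession": f"FAKE{i:04d}"} for i in range(250)]


def fake_fetch(page):
    start = (page - 1) * 100
    return {"items": FAKE_ITEMS[start:start + 100], "count": len(FAKE_ITEMS)}


print(len(list(iter_items(fake_fetch))))               # every page
print(len(list(iter_items(fake_fetch, max_pages=2))))  # first two pages only
```

Against the real API, fetch_page could be something like lambda page: requests.get(samples_endpoint_base, params={'page': page}).json(), and the collected items can be passed to pd.json_normalize in one go instead of concatenating a dataframe per page.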

Find all metagenomic assembly samples:

samples_df
accession title sample_type animal canonical_url metagenomics_url metabolomics_url
0 SAMEA10104908 CA01.07F1a metagenomic_assembly SAMEA112905066 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
1 SAMEA10104910 CA02.12F1a metagenomic_assembly SAMEA112904813 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
2 SAMEA10104911 CA02.18F1a metagenomic_assembly SAMEA112904777 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
3 SAMEA10104912 CA03.10F1a metagenomic_assembly SAMEA112904752 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
4 SAMEA10104914 CA04.10F1a metagenomic_assembly SAMEA112904915 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
... ... ... ... ... ... ... ...
95 SAMEA112369857 SB11.04B1a transcriptomic SAMEA112949107 https://www.ebi.ac.uk/biosamples/SAMEA112369857 None None
96 SAMEA112369858 SB11.05B1a transcriptomic SAMEA112949088 https://www.ebi.ac.uk/biosamples/SAMEA112369858 None None
97 SAMEA112369859 SB11.11B1a transcriptomic SAMEA112949878 https://www.ebi.ac.uk/biosamples/SAMEA112369859 None None
98 SAMEA112369860 SB11.12B1a transcriptomic SAMEA112948858 https://www.ebi.ac.uk/biosamples/SAMEA112369860 None None
99 SAMEA112369861 SB11.13B1a transcriptomic SAMEA112948701 https://www.ebi.ac.uk/biosamples/SAMEA112369861 None None

900 rows × 7 columns

metagenome_assembly_samples = samples_df[samples_df.sample_type == 'metagenomic_assembly']
print(len(metagenome_assembly_samples))
metagenome_assembly_samples.head(3)
458
accession title sample_type animal canonical_url metagenomics_url metabolomics_url
0 SAMEA10104908 CA01.07F1a metagenomic_assembly SAMEA112905066 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
1 SAMEA10104910 CA02.12F1a metagenomic_assembly SAMEA112904813 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
2 SAMEA10104911 CA02.18F1a metagenomic_assembly SAMEA112904777 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None

Filters using query parameters

We didn’t need to fetch all the samples to then select only the metagenomic ones. We can instead fetch only the metagenomic ones using a query parameter filter. (If you’re familiar with SQL, this eventually maps to a WHERE clause on the database.)

page = 1

while page:
    print(f'Fetching {page=}')
    
    mg_samples_page = requests.get(f'{samples_endpoint_base}?{page=}&sample_type=metagenomic_assembly').json()
    mg_samples_page_df = pd.json_normalize(mg_samples_page['items'])
    
    if page == 1:
        mg_samples_df = mg_samples_page_df
    else:
        mg_samples_df = pd.concat([mg_samples_df, mg_samples_page_df])
    page += 1
    if len(mg_samples_df) >= mg_samples_page['count']:
        page = False
Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
Fetching page=10
Fetching page=11
Fetching page=12
Fetching page=13
Fetching page=14
mg_samples_df
accession title sample_type animal canonical_url metagenomics_url metabolomics_url
0 SAMEA10104908 CA01.07F1a metagenomic_assembly SAMEA112905066 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
1 SAMEA10104910 CA02.12F1a metagenomic_assembly SAMEA112904813 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
2 SAMEA10104911 CA02.18F1a metagenomic_assembly SAMEA112904777 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
3 SAMEA10104912 CA03.10F1a metagenomic_assembly SAMEA112904752 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
4 SAMEA10104914 CA04.10F1a metagenomic_assembly SAMEA112904915 https://www.ebi.ac.uk/ena/browser/view/SAMEA10... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
... ... ... ... ... ... ... ...
78 SAMEA9449961 CC17.06F1a metagenomic_assembly SAMEA112905387 https://www.ebi.ac.uk/ena/browser/view/SAMEA94... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
79 SAMEA9449962 CC17.10F1a metagenomic_assembly SAMEA112905371 https://www.ebi.ac.uk/ena/browser/view/SAMEA94... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
80 SAMEA9449963 CC16.16F1a metagenomic_assembly SAMEA112905164 https://www.ebi.ac.uk/ena/browser/view/SAMEA94... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
81 SAMEA9449964 CC18.03F1a metagenomic_assembly SAMEA112904811 https://www.ebi.ac.uk/ena/browser/view/SAMEA94... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None
82 SAMEA9449965 CC18.05F1a metagenomic_assembly SAMEA112905140 https://www.ebi.ac.uk/ena/browser/view/SAMEA94... https://www.ebi.ac.uk/metagenomics/api/v1/samp... None

1383 rows × 7 columns

And so, we’ve fetched the metagenomic assemblies more efficiently, by not pulling pages and pages of irrelevant data.
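
When several query parameters are combined, it can be tidier to build the query string from a dictionary than to format it by hand. The sketch below does this with the standard library. Only the page and sample_type parameters appear on this page; any other filter names should be checked in the Browsable API Playground. The function name samples_url is hypothetical.

```python
from urllib.parse import urlencode

SAMPLES_ENDPOINT = "https://www.holofooddata.org/api/samples"


def samples_url(page=1, **filters):
    """Build a samples list URL with a page number and optional filters."""
    params = {"page": page}
    # Drop filters whose value is None so they are not sent at all.
    params.update({k: v for k, v in filters.items() if v is not None})
    return f"{SAMPLES_ENDPOINT}?{urlencode(params)}"


print(samples_url(page=2, sample_type="metagenomic_assembly"))
```

urlencode also percent-encodes values that contain spaces or other reserved characters, which hand-built query strings can easily get wrong.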

MGnify

You can obtain information about the analyses present on MGnify for a certain sample, using the metagenomics_url returned by the data portal API:

response = requests.get('https://www.holofooddata.org/api/samples/SAMEA13604493')
mgnify_url = response.json().get('metagenomics_url')
mgnify_url
'https://www.ebi.ac.uk/metagenomics/api/v1/samples/SAMEA13604493'

You can then use that MGnify API endpoint to fetch information about that sample and follow related links through MGnify:

mgnify_response = requests.get(mgnify_url)
mgnify_data = mgnify_response.json().get('data')
runs_url = mgnify_data['relationships']['runs']['links']['related']

mgnify_runs_response = requests.get(runs_url)
mgnify_runs_data = mgnify_runs_response.json().get('data')
mgnify_runs_data
[{'type': 'runs',
  'id': 'ERR9358532',
  'attributes': {'experiment-type': 'metagenomic',
   'is-private': False,
   'accession': 'ERR9358532',
   'secondary-accession': 'ERR9358532',
   'ena-study-accession': None,
   'instrument-platform': 'ILLUMINA',
   'instrument-model': 'Illumina NovaSeq 6000'},
  'relationships': {'analyses': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses'}},
   'assemblies': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies'},
    'data': [{'type': 'assemblies',
      'id': 'ERZ13633665',
      'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}}]},
   'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
   'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
   'pipelines': {'data': []}},
  'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532'}}]
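
Since each MGnify record lists its onward links under 'relationships', a small helper can collect them in one place. This is a sketch that works on a trimmed copy of the run record shown above, and it assumes only the nesting visible in that output; the name related_links is hypothetical.

```python
# Trimmed copy of the run record shown above.
run = {
    "type": "runs",
    "id": "ERR9358532",
    "relationships": {
        "analyses": {"links": {
            "related": "https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses"}},
        "assemblies": {"links": {
            "related": "https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies"}},
        "pipelines": {"data": []},
    },
}


def related_links(resource):
    """Map each relationship name to its 'related' URL, where one exists."""
    links = {}
    for name, relationship in resource.get("relationships", {}).items():
        url = relationship.get("links", {}).get("related")
        if url:
            links[name] = url
    return links


for name, url in related_links(run).items():
    print(name, url)
```

Relationships without a 'related' link, such as 'pipelines' in this record, are skipped.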

In most cases, you can also use the base endpoint you’re interested in (e.g. /analyses) with a filter based on the sample accession. For example, to find the MGnify analyses of a sample:

mgnify_response = requests.get('https://www.ebi.ac.uk/metagenomics/api/v1/analyses?sample_accession=SAMEA13604493')
mgnify_response.json().get('data')
[{'type': 'analysis-jobs',
  'id': 'MGYA00617383',
  'attributes': {'analysis-status': 'completed',
   'accession': 'MGYA00617383',
   'experiment-type': 'assembly',
   'analysis-summary': [{'key': 'Submitted nucleotide sequences',
     'value': '344'},
    {'key': 'Nucleotide sequences after format-specific filtering',
     'value': '344'},
    {'key': 'Nucleotide sequences after length filtering', 'value': '344'},
    {'key': 'Nucleotide sequences after undetermined bases filtering',
     'value': '344'},
    {'key': 'Reads with predicted CDS', 'value': '330'},
    {'key': 'Reads with predicted RNA', 'value': '13'},
    {'key': 'Reads with InterProScan match', 'value': '108'},
    {'key': 'Predicted CDS', 'value': '531'},
    {'key': 'Predicted CDS with InterProScan match', 'value': '127'},
    {'key': 'Total InterProScan matches', 'value': '299'},
    {'key': 'Predicted SSU sequences', 'value': '0'},
    {'key': 'Predicted LSU sequences', 'value': '2'}],
   'pipeline-version': '5.0',
   'is-private': False,
   'complete-time': '2022-12-01T19:35:07',
   'instrument-platform': 'ILLUMINA',
   'instrument-model': 'Illumina NovaSeq 6000'},
  'relationships': {'taxonomy-lsu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/lsu'}},
   'go-slim': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-slim'}},
   'antismash-gene-clusters': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/antismash-gene-clusters'}},
   'genome-properties': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/genome-properties'}},
   'run': {'data': None},
   'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
   'go-terms': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-terms'}},
   'taxonomy-itsunite': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/unite'}},
   'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
   'taxonomy-itsonedb': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/itsonedb'}},
   'taxonomy': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy'}},
   'assembly': {'data': {'type': 'assemblies', 'id': 'ERZ13633665'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}},
   'downloads': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/downloads'}},
   'interpro-identifiers': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/interpro-identifiers'}},
   'taxonomy-ssu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/ssu'}}},
  'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383'}}]
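
The 'analysis-summary' attribute is a list of key/value pairs with string values, which is awkward to query directly. A minimal sketch of turning it into a dictionary of integers follows, using a trimmed copy of the values shown above. It assumes every value is an integer string, which holds for this example but is not guaranteed for every analysis; the name summary_to_dict is hypothetical.

```python
# Trimmed copy of the 'analysis-summary' attribute shown above.
analysis_summary = [
    {"key": "Submitted nucleotide sequences", "value": "344"},
    {"key": "Reads with predicted CDS", "value": "330"},
    {"key": "Predicted CDS", "value": "531"},
    {"key": "Predicted LSU sequences", "value": "2"},
]


def summary_to_dict(summary):
    """Turn [{'key': ..., 'value': ...}, ...] into {key: int(value)}."""
    # Assumes every value is an integer string, as in the example above.
    return {row["key"]: int(row["value"]) for row in summary}


counts = summary_to_dict(analysis_summary)
print(counts["Predicted CDS"])
```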

MetaboLights

You can find the MetaboLights URL for a study from the portal API:

mtbls_response = requests.get('https://www.holofooddata.org/api/samples/SAMEA112907471')
mtbls_url = mtbls_response.json().get('metabolomics_url')
mtbls_url
'https://www.ebi.ac.uk/metabolights/ws/studies/MTBLS6988'

This URL is for the entire study on MetaboLights, rather than a specific sample. The data portal website (but not the API, for performance reasons) navigates the MetaboLights API in order to find the files relevant to a particular sample.

The code that does this in the data portal shows how this can be achieved. In short: fetch the files list from the MetaboLights API, and find the metadata_sample file (it is an ISA-TAB file). Look through that file for the row where Comment[BioSamples accession] is your sample of interest. This field may have slightly different names in different studies. This lets you find the Sample Name corresponding to the BioSample accession.

Then, find the metadata_assay file(s) from the file list. Concatenate all of these metadata_assay tab files, and find all the rows from all of the metadata_assay sheets where the Sample Name matches.

These assay sheets will have columns of interest about the assays used, the raw and derived files, and the metabolites association files. These are filenames which you can then download.
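
The matching steps described above can be sketched as follows. This is not the data portal's code: the sheet contents, accessions, and file names are made up for illustration, the sheets are held in memory rather than downloaded from MetaboLights, and the loose match on column names containing "biosample" is an assumption to cope with studies that name the field differently. The function names are hypothetical.

```python
import csv
import io

# Made-up miniature ISA-TAB contents, for illustration only.
SAMPLE_SHEET = (
    "Source Name\tComment[BioSamples accession]\tSample Name\n"
    "animal_1\tSAMEA0000001\tsample_A\n"
    "animal_2\tSAMEA0000002\tsample_B\n"
)
ASSAY_SHEETS = [
    "Sample Name\tRaw Spectral Data File\n"
    "sample_A\tA_pos.raw\n"
    "sample_B\tB_pos.raw\n",
    "Sample Name\tRaw Spectral Data File\n"
    "sample_A\tA_neg.raw\n",
]


def read_tsv(text):
    """Parse tab-separated text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_sample_name(sample_rows, accession):
    """Return the Sample Name of the row whose BioSamples comment matches."""
    for row in sample_rows:
        for column, value in row.items():
            # The column name varies between studies, so match loosely.
            if "biosample" in column.lower() and value == accession:
                return row["Sample Name"]
    return None


def assay_rows_for(assay_sheets, sample_name):
    """Collect the matching rows across all assay sheets."""
    return [
        row
        for sheet in assay_sheets
        for row in read_tsv(sheet)
        if row["Sample Name"] == sample_name
    ]


name = find_sample_name(read_tsv(SAMPLE_SHEET), "SAMEA0000001")
files = [r["Raw Spectral Data File"] for r in assay_rows_for(ASSAY_SHEETS, name)]
print(name, files)
```

In a real study, the sheet text would come from the metadata_sample and metadata_assay files in the MetaboLights file list, and the columns of interest may differ from the single file column used here.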