API for Programmatic Access

APIs (Application Programming Interfaces) allow programmatic access to data resources. If you need to fetch and analyse metadata for many Samples, you can write a script to fetch data in this way.

The HoloFood Data Portal API gives programmatic access to the HoloFood Samples and their metadata, as well as URLs for where datasets are stored in public archives.

Canonical URLs

Throughout the API, responses include a canonical_url field that points to the canonical database entry (the authoritative source) for each data object.

These are:

BioSamples for Samples (both host-animal level and derived-extraction level).
The European Nucleotide Archive Browser for nucleotide sequencing data.
MGnify for metagenomic-derived analyses and MAGs (metagenome-assembled genomes).
MetaboLights for metabolomic analyses.
The websites of various partner institutions and registries where an IRI has been supplied with a metadata entry.
The HoloFood Data Portal itself for “Analysis Summaries”, which are documents hosted only by this website.
The HoloFood Data Portal itself for viral annotations.
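
As a rough illustration of how these canonical URLs might be told apart in a script, the sketch below maps a URL to the archive it points at, based on URL prefixes seen in the example responses on this page. The function name `archive_for` and the prefix table are assumptions for illustration and are not part of the API; IRIs from partner institutions fall through to a generic label.

```python
from urllib.parse import urlparse

# (host, path prefix, archive name), based on example URLs on this page.
# This table is an illustrative assumption, not an official registry.
ARCHIVE_PREFIXES = [
    ("www.ebi.ac.uk", "/biosamples", "BioSamples"),
    ("www.ebi.ac.uk", "/ena/browser", "European Nucleotide Archive"),
    ("www.ebi.ac.uk", "/metagenomics", "MGnify"),
    ("www.ebi.ac.uk", "/metabolights", "MetaboLights"),
    ("www.holofooddata.org", "", "HoloFood Data Portal"),
]


def archive_for(canonical_url):
    """Return a human-readable archive name for a canonical URL."""
    if not canonical_url:
        return None  # some entries have no canonical URL (null in the JSON)
    parsed = urlparse(canonical_url)
    for host, path_prefix, name in ARCHIVE_PREFIXES:
        if parsed.netloc == host and parsed.path.startswith(path_prefix):
            return name
    return "Other (partner institution or registry)"


print(archive_for("https://www.ebi.ac.uk/biosamples/SAMEA112369857"))
```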

API Endpoints and Playground

The “Browsable API Playground” is the best place to discover the API. Find this under the API navigation item on the data portal.

The browsable API lets you see the endpoints and their response schemas. Under an endpoint, press Try it out to see the output for a specific query.

In brief, the top-level endpoints are:

`/api/samples`

List samples, or fetch details about a specific sample (like its metadata). List all possible metadata markers (i.e. keys).

`/api/animals`

List animals (hosts), or fetch details about a specific animal (like its metadata and derived samples).

`/api/analysis-summaries`

List the analysis summaries published on the data portal.

`/api/genome-catalogues`

List MAG catalogues, fetch details about a catalogue, or list the MAGs within a catalogue.

`/api/viral-catalogues`

List viral catalogues, fetch details about a catalogue, or list the fragments within a catalogue.
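
A small helper can keep these paths in one place. The sketch below assumes the base URL https://www.holofooddata.org used in the examples on this page, and assumes that a detail view is reached by appending an identifier to the list path, as the /api/samples/SAMEA112949944 example does. Whether every endpoint follows that pattern is best checked in the playground. The name `endpoint_url` is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://www.holofooddata.org"

# Top-level endpoints listed above.
ENDPOINTS = {
    "samples": "/api/samples",
    "animals": "/api/animals",
    "analysis-summaries": "/api/analysis-summaries",
    "genome-catalogues": "/api/genome-catalogues",
    "viral-catalogues": "/api/viral-catalogues",
}


def endpoint_url(name, identifier=None):
    """Build a list URL, or a detail URL when an identifier is given."""
    url = BASE_URL + ENDPOINTS[name]  # raises KeyError for unknown names
    if identifier is not None:
        # Assumption: detail views live at <list path>/<identifier>.
        url += "/" + quote(str(identifier), safe="")
    return url


print(endpoint_url("samples"))
print(endpoint_url("samples", "SAMEA112949944"))
```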

Using the API

From the command line

Use a command line tool like cURL to query the API. Responses are in JSON format.

For example, to list samples (the response is paginated, so this returns the first page):

curl https://www.holofooddata.org/api/samples

To find the Stearic acid 18:0 data associated with the fatty acids sample SAMEA112949944, we could filter the response with jq, a command line tool for handling JSON data:

curl https://www.holofooddata.org/api/samples/SAMEA112949944 | jq '.structured_metadata | .[] | select(.marker.name == "Stearic acid 18:0")'

which returns:

{
  "marker": {
    "name": "Stearic acid 18:0",
    "type": "FATTY ACIDS MG",
    "canonical_url": null
  },
  "measurement": "1,47",
  "units": "mg/g"
}
{
  "marker": {
    "name": "Stearic acid 18:0",
    "type": "FATTY ACIDS PERCENTAGE",
    "canonical_url": null
  },
  "measurement": "1,47",
  "units": "%"
}
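
The same selection can be made in Python once the JSON is parsed. The sketch below works on a dictionary shaped like the response above, trimmed to the two entries shown; `find_markers` and `to_float` are hypothetical helper names. The measurement is a string containing a comma, and reading that comma as a decimal separator is an assumption about the data, so it is worth checking against your markers before converting.

```python
# A trimmed stand-in for the parsed JSON of /api/samples/SAMEA112949944,
# containing only the two entries shown above.
sample = {
    "structured_metadata": [
        {
            "marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS MG", "canonical_url": None},
            "measurement": "1,47",
            "units": "mg/g",
        },
        {
            "marker": {"name": "Stearic acid 18:0", "type": "FATTY ACIDS PERCENTAGE", "canonical_url": None},
            "measurement": "1,47",
            "units": "%",
        },
    ]
}


def find_markers(sample_detail, marker_name):
    """Return all structured metadata entries whose marker has the given name."""
    return [
        entry
        for entry in sample_detail.get("structured_metadata", [])
        if entry["marker"]["name"] == marker_name
    ]


def to_float(measurement):
    """Parse a measurement string, assuming ',' is a decimal separator."""
    return float(measurement.replace(",", "."))


for entry in find_markers(sample, "Stearic acid 18:0"):
    print(entry["marker"]["type"], to_float(entry["measurement"]), entry["units"])
```

With live data, `sample` would instead be the parsed JSON of the sample detail response.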

From Python

More realistically, you may want to fetch a list of samples as a Pandas dataframe.

This requires the requests and pandas packages:

pip install requests pandas

import requests
import pandas as pd

samples = requests.get('https://www.holofooddata.org/api/samples')

samples_df = pd.json_normalize(samples.json()['items'])
samples_df.head(3)

	accession	title	sample_type	animal	canonical_url	metagenomics_url	metabolomics_url
0	SAMEA10104908	CA01.07F1a	metagenomic_assembly	SAMEA112905066	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
1	SAMEA10104910	CA02.12F1a	metagenomic_assembly	SAMEA112904813	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
2	SAMEA10104911	CA02.18F1a	metagenomic_assembly	SAMEA112904777	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None

Paginated data

However, we haven’t got all of the data. We only have one page:

len(samples_df)

The API response does tell us how many items there are in total:

samples.json()['count']

The ?page= URL query parameter lets us retrieve subsequent pages:

samples_endpoint_base = 'https://www.holofooddata.org/api/samples'

page = 1

# We will only fetch the first 9 pages for now...

max_pages = 9

while page and page <= max_pages:
    print(f'Fetching {page=}')
    
    samples_page = requests.get(f'{samples_endpoint_base}?{page=}').json()
    samples_page_df = pd.json_normalize(samples_page['items'])
    
    if page == 1:
        samples_df = samples_page_df
    else:
        samples_df = pd.concat([samples_df, samples_page_df])
    page += 1
    if len(samples_df) >= samples_page['count']:
        page = False

Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9

len(samples_df)
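
The paging logic can also be wrapped in a reusable generator. The sketch below is one way to do it, under the assumptions visible in the responses above: each page carries an `items` list and a total `count`. `iter_items` and `fake_get_json` are hypothetical names, and the fake fetcher stands in for the network so that the example runs offline.

```python
def iter_items(get_json, endpoint, max_pages=None):
    """Yield items from a paginated endpoint, one page at a time.

    get_json: callable taking a URL and returning the parsed JSON response.
    max_pages: optional cap on the number of pages fetched.
    """
    seen = 0
    page = 1
    while max_pages is None or page <= max_pages:
        data = get_json(f"{endpoint}?page={page}")
        items = data["items"]
        yield from items
        seen += len(items)
        # Stop once everything has been seen, or on an empty page
        # (a guard against looping forever if 'count' is unreliable).
        if not items or seen >= data["count"]:
            break
        page += 1


# Offline stand-in for the API: 5 made-up items served 2 per page.
def fake_get_json(url):
    page = int(url.rsplit("=", 1)[1])
    everything = [{"accession": f"S{i}"} for i in range(5)]
    start = (page - 1) * 2
    return {"items": everything[start:start + 2], "count": len(everything)}


accessions = [
    item["accession"]
    for item in iter_items(fake_get_json, "https://example.org/api/samples")
]
print(accessions)
```

With requests, `lambda url: requests.get(url).json()` could be passed as `get_json`, and the dataframe built once with `pd.json_normalize(list(...))` instead of concatenating on every page.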

Look at the combined dataframe, then find all metagenomic assembly samples:

samples_df

	accession	title	sample_type	animal	canonical_url	metagenomics_url	metabolomics_url
0	SAMEA10104908	CA01.07F1a	metagenomic_assembly	SAMEA112905066	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
1	SAMEA10104910	CA02.12F1a	metagenomic_assembly	SAMEA112904813	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
2	SAMEA10104911	CA02.18F1a	metagenomic_assembly	SAMEA112904777	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
3	SAMEA10104912	CA03.10F1a	metagenomic_assembly	SAMEA112904752	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
4	SAMEA10104914	CA04.10F1a	metagenomic_assembly	SAMEA112904915	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
...	...	...	...	...	...	...	...
95	SAMEA112369857	SB11.04B1a	transcriptomic	SAMEA112949107	https://www.ebi.ac.uk/biosamples/SAMEA112369857	None	None
96	SAMEA112369858	SB11.05B1a	transcriptomic	SAMEA112949088	https://www.ebi.ac.uk/biosamples/SAMEA112369858	None	None
97	SAMEA112369859	SB11.11B1a	transcriptomic	SAMEA112949878	https://www.ebi.ac.uk/biosamples/SAMEA112369859	None	None
98	SAMEA112369860	SB11.12B1a	transcriptomic	SAMEA112948858	https://www.ebi.ac.uk/biosamples/SAMEA112369860	None	None
99	SAMEA112369861	SB11.13B1a	transcriptomic	SAMEA112948701	https://www.ebi.ac.uk/biosamples/SAMEA112369861	None	None

900 rows × 7 columns

metagenome_assembly_samples = samples_df[samples_df.sample_type == 'metagenomic_assembly']
print(len(metagenome_assembly_samples))
metagenome_assembly_samples.head(3)

	accession	title	sample_type	animal	canonical_url	metagenomics_url	metabolomics_url
0	SAMEA10104908	CA01.07F1a	metagenomic_assembly	SAMEA112905066	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
1	SAMEA10104910	CA02.12F1a	metagenomic_assembly	SAMEA112904813	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
2	SAMEA10104911	CA02.18F1a	metagenomic_assembly	SAMEA112904777	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None

Filters using query parameters

We didn’t need to fetch all the samples to then select only the metagenomic ones. We can instead fetch only the metagenomic ones using a query parameter filter. (If you’re familiar with SQL, this eventually maps to a WHERE clause on the database.)

page = 1

while page:
    print(f'Fetching {page=}')
    
    mg_samples_page = requests.get(f'{samples_endpoint_base}?{page=}&sample_type=metagenomic_assembly').json()
    mg_samples_page_df = pd.json_normalize(mg_samples_page['items'])
    
    if page == 1:
        mg_samples_df = mg_samples_page_df
    else:
        mg_samples_df = pd.concat([mg_samples_df, mg_samples_page_df])
    page += 1
    if len(mg_samples_df) >= mg_samples_page['count']:
        page = False

Fetching page=1
Fetching page=2
Fetching page=3
Fetching page=4
Fetching page=5
Fetching page=6
Fetching page=7
Fetching page=8
Fetching page=9
Fetching page=10
Fetching page=11
Fetching page=12
Fetching page=13
Fetching page=14

mg_samples_df

	accession	title	sample_type	animal	canonical_url	metagenomics_url	metabolomics_url
0	SAMEA10104908	CA01.07F1a	metagenomic_assembly	SAMEA112905066	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
1	SAMEA10104910	CA02.12F1a	metagenomic_assembly	SAMEA112904813	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
2	SAMEA10104911	CA02.18F1a	metagenomic_assembly	SAMEA112904777	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
3	SAMEA10104912	CA03.10F1a	metagenomic_assembly	SAMEA112904752	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
4	SAMEA10104914	CA04.10F1a	metagenomic_assembly	SAMEA112904915	https://www.ebi.ac.uk/ena/browser/view/SAMEA10...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
...	...	...	...	...	...	...	...
78	SAMEA9449961	CC17.06F1a	metagenomic_assembly	SAMEA112905387	https://www.ebi.ac.uk/ena/browser/view/SAMEA94...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
79	SAMEA9449962	CC17.10F1a	metagenomic_assembly	SAMEA112905371	https://www.ebi.ac.uk/ena/browser/view/SAMEA94...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
80	SAMEA9449963	CC16.16F1a	metagenomic_assembly	SAMEA112905164	https://www.ebi.ac.uk/ena/browser/view/SAMEA94...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
81	SAMEA9449964	CC18.03F1a	metagenomic_assembly	SAMEA112904811	https://www.ebi.ac.uk/ena/browser/view/SAMEA94...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None
82	SAMEA9449965	CC18.05F1a	metagenomic_assembly	SAMEA112905140	https://www.ebi.ac.uk/ena/browser/view/SAMEA94...	https://www.ebi.ac.uk/metagenomics/api/v1/samp...	None

1383 rows × 7 columns

And so, we’ve fetched the metagenomic assemblies more efficiently, by not pulling pages and pages of irrelevant data.
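
The number of pages fetched follows from the total count and the page size. The quick check below assumes a page size of 100 items, which is inferred from the 900 rows returned by 9 pages earlier and not taken from the API documentation. It also builds the filtered URL with `urlencode` rather than by hand.

```python
import math
from urllib.parse import urlencode

samples_endpoint_base = "https://www.holofooddata.org/api/samples"

total_count = 1383  # rows in the filtered dataframe above
page_size = 100     # assumption: inferred from 900 rows over 9 pages

pages_needed = math.ceil(total_count / page_size)
print(pages_needed)  # 14, matching the 14 "Fetching page=" lines above

# Query parameters are encoded in the order given.
params = {"page": pages_needed, "sample_type": "metagenomic_assembly"}
last_page_url = f"{samples_endpoint_base}?{urlencode(params)}"
print(last_page_url)
```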

MGnify

You can obtain information about the analyses present on MGnify for a certain sample, using the metagenomics_url returned by the data portal API:

response = requests.get('https://www.holofooddata.org/api/samples/SAMEA13604493')
mgnify_url = response.json().get('metagenomics_url')
mgnify_url

'https://www.ebi.ac.uk/metagenomics/api/v1/samples/SAMEA13604493'

You can then use that MGnify API endpoint to fetch information about that sample and follow related links through MGnify:

mgnify_response = requests.get(mgnify_url)
mgnify_data = mgnify_response.json().get('data')
runs_url = mgnify_data['relationships']['runs']['links']['related']

mgnify_runs_response = requests.get(runs_url)
mgnify_runs_data = mgnify_runs_response.json().get('data')
mgnify_runs_data

[{'type': 'runs',
  'id': 'ERR9358532',
  'attributes': {'experiment-type': 'metagenomic',
   'is-private': False,
   'accession': 'ERR9358532',
   'secondary-accession': 'ERR9358532',
   'ena-study-accession': None,
   'instrument-platform': 'ILLUMINA',
   'instrument-model': 'Illumina NovaSeq 6000'},
  'relationships': {'analyses': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses'}},
   'assemblies': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies'},
    'data': [{'type': 'assemblies',
      'id': 'ERZ13633665',
      'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}}]},
   'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
   'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
   'pipelines': {'data': []}},
  'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532'}}]
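
The response uses a nested resource and relationships layout, so following links is mostly dictionary navigation. The sketch below works on an excerpt of the output above; `related_link` and `linked_ids` are hypothetical helpers, not functions of any MGnify client.

```python
# Excerpt of the runs response shown above (one run, two relationships).
runs = [
    {
        "type": "runs",
        "id": "ERR9358532",
        "relationships": {
            "analyses": {
                "links": {"related": "https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/analyses"}
            },
            "assemblies": {
                "links": {"related": "https://www.ebi.ac.uk/metagenomics/api/v1/runs/ERR9358532/assemblies"},
                "data": [{"type": "assemblies", "id": "ERZ13633665"}],
            },
        },
    }
]


def related_link(resource, relationship):
    """Return the 'related' URL for a named relationship, or None if absent."""
    rel = resource.get("relationships", {}).get(relationship, {})
    return rel.get("links", {}).get("related")


def linked_ids(resource, relationship):
    """Return the IDs embedded in a relationship's 'data', if any."""
    data = resource.get("relationships", {}).get(relationship, {}).get("data") or []
    if isinstance(data, dict):  # some relationships hold a single object
        data = [data]
    return [entry["id"] for entry in data]


for run in runs:
    print(run["id"], related_link(run, "analyses"), linked_ids(run, "assemblies"))
```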

In most cases, you can also use the base endpoint you’re interested in (e.g. /analyses) with a filter based on the sample accession. For example, to find the MGnify analyses of a sample:

mgnify_response = requests.get('https://www.ebi.ac.uk/metagenomics/api/v1/analyses?sample_accession=SAMEA13604493')
mgnify_response.json().get('data')

[{'type': 'analysis-jobs',
  'id': 'MGYA00617383',
  'attributes': {'analysis-status': 'completed',
   'accession': 'MGYA00617383',
   'experiment-type': 'assembly',
   'analysis-summary': [{'key': 'Submitted nucleotide sequences',
     'value': '344'},
    {'key': 'Nucleotide sequences after format-specific filtering',
     'value': '344'},
    {'key': 'Nucleotide sequences after length filtering', 'value': '344'},
    {'key': 'Nucleotide sequences after undetermined bases filtering',
     'value': '344'},
    {'key': 'Reads with predicted CDS', 'value': '330'},
    {'key': 'Reads with predicted RNA', 'value': '13'},
    {'key': 'Reads with InterProScan match', 'value': '108'},
    {'key': 'Predicted CDS', 'value': '531'},
    {'key': 'Predicted CDS with InterProScan match', 'value': '127'},
    {'key': 'Total InterProScan matches', 'value': '299'},
    {'key': 'Predicted SSU sequences', 'value': '0'},
    {'key': 'Predicted LSU sequences', 'value': '2'}],
   'pipeline-version': '5.0',
   'is-private': False,
   'complete-time': '2022-12-01T19:35:07',
   'instrument-platform': 'ILLUMINA',
   'instrument-model': 'Illumina NovaSeq 6000'},
  'relationships': {'taxonomy-lsu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/lsu'}},
   'go-slim': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-slim'}},
   'antismash-gene-clusters': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/antismash-gene-clusters'}},
   'genome-properties': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/genome-properties'}},
   'run': {'data': None},
   'study': {'data': {'type': 'studies', 'id': 'MGYS00006086'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00006086'}},
   'go-terms': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/go-terms'}},
   'taxonomy-itsunite': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/unite'}},
   'sample': {'data': {'type': 'samples', 'id': 'ERS11206669'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/samples/ERS11206669'}},
   'taxonomy-itsonedb': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/itsonedb'}},
   'taxonomy': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy'}},
   'assembly': {'data': {'type': 'assemblies', 'id': 'ERZ13633665'},
    'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ13633665'}},
   'downloads': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/downloads'}},
   'interpro-identifiers': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/interpro-identifiers'}},
   'taxonomy-ssu': {'links': {'related': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383/taxonomy/ssu'}}},
  'links': {'self': 'https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00617383'}}]
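
The `analysis-summary` attribute is a list of key/value pairs with string values, which is awkward to query directly. A minimal sketch for turning it into a dictionary of integers follows, using an excerpt of the output above. `summary_to_dict` is a hypothetical helper, and converting every value with `int` assumes all values are whole numbers, as they are in this example.

```python
# Excerpt of the analysis shown above (three of the summary entries).
analysis = {
    "id": "MGYA00617383",
    "attributes": {
        "analysis-summary": [
            {"key": "Submitted nucleotide sequences", "value": "344"},
            {"key": "Predicted CDS", "value": "531"},
            {"key": "Predicted CDS with InterProScan match", "value": "127"},
        ]
    },
}


def summary_to_dict(analysis_job):
    """Map each analysis-summary key to its value as an integer."""
    pairs = analysis_job["attributes"]["analysis-summary"]
    return {pair["key"]: int(pair["value"]) for pair in pairs}


summary = summary_to_dict(analysis)
fraction = summary["Predicted CDS with InterProScan match"] / summary["Predicted CDS"]
print(f"{fraction:.1%} of predicted CDS have an InterProScan match")
```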

MetaboLights

You can find the MetaboLights URL of the study associated with a sample from the portal API:

mtbls_response = requests.get('https://www.holofooddata.org/api/samples/SAMEA112907471')
mtbls_url = mtbls_response.json().get('metabolomics_url')
mtbls_url

'https://www.ebi.ac.uk/metabolights/ws/studies/MTBLS6988'

This URL is for the entire study on MetaboLights, rather than a specific sample. The data portal website (but not the API, for performance reasons) navigates the MetaboLights API in order to find the files relevant to a particular sample.

The code that does this in the data portal shows how this can be achieved. In short: fetch the files list from the MetaboLights API, and find the metadata_sample file (it is an ISA-TAB file). Look through that file for the row (or rows) where Comment[BioSamples accession] matches your sample(s) of interest. This field may have slightly different names in different studies. This lets you find the Sample Name corresponding to the BioSample accession.

Then, find the metadata_assay file(s) from the file list. Concatenate all of these metadata_assay tab files, and find all the rows from all of the metadata_assay sheets where the Sample Name matches.

These assay sheets will have columns of interest about the assays used, the raw and derived files, and the metabolites association files. These are filenames which you can then download.
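
The lookup described above can be sketched with the standard `csv` module. The two tab-separated tables below are tiny made-up stand-ins for a metadata_sample file and a metadata_assay file. The accessions, sample names, file names and assay column names are illustrative assumptions, and the accession column may be named differently in a real study. Fetching the files from MetaboLights is left out.

```python
import csv
import io

# Made-up stand-ins for the ISA-TAB files (tab-separated); not real data.
sample_sheet = (
    "Sample Name\tComment[BioSamples accession]\n"
    "sample_A\tSAMEA0000001\n"
    "sample_B\tSAMEA0000002\n"
)
assay_sheets = [
    "Sample Name\tRaw Spectral Data File\tMetabolite Assignment File\n"
    "sample_A\traw_A.mzML\tm_example_maf.tsv\n"
    "sample_B\traw_B.mzML\tm_example_maf.tsv\n"
]


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def assay_rows_for(accession, sample_text, assay_texts,
                   accession_column="Comment[BioSamples accession]"):
    """Find assay rows for a BioSamples accession via its Sample Name."""
    sample_names = {
        row["Sample Name"]
        for row in read_tsv(sample_text)
        if row.get(accession_column) == accession
    }
    # Treat all assay sheets as one concatenated table.
    return [
        row
        for text in assay_texts
        for row in read_tsv(text)
        if row["Sample Name"] in sample_names
    ]


for row in assay_rows_for("SAMEA0000002", sample_sheet, assay_sheets):
    print(row["Raw Spectral Data File"], row["Metabolite Assignment File"])
```

Passing a different `accession_column` covers studies where the field is named slightly differently.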

From R

Users of the R language may wish to use the HoloFoodR package to access the API.