How to download sample attributes (sample metadata) file from the European nucleotide archive (EMBL-EBI)?
6
7.2 years ago
alowi33 ▴ 50

Project PRJEB99111 has 147 samples. I want to download the metadata (age, sex, disease status, etc.) of each sample, not the FASTQ files. The only way I have found is to download the XML file of each sample accession one by one. Is there a way to bulk download all 147 metadata files? I can work with XML files if I have to.

You can view the metadata for a specific sample accession by clicking on the "attributes" tab. Here is an example for one sample: https://www.ebi.ac.uk/ena/data/view/SAMEA104228123

EMBL-EBI attributes European nucleotide archive • 11k views
ADD COMMENT
6
7.2 years ago

With the following XSLT stylesheet (saved as transform.xsl):

$ wget -q -O - "https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt" \
  | grep -v sample_accession \
  | cut -f 3 \
  | awk '{printf("https://www.ebi.ac.uk/ena/data/view/%s&display=xml\n",$0);}' \
  | while read U; do wget -O - -q "$U" | xsltproc transform.xsl - ; done


ERS1887136|age|61
ERS1887136|age_units|years
ERS1887136|body_habitat|UBERON:feces
ERS1887136|body_product|UBERON:feces
ERS1887136|body_site|UBERON:feces
ERS1887136|collection_site|UCSF
ERS1887136|collection_timestamp|2013-10-08
ERS1887136|day_in_timeseries|Missing: Not provided
ERS1887136|disease_course|RRMS
ERS1887136|disease_state|MS
ERS1887136|dna_extracted|TRUE
ERS1887136|elevation|124
ERS1887136|env_biome|urban biome
ERS1887136|env_feature|human-associated habitat
ERS1887136|env_material|feces
ERS1887136|env_package|human-gut
ERS1887136|flare|No
ERS1887136|geo_loc_name|USA:CA:San Francisco
ERS1887136|height|Missing: Not provided
ERS1887136|height_units|Missing: Not provided
ERS1887136|host_common_name|human
ERS1887136|host scientific name|Homo sapiens
ERS1887136|host_subject_id|34
ERS1887136|host_taxid|9606
ERS1887136|household|H1004
ERS1887136|investigation_type|mimarks-survey
ERS1887136|latitude|37.76
ERS1887136|life_stage|adult
ERS1887136|longitude|-122.46
ERS1887136|physical_specimen_location|UCSF
ERS1887136|physical_specimen_remaining|FALSE
ERS1887136|repeated_sequencing|1
ERS1887136|sample_type|stool
ERS1887136|sequencing_set|2
ERS1887136|sex|female
ERS1887136|sinai_unmarked_rep|Missing: Not provided
ERS1887136|submission_number|1
(...)
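
For anyone who prefers not to use xsltproc, the transform step can be approximated in Python. This is only a sketch, not the stylesheet used above. It assumes that each sample record is a SAMPLE element carrying an accession attribute, and that attributes are stored as SAMPLE_ATTRIBUTE elements with TAG and VALUE children. The embedded record is a shortened, hand-written stand-in for a real download, and sample_xml_to_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written stand-in for one downloaded sample record (assumed layout).
DEMO_XML = """<SAMPLE_SET>
  <SAMPLE accession="ERS1887136">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>61</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_xml_to_rows(xml_text):
    """Return (accession, tag, value) tuples for every attribute of every SAMPLE."""
    rows = []
    root = ET.fromstring(xml_text)
    for sample in root.iter("SAMPLE"):
        accession = sample.get("accession", "")
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = (attr.findtext("TAG") or "").strip()
            value = (attr.findtext("VALUE") or "").strip()
            rows.append((accession, tag, value))
    return rows


if __name__ == "__main__":
    # Print in the same pipe-delimited layout as the output above.
    for row in sample_xml_to_rows(DEMO_XML):
        print("|".join(row))
```

If the real XML uses different element names, only the three tag strings in the function need changing.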
ADD COMMENT
0

Exquisite solution. However, it only works for the first 3 samples, and then the following error message is repeated many times:

unable to parse -
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

Perhaps the site stopped granting us access, thinking that we were not human. I don't know.
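
The "Document is empty" message means xsltproc received no input, so the wget call for that sample returned nothing. If the failures are intermittent (an assumption, since the cause is not established here), pausing and retrying may help. A minimal Python sketch of that idea follows; fetch_with_retry and its parameters are hypothetical names, not part of any ENA client.

```python
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, fetch=fetch_url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, pausing `delay` seconds between tries.

    An empty body or an OSError (urllib's URLError is a subclass) counts as a
    failure. Returns the body, or "" if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            body = fetch(url)
            if body.strip():
                return body
        except OSError:
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return ""
```

The fetch and sleep arguments are injectable so the retry logic can be tried out without touching the network.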

ADD REPLY
1

However, it only works for the first 3 samples

It works on my machine:

https://pastebin.com/sq6dzSKX

ADD REPLY
0

Astounding! Much appreciated. I wonder why it didn't fully work on my machine...

ADD REPLY
0

How can you change the XSLT stylesheet so that samples are rows and sample attributes are columns, in tab-delimited format? Example:

               age       age_units   ...
ERS1887136     61        years     ...
ERS1887137     61        years     ...
ERS1887138     44        years     ...
...            ...         ...
ADD REPLY
0

I had the same question and came up with a solution using datamash (note that you may have to install it first, e.g. with Homebrew on a Mac). Try this code, which builds on the original solution above.

wget -q  -O - "https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt" \
| grep -v sample_accession \
| cut -f 3 \
| awk '{printf("https://www.ebi.ac.uk/ena/data/view/%s&display=xml\n",$0);}' \
| while read U; do wget -O - -q "$U" | xsltproc transform.xsl - ; done \
| sed 's/|/\t/g' \
| datamash groupby 1 collapse 3 \
| sed 's/,/\t/g' \
| sed 's/ /_/g'
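
As an alternative to datamash, the pivot can be sketched in Python, which also gives a header row and leaves commas and spaces inside values untouched. This is an illustration under the assumption that the input is the "sample|attribute|value" output shown earlier; pivot_rows and to_tsv are hypothetical helper names, and the demo lines reuse values quoted in this thread.

```python
import csv
import io


def pivot_rows(lines, sep="|"):
    """Turn 'sample|attribute|value' lines into (header, table_rows).

    Columns appear in the order attributes are first seen and samples keep
    their input order. Attributes missing for a sample become empty strings.
    """
    samples = {}   # sample accession -> {attribute: value}
    columns = []   # attribute names in first-seen order
    for line in lines:
        line = line.rstrip("\n")
        if line.count(sep) < 2:
            continue  # skip blank or malformed lines
        sample, attribute, value = line.split(sep, 2)
        samples.setdefault(sample, {})[attribute] = value
        if attribute not in columns:
            columns.append(attribute)
    header = ["sample"] + columns
    table = [[s] + [attrs.get(c, "") for c in columns] for s, attrs in samples.items()]
    return header, table


def to_tsv(header, table):
    """Serialise the pivoted table as tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(table)
    return out.getvalue()


if __name__ == "__main__":
    demo = ["ERS1887136|age|61", "ERS1887136|age_units|years",
            "ERS1887138|age|44", "ERS1887138|age_units|years"]
    print(to_tsv(*pivot_rows(demo)), end="")
```

To use it at the end of the pipeline, pass sys.stdin to pivot_rows instead of the demo list.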
ADD REPLY
0

Hello,

I wanted to ask about the last part of the script you have written, "xsltproc transform.xsl".

If I run the whole script, I get an error that transform.xsl is not found: "warning: failed to load external entity "transform.xsl" cannot parse transform.xsl". If I run it step by step, results are produced and I seem to get the correct XML, but the transformation does not work.

I am new to Unix, but I understand the script up to the xsltproc part. When I run "xsltproc -h", there is no transform.xsl option. How does this work? Does it need to be installed separately?

Thanks, Martina

ADD REPLY
1

Did you download the XSLT stylesheet above and save it as transform.xsl?

ADD REPLY
0

No, I hadn't. It works now. Thank you!

ADD REPLY
4
7.2 years ago
piet ★ 1.9k

Unfortunately NCBI does not contain metadata for this project.

This is not true. You can easily download an XML file containing all of the attributes of all the biosamples from NCBI. Since the procedure may also be useful in other contexts, I will describe it step by step.

First go to the page of the project (the BioProject database in NCBI speak):

https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB99111

Next, get a list of all biosamples which are linked to this project. There is a section entitled "Related information" on the right side of the page. To get the list of biosamples, click on the hyperlink "Biosample".

This will open a new page which lists the first 20 biosamples in the project. The URL of that page is:

https://www.ncbi.nlm.nih.gov/biosample?LinkName=bioproject_biosample_all&from_uid=400734

At the top of this page (on the right side) is a pull-down menu entitled "Send to:". Click on this menu, then select "File", then select the format "Full XML (text)", and finally click on the button "Create File". Store the XML file on your local disk and parse it with your favorite XML tool.
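
As one possible way to do the parsing step, here is a short Python sketch. It assumes the downloaded file contains BioSample elements with an accession attribute, and Attribute elements whose attribute_name holds the field name. The embedded record and its accession are made up for the demo, and biosample_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened record standing in for the downloaded file (assumed layout).
DEMO_XML = """<BioSampleSet>
  <BioSample accession="SAMEA_DEMO">
    <Attributes>
      <Attribute attribute_name="age">61</Attribute>
      <Attribute attribute_name="sex">female</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""


def biosample_attributes(xml_text):
    """Map each BioSample accession to a dict of {attribute_name: value}."""
    result = {}
    for sample in ET.fromstring(xml_text).iter("BioSample"):
        attrs = {}
        for attr in sample.iter("Attribute"):
            name = attr.get("attribute_name")
            if name:
                attrs[name] = (attr.text or "").strip()
        result[sample.get("accession", "")] = attrs
    return result


if __name__ == "__main__":
    for accession, attrs in biosample_attributes(DEMO_XML).items():
        print(accession, attrs)
```

For the real file, read its contents with open(path, encoding="utf-8").read() and pass the text to the function.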

ADD COMMENT
0

That is what I was looking for. Usually bioprojects in NCBI contain a file with all metadata. This file is available in other bioprojects but I couldn't find it in this project. I didn't know about the option you described. Very simple yet useful. Many thanks.

ADD REPLY
0

Thank you, this is useful. I guess users can sometimes miss being able to obtain a single file with every sample's metadata directly from the ENA repositories, instead of accessing this information through NCBI.

ADD REPLY
2
7.2 years ago
GenoMax 148k

Using NCBI eUtils (Entrez Direct):

esearch -db bioproject -query "PRJEB99111" | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -block Attribute -element Attribute

which produces (only a sample is shown below):

2017-08-28  2017-08-26  ERS1887138  female  44  years   UBERON:feces    UBERON:feces    UBERON:feces    UCSF    2013-09-25  Missing: Not provided   RRMS    MS  TRUE    124 urban biome human-associated habitat    feces   human-gut   No  USA:CA:San Francisco    Missing: Not provided   Missing: Not provided   Homo sapiens    111 9606    Missing: Not provided   mimarks-survey  37.76   adult   -122.46 UCSF    FALSE   1   stool   1   Missing: Not provided   1   1_a No  Gut dysbiosis in patients with multiple sclerosis is characterized by bacteria that regulate T lymphocyte differentiation in vitro  No_Treatment    Off Missing: Not provided   Missing: Not provided   dry 1990
2017-08-28  2017-08-26  ERS1887137  male    61  years   UBERON:feces    UBERON:feces    UBERON:feces    UCSF    Missing: Not provided   Missing: Not provided   RRMS    MS  TRUE    124 urban biome human-associated habitat    feces   human-gut   No  USA:CA:San Francisco    Missing: Not provided   Missing: Not provided   Homo sapiens    62  9606    Missing: Not provided   mimarks-survey  37.76   adult   -122.46 UCSF    FALSE   1   stool   2   Missing: Not provided   1   1_a No  Gut dysbiosis in patients with multiple sclerosis is characterized by bacteria that regulate T lymphocyte differentiation in vitro  No_Treatment    Off Missing: Not provided   Missing: Not provided   dry 1984
ADD COMMENT
0

Worked great :) Is there any way to include each attribute's category (its name) in the first line?

ADD REPLY
1
7.2 years ago
piet ★ 1.9k

In my opinion, NCBI Entrez/Eutils is more versatile than EBI for downloads like this. If you want to stick with EBI, you can run the loop over all entries of the project on your local computer. There are only 147 samples. Since tasks like this are usually run only once, do not worry too much about computational efficiency.

First download the list of all sample accessions in the project:

wget 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=sample_accession&download=txt' -O - | tee /tmp/acc.lst | less

A single biosample with all attributes can be fetched in this way:

wget 'https://www.ebi.ac.uk/ena/data/view/SAMEA104228123&display=xml' -O - | less

To fetch all samples, loop over all of the sample accessions in the list (skipping the header line that the report starts with):

foreach a (`sed 1d /tmp/acc.lst`)
      wget "https://www.ebi.ac.uk/ena/data/view/$a&display=xml" -O $a.xml
end

The above shows how to accomplish it with C shell. It should also be easy to achieve this with python and requests.
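
A rough Python version of the same loop might look like this. It uses only the standard library (urllib) rather than requests so that it runs without extra installs. The URL patterns are copied from the commands above and are assumptions about the service that may need adjusting. parse_accessions and download_project are hypothetical helper names.

```python
import time
import urllib.request

# URL patterns copied from the wget commands above (assumed to still be valid).
LIST_URL = ("https://www.ebi.ac.uk/ena/data/warehouse/filereport"
            "?accession={project}&result=read_run&fields=sample_accession&download=txt")
VIEW_URL = "https://www.ebi.ac.uk/ena/data/view/{accession}&display=xml"


def parse_accessions(report_text):
    """Drop the 'sample_accession' header line and duplicates, keeping order."""
    accessions = []
    for line in report_text.splitlines():
        acc = line.strip()
        if acc and acc != "sample_accession" and acc not in accessions:
            accessions.append(acc)
    return accessions


def fetch(url, timeout=60):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_project(project, delay=2.0):
    """Save one <accession>.xml file per sample in the current directory."""
    report = fetch(LIST_URL.format(project=project)).decode("utf-8")
    for acc in parse_accessions(report):
        with open(acc + ".xml", "wb") as handle:
            handle.write(fetch(VIEW_URL.format(accession=acc)))
        time.sleep(delay)  # pause between requests to go easy on the server
```

Calling download_project("PRJEB99111") would then write the XML files, one per sample accession.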

ADD COMMENT
1

It worked fine for me. I modified the code a little bit:

code:

$ wget 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB99111&result=read_run&fields=sample_accession&download=txt' -O samples.lst
$ sed 1d samples.lst | parallel --delay 2 'wget "https://www.ebi.ac.uk/ena/data/view/{}&display=xml" -O {}.xml'

Note:

  1. The sample list file comes with a header line, so the first line is removed (sed 1d).
  2. I used a delay of 2 seconds in parallel; this can be shortened or removed. The output is in XML format, one file per sample, named after the sample accession with an .xml extension.
ADD REPLY
0

Unfortunately, NCBI does not contain metadata for this project. I get the error "Unable to establish SSL connection" using your code. I have tried Python's requests library, but after one successful XML read the connection fails when I try to read again. You can see my sample code here: python stopped opening xml url, connection closed.

ADD REPLY
0

Are you behind an HTTP proxy?

ADD REPLY
0

I am ssh-ed into a remote server. I didn't ssh using a key - that is the "key" to solving my problem ;)

ADD REPLY
0
7.2 years ago
LLTommy ★ 1.2k

You can also get ENA metadata from EBI's BioSamples database. For example, the sample SAMEA104228123 you mentioned can be found at https://www.ebi.ac.uk/biosamples/samples/SAMEA104228123. You can get the data in XML but also in JSON (find the button in the right corner) via the API (e.g. https://www.ebi.ac.uk/biosamples/api/samples/SAMEA104228123).
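
Once the JSON is downloaded, the attributes can be flattened with a few lines of Python. This sketch assumes the record has a "characteristics" object that maps each attribute name to a list of entries with a "text" field. The embedded record and its accession are made up for the demo, and characteristics_to_dict is a hypothetical helper name.

```python
import json

# Made-up, shortened response body used only for the demo (assumed layout).
DEMO_JSON = """{
  "accession": "SAMEA_DEMO",
  "characteristics": {
    "sex": [{"text": "female"}],
    "age": [{"text": "61"}]
  }
}"""


def characteristics_to_dict(record):
    """Flatten {"name": [{"text": ...}, ...]} into {"name": "text1; text2"}."""
    flat = {}
    for name, entries in record.get("characteristics", {}).items():
        flat[name] = "; ".join(str(entry.get("text", "")) for entry in entries)
    return flat


if __name__ == "__main__":
    record = json.loads(DEMO_JSON)
    print(record["accession"], characteristics_to_dict(record))
```

Multiple values for the same attribute are joined with "; " so that each sample still fits on one table row.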

ADD COMMENT
0

I did not know about the JSON, that is interesting.

ADD REPLY
1

Glad I could help. If you are interested in the API and JSON, have a look at the API documentation for BioSamples: https://www.ebi.ac.uk/biosamples/help/api

ADD REPLY
0
2.1 years ago
Polina ▴ 10

In case it is helpful to anyone else: since I quite often ran into the problem of downloading datasets and extracting metadata, I created a Python tool, ENATool, which downloads XML from the ENA browser, parses it to CSV format, and also allows downloading the raw data. The ENA and NCBI databases overlap considerably, so it is quite an easy way of dealing with freely published data.

ADD COMMENT
