Question

Download Illumina reads from the ENA database

6.8 years ago

luyang1005 ▴ 20

Hi,

I found the EBI accession number 'ERP012803' in the paper 'Improved Bacterial 16S rRNA Gene (V4 and V4-5) and Fungal Internal Transcribed Spacer Marker Gene Primers for Microbial Community Surveys'.

Using this number, I searched for the sequence files on the ENA website (https://www.ebi.ac.uk/ena). The search returned 19,327 runs, and I do not know how to download them.

Can anyone help me with that?

Thanks.

RNA-Seq sequence • 3.0k views

ADD COMMENT • link updated 6.8 years ago by lieven.sterck 15k • written 6.8 years ago by luyang1005 ▴ 20

Are you sure you want to download the entire dataset? You could download the XML summary and then parse that file.
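As a rough illustration of the "parse the XML summary" idea, here is a minimal Python sketch. The XML layout below is an assumption (a cut-down stand-in with `XREF_LINK` entries holding a `DB` and a comma-separated `ID`), and the second run accession and the sample accession are placeholders, so check the real file you download before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed layout: a trimmed-down stand-in for a study XML summary.
# ERR2032137 and ERS0000001 are placeholders, not real lookups.
SAMPLE_XML = """<ROOT>
  <STUDY accession="ERP012803">
    <STUDY_LINKS>
      <STUDY_LINK>
        <XREF_LINK><DB>ENA-RUN</DB><ID>ERR2032136,ERR2032137</ID></XREF_LINK>
      </STUDY_LINK>
      <STUDY_LINK>
        <XREF_LINK><DB>ENA-SAMPLE</DB><ID>ERS0000001</ID></XREF_LINK>
      </STUDY_LINK>
    </STUDY_LINKS>
  </STUDY>
</ROOT>"""


def linked_ids(xml_text, db="ENA-RUN"):
    """Collect the IDs of every XREF_LINK whose DB element equals `db`."""
    root = ET.fromstring(xml_text)
    ids = []
    for link in root.iter("XREF_LINK"):
        if link.findtext("DB") == db:
            raw = link.findtext("ID") or ""
            # IDs are assumed to be comma-separated inside one element.
            ids.extend(part.strip() for part in raw.split(",") if part.strip())
    return ids


run_ids = linked_ids(SAMPLE_XML)
print(run_ids)
```

If the real summary uses different element names, only the `"XREF_LINK"`, `"DB"` and `"ID"` strings should need changing.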

ADD REPLY • link 6.8 years ago by GenoMax 147k

This post will get you on your way: "A: Downloading Multi Experiment .Sra Files From Ncbi Archive Automatedly".

Basically: get the TXT-format list of runs, select the column with the FASTQ FTP URLs, and wget those on the Linux command line.
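A minimal Python sketch of the "select the column" step, assuming the run list is tab-separated with a header row and a column named `fastq_ftp` whose entries have no `ftp://` prefix and are separated by `;` when a run has several files. The column name and the example path are assumptions, so compare them with the file you actually get.

```python
import csv
import io

# Stand-in for a saved run list (tab-separated). The path is illustrative
# and was not copied from ENA.
REPORT = (
    "run_accession\tfastq_ftp\n"
    "ERR2032136\tftp.sra.ebi.ac.uk/vol1/fastq/ERR203/006/ERR2032136/ERR2032136.fastq.gz\n"
)


def fastq_urls(report_text, column="fastq_ftp"):
    """Return one download URL per FASTQ file listed in `column`."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    urls = []
    for row in reader:
        # Paired-end runs are assumed to list both files separated by ';'.
        for path in (row.get(column) or "").split(";"):
            path = path.strip()
            if path:
                # Add a scheme only when the entry does not carry one.
                urls.append(path if "://" in path else "ftp://" + path)
    return urls


urls = fastq_urls(REPORT)
print("\n".join(urls))
```

Writing the printed URLs to a file (say `urls.txt`) and running `wget -i urls.txt` would then fetch them one by one.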

ADD REPLY • link 6.8 years ago by lieven.sterck 15k

Unfortunately, this is not going to work here. The ENA entry for ERP012803 only lists the XML format file download (no TXT).

If you go into one of the samples then you have the option to "Bulk download the data" (if you have Java installed). I did not try to see if that downloads the entire set or just that sample for obvious reasons :) Individual samples will show the ftp URL (if you look at the TXT output) but that would still require a lot of parsing.

ADD REPLY • link 6.8 years ago by GenoMax 147k

Thanks. I can bulk download. I clicked 'ERR2032136 Illumina MiSeq sequencing; qiita_ptid_2603:10317.000026630' under Run (19,327 results found) and got the reads shown below. From the header information, I cannot identify the 10 samples mentioned in the paper. Is there any demultiplexing information for these sequences? I suspect I do not fully understand this database.

    head ERR2032136.fastq
    @ERR2032136.1 10317.000026630_0/1
    TCTGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGTGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCACCCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAG
    +
    >>>>>DDFBDFFB1FEEGEEFGGGG?EEHB2FFCHFFGFGFFFFEECA?AE/AFFG/1A/>>/FHHGGBGGH1F@1FFFEE/>/>E/F1BEEEG///<FHH1?FG11?DCG1>1CFGCG1FDCFGF1ACCCC?CGCFH0C---;CC0;00
    @ERR2032136.2 10317.000026630_1/1
    TACGTAGGTGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGGATTGCAAGTCAGATGTGAAAACTGGGGGCTCAACCTCCAGCCTGCATTTGAAACTGTAGTTCTTGAGTGCTGGAGAGGCAATCGGAATTCCGTG
    +
    >3>AAFFBFFAFGGGGGGGCGGHGGGGGHHGHHHHHGGHHHHGGGGAFFGHHHGGGGGHHHHHHHHEHHHHGHFFHFFHHHHGGEGHHHHGBCFFHFGFGHGFHHFHDFHHHHGGHHHHHHHHHHGHFHHGEGGGGCBFEGDGGHHGGHE
    @ERR2032136.3 10317.000026630_2/1
    TCCGTAGGTGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGGATTGCAAGTCAGATGTGAAAACTGGGGGCTCAACCTCCAGCCTGCATTTTAAACTGTAGTTCTTGAGTGCTGGAGAGGCAATCGGAATTCCGTG
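For what it is worth, the second field of each header (`10317.000026630`) matches the name in the run title, which suggests that this file already holds a single sample. A small Python sketch that splits such a header into its parts follows; the field names (`run`, `read`, `sample`, `index`, `mate`) are my own labels, and the pattern is only inferred from the three headers shown above.

```python
import re

# Pattern inferred from headers like "@ERR2032136.1 10317.000026630_0/1".
# The group names are illustrative labels, not official field names.
HEADER = re.compile(
    r"^@(?P<run>[A-Z]+\d+)\.(?P<read>\d+) "
    r"(?P<sample>.+)_(?P<index>\d+)/(?P<mate>\d)$"
)


def parse_header(line):
    """Split one FASTQ header line into a dict of its components."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    return match.groupdict()


info = parse_header("@ERR2032136.1 10317.000026630_0/1")
print(info["run"], info["sample"])
```

Collecting the distinct `sample` values over a whole file would show whether more than one sample is present.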

ADD REPLY • link 6.8 years ago by luyang1005 ▴ 20

OK, I've looked into this specifically, and the issue turns out to be a little more complicated than described.

The ERP number you mention does indeed come from the paper. However, it is the accession for the American Gut microbiome project at EBI (so yes, it contains a huge number of runs and files).

However, if you read the text of the paper carefully, the authors mention that they only used a subset of 10 samples from those 19k+ (so, as GenoMax indicated, you do not need the full dataset!). Moreover, you will need data files from other ERP numbers as well if you want the complete dataset used in this paper.
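Once you know which 10 sample names the paper used, picking their runs out of a run list could look like the Python sketch below. It assumes a tab-separated list with `run_accession` and `sample_alias` columns (the second column name is an assumption), and the second row is a placeholder, not a real accession.

```python
import csv
import io

# Stand-in run list. The second row is a placeholder, not a real record.
REPORT = (
    "run_accession\tsample_alias\n"
    "ERR2032136\t10317.000026630\n"
    "ERR0000000\tplaceholder.sample\n"
)


def runs_for_samples(report_text, wanted, column="sample_alias"):
    """Return run accessions whose sample name is in the `wanted` set."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [row["run_accession"] for row in reader if row.get(column) in wanted]


# Replace this set with the sample names taken from the paper.
wanted = {"10317.000026630"}
runs = runs_for_samples(REPORT, wanted)
print(runs)
```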

If you do not want the data from this paper but really do want all the data from the gut microbiome project, you can use the approach I mentioned before, but with this EBI number: PRJEB11419
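For the whole project, the run list can also be requested by URL instead of through the browser. The Python sketch below only builds that URL. The endpoint and parameter names are what I understand ENA's Portal API file report to accept; it is newer than this thread, so treat them as assumptions and check ENA's documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint: ENA Portal API file report (verify before use).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build a URL asking for a tab-separated run list for `accession`."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "tsv",
        },
        safe=",",  # keep the commas in the field list readable
    )
    return f"{ENA_FILEREPORT}?{query}"


url = filereport_url("PRJEB11419")
print(url)
```

Fetching that URL (for example with `wget -O runs.tsv "<url>"`) should give a table whose `fastq_ftp` column can feed the wget step.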