Download raw data from the 1000 Genomes Project from the terminal
7.1 years ago

Dear all,

Is it possible to download raw sequences from the 1000 Genomes Project (http://www.internationalgenome.org/) from the terminal in an iterative way? I am looking for fastq or sra files to download from the command line, but I find it difficult to understand how the URLs of the sequence files are built, so that I could fetch the data I need starting from a simple sample name.

For instance, the portal page of the project links to a spreadsheet describing the samples used in the project (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx). There I can look up, for example, the samples NA06984 and NA06985: the former was sequenced on the Illumina platform and the latter on ABI SOLiD. The data page (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/) lists one folder per sample, the first two being NA06984 and NA06985. The fastq file of the former is in the NA06984 folder at the URL ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/NA06984/sequence_read/SRR006041.recal.fastq.gz, which is fairly straightforward, even though the sample name NA06984 does not map directly to the name SRR006041 given to the raw file. The NA06985 folder, on the other hand, contains a series of files, the first of which are: SRR008003.fastq.gz, SRR008003_1.fastq.gz, SRR008003_2.fastq.gz, SRR008004.fastq.gz, SRR008004_1.fastq.gz, SRR008004_2.fastq.gz. These names are not listed in the sample information spreadsheet.

For further reference, the next sample, NA06986, was sequenced with ABI and also has many entries; NA07037 is listed as sequenced with Illumina, but the corresponding folder contains plenty of files prefixed with 'ERR' (error?); NA07051, also sequenced with Illumina, has a mix of files, for instance: ERR000552_1.recal.fastq.gz, ERR000552_2.recal.fastq.gz, SRR003539_1.recal.fastq.gz, SRR003539_2.recal.fastq.gz, SRR003540_1.recal.fastq.gz, SRR003540_2.recal.fastq.gz. The way the files are named looks a bit erratic to me, which makes it difficult to write a simple parameter expansion to pick up the files.

Is there a structure to follow in order to download the raw data of the 1000 Genomes Project? The aim is to download the data easily from the terminal rather than navigating the files manually in the browser. It should also be possible to distinguish between the different sequencing technologies used, but this information is not directly present in the file names.

Thank you

7.1 years ago

Hello,

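No, 'ERR' does not mean error. It's just an unfortunate prefix: 'ERR' marks run accessions submitted through the European Nucleotide Archive (ENA), in the same way that 'SRR' marks runs submitted through the NCBI SRA.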

I would focus on the Phase III data and specifically look at the file index listing, which contains relative paths: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.sequence.index (careful: it is a large, 64 MB text file). It contains information on the platform used for sequencing, whether a run is single- or paired-end, the ethnic group (POPULATION), and lots of other fields. It should be easy to 'pluck' out the samples/files that you want and then automate the download via the value in the FASTQ_FILE column (first column).

FASTQ_FILE MD5 RUN_ID STUDY_ID STUDY_NAME CENTER_NAME SUBMISSION_ID SUBMISSION_DATE SAMPLE_ID SAMPLE_NAME POPULATION EXPERIMENT_ID INSTRUMENT_PLATFORM INSTRUMENT_MODEL LIBRARY_NAME RUN_NAME RUN_BLOCK_NAME INSERT_SIZE LIBRARY_LAYOUT PAIRED_FASTQ WITHDRAWN WITHDRAWN_DATE COMMENT READ_COUNT BASE_COUNT ANALYSIS_GROUP

data/NA19238/sequence_read/ERR000018.filt.fastq.gz 3b092ef1661e2a8ff85050e01242707d ERR000018 SRP000032 1000Genomes Project Pilot 2 BGI ERA000013 2008-08-14 00:00:00 SRS000212 NA19238 YRI ERX000014 ILLUMINA Illumina Genome Analyzer HU1000RADCAASE BGI-FC307N0AAXX 0 SINGLE 0 9280498 334097928 high coverage

data/NA19238/sequence_read/ERR000019.filt.fastq.gz fcb89b0a755773872f1b073d0a518e0a ERR000019 SRP000032 1000Genomes Project Pilot 2 BGI ERA000013 2008-08-14 00:00:00 SRS000212 NA19238 YRI ERX000014 ILLUMINA Illumina Genome Analyzer HU1000RADCAASE BGI-FC307AWAAXX 0 SINGLE 0 9571982 344591352 high coverage

data/NA19240/sequence_read/ERR000020.filt.fastq.gz dcd4ff7db25a75e462beaa75eb167bea ERR000020 SRP000032 1000Genomes Project Pilot 2 BGI ERA000013 2008-08-14 00:00:00 SRS000214 NA19240 YRI ERX000016 ILLUMINA Illumina Genome Analyzer II QRAACDEAAPE BGI-FC206YCAAXX_3 345 PAIRED 0 149044 5365584 high coverage

data/NA19240/sequence_read/ERR000020_1.filt.fastq.gz fb5d7eb5137aa173f9f9ec344bd7a8e7 ERR000020 SRP000032 1000Genomes Project Pilot 2 BGI ERA000013 2008-08-14 00:00:00 SRS000214 NA19240 YRI ERX000016 ILLUMINA Illumina Genome Analyzer II QRAACDEAAPE BGI-FC206YCAAXX_3
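
The 'plucking' step described above could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions: the index is tab-separated, its header line starts with FASTQ_FILE (optionally preceded by '#'), a WITHDRAWN value of 1 marks a withdrawn file, and the relative paths resolve against the directory that holds the index file. The names BASE_URL and select_fastq_urls are hypothetical, so check one resulting URL by hand before starting a bulk download.

```python
import csv
from typing import Iterable, Iterator

# Assumption: relative paths in the index resolve against the directory
# that contains the index file itself. Verify one URL manually first.
BASE_URL = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/"


def select_fastq_urls(
    lines: Iterable[str], sample: str, platform: str = "ILLUMINA"
) -> Iterator[str]:
    """Yield download URLs for one sample sequenced on one platform."""
    header = None
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        if header is None:
            # Skip anything before the header line; the header is taken to be
            # the first line whose first field is FASTQ_FILE (a leading '#'
            # is tolerated).
            if row[0].lstrip("#").strip() == "FASTQ_FILE":
                header = [name.lstrip("#").strip() for name in row]
            continue
        record = dict(zip(header, row))
        # Assumption: WITHDRAWN == "1" flags files that should not be used.
        if record.get("WITHDRAWN") == "1":
            continue
        if (
            record.get("SAMPLE_NAME") == sample
            and record.get("INSTRUMENT_PLATFORM") == platform
        ):
            yield BASE_URL + record["FASTQ_FILE"]
```

With a local copy of the index, something like `with open("20130502.phase3.sequence.index") as fh: print("\n".join(select_fastq_urls(fh, "NA19240")))` would print one URL per line, and that list can be saved to a file and handed to `wget -i`.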

I also have a tutorial, 'Produce PCA bi-plot for 1000 Genomes Phase III in VCF format', that shows how to automatically download the 1000 Genomes Phase III VCF files, which may or may not help.

Kevin


Thank you! I have been trying to find an answer to this issue and you gave me the answer straight away. Very useful, I will use it. Best regards


No problem. You can reply here if you encounter any other issues.


All good so far. I just wanted to ask this: what is the difference between a fastq file with no suffix and those with the _1/_2 suffix, for instance ERR000044.filt.fastq.gz, ERR000044_1.filt.fastq.gz, ERR000044_2.filt.fastq.gz? I reckon the latter two are the mates, but what about the former? Thanks


Ah, yes, the _1 and _2 files are the two mates of a paired-end run and need to be matched together. In the 1000 Genomes data (and listed in the index file that I mentioned), there is also a column, LIBRARY_LAYOUT, that has the value SINGLE or PAIRED.
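
Matching the mates can be sketched from the file names alone. The sketch below assumes the naming convention seen in this thread (run accession, an optional _1 or _2, then the rest of the file name); the helper names group_by_run and complete_pairs are hypothetical. Files without a suffix are simply kept under an 'unpaired' key: in a paired run they appear to hold reads that are not part of a pair, but that reading is an assumption worth checking against the project's FTP README.

```python
import re
from collections import defaultdict

# Run accession (ERR/SRR/DRR + digits), optional _1/_2 mate suffix, then a dot.
MATE_PATTERN = re.compile(r"^(?P<run>[A-Z]RR\d+)(?:_(?P<mate>[12]))?\.")


def group_by_run(paths):
    """Map each run accession to its files, keyed by '1', '2' or 'unpaired'."""
    runs = defaultdict(dict)
    for path in paths:
        name = path.rsplit("/", 1)[-1]  # file name without directories
        match = MATE_PATTERN.match(name)
        if match is None:
            continue  # not a recognisable run file name
        key = match.group("mate") or "unpaired"
        runs[match.group("run")][key] = path
    return dict(runs)


def complete_pairs(runs):
    """Keep only runs for which both mate files are present."""
    return {
        run: (files["1"], files["2"])
        for run, files in runs.items()
        if "1" in files and "2" in files
    }
```

Passing the FASTQ_FILE values of one sample through group_by_run and then complete_pairs gives the (mate 1, mate 2) tuples that a paired-end aligner expects.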


Also, there are multiple files for the same individual. For instance, for the sample NA12878 there are 108 mate pairs (from ERR001268 to ERR001775). Is there a rationale for extracting a single pair of sequence files for each sample?


I'm not sure that there is any standard rationale for choosing these. However, you should generally aim to match all of your samples on the following columns:

  • INSTRUMENT_PLATFORM
  • INSTRUMENT_MODEL
  • LIBRARY_NAME
  • PAIRED_FASTQ

If you still have multiple candidate runs for a sample after matching on these, then I would go by:

  • READ_COUNT
  • BASE_COUNT

(choose the run with the highest values)
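
One way to apply these criteria is sketched below. It is a simplification rather than a standard method: it filters on INSTRUMENT_PLATFORM, INSTRUMENT_MODEL and LIBRARY_LAYOUT only (LIBRARY_NAME and PAIRED_FASTQ are left out for brevity), and it assumes that BASE_COUNT is reported per file, so the values of all files of a run are summed before comparing runs. The function name pick_best_run is hypothetical, and the records are expected to be dictionaries keyed by the index column names, such as those produced by csv.DictReader.

```python
from collections import defaultdict


def pick_best_run(records, platform, model, layout="PAIRED"):
    """Return {sample name: run accession with the largest summed BASE_COUNT}."""
    totals = defaultdict(int)  # (sample, run) -> bases summed over the run's files
    for rec in records:
        if (
            rec.get("INSTRUMENT_PLATFORM") != platform
            or rec.get("INSTRUMENT_MODEL") != model
            or rec.get("LIBRARY_LAYOUT") != layout
        ):
            continue
        # Missing or empty counts are treated as zero.
        bases = rec.get("BASE_COUNT") or "0"
        totals[(rec["SAMPLE_NAME"], rec["RUN_ID"])] += int(bases)

    best = {}  # sample -> (run, total bases)
    for (sample, run), total in totals.items():
        if sample not in best or total > best[sample][1]:
            best[sample] = (run, total)
    return {sample: run for sample, (run, _total) in best.items()}
```

Using the same platform and model arguments for every sample keeps the selected runs comparable; samples that have no run matching the filter are simply absent from the result.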


Fair enough, thank you.
