Dear all,
Is it possible to download raw sequences from the 1000 Genomes Project (http://www.internationalgenome.org/) from the terminal in an automated, repeatable way? I am looking for fastq or sra files that can be downloaded from the command line, but I find it difficult to work out the URL of the sequences so that I can fetch the data I need from a simple sample name.
For instance, the project portal links to a spreadsheet of the samples used in the project (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx). Looking up the samples NA06984 and NA06985 there shows that the former was sequenced on the Illumina platform and the latter on ABI SOLiD. The data page (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/) lists a set of folders, the first two of which are NA06984 and NA06985. The fastq file of the former is in the NA06984 folder at the URL ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/NA06984/sequence_read/SRR006041.recal.fastq.gz, which is fairly straightforward, even though the sample name NA06984 does not directly match the name SRR006041 given to the raw file. The NA06985 folder, on the other hand, holds a series of files, the first of which are: SRR008003.fastq.gz, SRR008003_1.fastq.gz, SRR008003_2.fastq.gz, SRR008004.fastq.gz, SRR008004_1.fastq.gz, SRR008004_2.fastq.gz. These names are not reported in the spreadsheet with the samples' information.
For further reference, the next sample, NA06986, was sequenced with ABI and also has many entries; NA07037 is reported as sequenced with Illumina, but the corresponding folder contains plenty of files labelled 'ERR' (error?); NA07051, also sequenced with Illumina, has a mix of files, for instance: ERR000552_1.recal.fastq.gz, ERR000552_2.recal.fastq.gz, SRR003539_1.recal.fastq.gz, SRR003539_2.recal.fastq.gz, SRR003540_1.recal.fastq.gz, SRR003540_2.recal.fastq.gz. The file naming looks a bit erratic to me, which makes it difficult to write a simple parameter expansion to pick up the files.
Is there, therefore, a structure to follow in order to download the raw data of the 1000 Genomes Project? The aim is to download the data easily from the terminal rather than navigating the files manually in the browser. I would also like to discriminate between the different sequencing technologies used, but this information is not directly present in the file names.
Thank you
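The folder layout described in the question can be written down as a small sketch. It assumes that every pilot sample follows the same `<sample>/sequence_read/<file>` pattern as the NA06984 example, which the thread does not confirm, and the helper names `fastq_url` and `describe` are made up for illustration. On the 'ERR' question: as far as I know, the three-letter prefix of a run accession only says which archive issued it (SRR from NCBI SRA, ERR from EBI ENA, DRR from DDBJ) and has nothing to do with errors.

```python
import re

# Base folder quoted in the question; treating it as a fixed template is an assumption.
BASE = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data"

# Run accession prefixes and the archive that issues them.
ARCHIVE_BY_PREFIX = {"SRR": "NCBI SRA", "ERR": "EBI ENA", "DRR": "DDBJ"}

# Matches names such as SRR008003.fastq.gz or ERR000552_1.recal.fastq.gz.
RUN_FILE = re.compile(r"^(?P<run>[SED]RR\d+)(?:_(?P<mate>[12]))?\.")


def fastq_url(sample, filename):
    """Build the download URL for one file of one sample (hypothetical helper)."""
    return f"{BASE}/{sample}/sequence_read/{filename}"


def describe(filename):
    """Split a fastq file name into run accession, issuing archive and mate number."""
    match = RUN_FILE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised run file name: {filename}")
    run = match.group("run")
    return {
        "run": run,
        "archive": ARCHIVE_BY_PREFIX[run[:3]],
        "mate": match.group("mate"),  # None when the name has no _1/_2 suffix
    }


if __name__ == "__main__":
    print(fastq_url("NA06984", "SRR006041.recal.fastq.gz"))
    print(describe("ERR000552_1.recal.fastq.gz"))
```

The resulting URL can then be handed to whatever downloader you prefer. Note that the platform (Illumina or ABI) cannot be recovered from the file name this way; it has to come from a metadata table.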
Thank you! I have been trying to find an answer to this issue and you gave me the answer straight away. Very useful, I will use it. Best regards
No problem. You can reply here if you encounter any other issues.
All good so far. Just wanted to ask you this: what is the difference between a fastq file with no suffix and those with the 1/2 suffix, for instance ERR000044.filt.fastq.gz, ERR000044_1.filt.fastq.gz, ERR000044_2.filt.fastq.gz? I reckon the latter two are the mates, what about the former? Tx
Ah, yes, those are mate-pairs and need to be matched together. In the 1000 Genomes data (and listed in that file that I mentioned), there should also be a column that has SINGLE or PAIRED.
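Filtering on that SINGLE/PAIRED column, together with the platform, could look roughly like the sketch below. The column names (`SAMPLE_NAME`, `RUN_ID`, `INSTRUMENT_PLATFORM`, `LIBRARY_LAYOUT`) and the tab-separated layout are assumptions about the index file, and the three rows are toy data: the sample and run names come from this thread, but the layout values are invented for illustration. As far as I understand the 1000 Genomes layout, the unsuffixed file of a paired run holds the reads left without a mate after filtering, but please check the project README to confirm.

```python
import csv
import io

# Toy stand-in for the real index file; column names and values are assumptions.
TOY_INDEX = (
    "SAMPLE_NAME\tRUN_ID\tINSTRUMENT_PLATFORM\tLIBRARY_LAYOUT\n"
    "NA06984\tSRR006041\tILLUMINA\tSINGLE\n"
    "NA06985\tSRR008003\tABI_SOLID\tPAIRED\n"
    "NA07051\tERR000552\tILLUMINA\tPAIRED\n"
)


def select_runs(handle, platform=None, layout=None):
    """Return (sample, run) pairs matching the requested platform and layout."""
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        if platform is not None and row["INSTRUMENT_PLATFORM"] != platform:
            continue
        if layout is not None and row["LIBRARY_LAYOUT"] != layout:
            continue
        selected.append((row["SAMPLE_NAME"], row["RUN_ID"]))
    return selected


if __name__ == "__main__":
    # With the real file, open it with open(path, newline="") instead of StringIO.
    print(select_runs(io.StringIO(TOY_INDEX), platform="ILLUMINA", layout="PAIRED"))
```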
Also, there are multiple files for the same patient. For instance, for the sample NA12878 there are 108 mate pairs (from ERR001268 to ERR001775). Is there a rationale for extracting a single pair of sequences for each sample?
I'm not sure that there is any standard rationale for choosing these. However, you should generally aim to match all of your samples based on
If you still have multiple samples after matching on these, then I would go by:
(choose the one with the highest)
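Picking one run per sample by "the highest" value of some metric can be sketched as below. Which metric to rank on is not stated here, so the `score` function is left to the caller; the `metric` field and its numbers are placeholders invented for the example, not real values for these runs.

```python
from collections import defaultdict


def pick_one_run_per_sample(runs, score):
    """Group runs by sample and keep the run with the highest score in each group."""
    by_sample = defaultdict(list)
    for run in runs:
        by_sample[run["sample"]].append(run)
    # max() keeps the first run encountered when scores are tied.
    return {sample: max(group, key=score) for sample, group in by_sample.items()}


if __name__ == "__main__":
    # Placeholder records: run names from the thread, metric values made up.
    toy_runs = [
        {"sample": "NA12878", "run": "ERR001268", "metric": 10},
        {"sample": "NA12878", "run": "ERR001775", "metric": 25},
        {"sample": "NA06985", "run": "SRR008003", "metric": 7},
    ]
    chosen = pick_one_run_per_sample(toy_runs, score=lambda r: r["metric"])
    for sample, run in chosen.items():
        print(sample, run["run"])
```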
Fair enough, thank you.