Question

Downloading genomes for Drosophila species

10.2 years ago

steven ▴ 70

I am trying to download either full genomes or assembled WGS sequences (depending on what is available) for several Drosophila species.

For most species, I found an entry in the NCBI Genome database (e.g., http://www.ncbi.nlm.nih.gov/genome/genomes/3489) that linked to a WGS download page offering zipped FASTA files (http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AFFE02, Downloads tab). Each was around 50 MB zipped.

However, several species were not available in the Genome database, and I could only find them in the SRA database. Once downloaded and converted to FASTQ format, they turned out to be very large files (three were around 10 GB, one was 26 GB), which seemed strange compared with the 50 MB archives.

Why are the .sra and FASTQ files so much larger than the zipped WGS files?

Thanks!

genome wgs sra • 2.6k views

updated 2.6 years ago by Ram 45k • written 10.2 years ago by steven ▴ 70

score 2 · Accepted Answer · 2015-06-04

The best place to download Drosophila genomes (and annotations, FASTA files of genes or peptides, etc.) is FlyBase.

The .fastq and .sra files do not contain assembled genomes; they contain raw sequencing reads (or sometimes .bam alignments), which is why they are much larger. Raw reads typically cover each position of the genome many times over, and FASTQ also stores a quality score for every base. You will have to check whether those samples were analyzed, who deposited them, whether they have been published, and so forth. If the data have not been published, contact the depositor before using them, so you avoid publishing something they are already working on.
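To make the size difference concrete, here is a rough back-of-the-envelope sketch in Python. The numbers are illustrative assumptions, not values from the datasets in the question: a genome of about 150 Mb, 30x sequencing coverage, 100 bp reads, and about 50 bytes of header text per read. The function name `estimate_fastq_bytes` is made up for this example.

```python
def estimate_fastq_bytes(genome_size_bp, coverage, read_length, header_bytes=50):
    """Rough size of an uncompressed FASTQ file, in bytes.

    All inputs are assumptions chosen for illustration; real runs vary.
    """
    # Number of reads needed to cover the genome `coverage` times over.
    n_reads = genome_size_bp * coverage // read_length
    # One FASTQ record is four newline-terminated lines:
    # header, bases, a '+' separator, and one quality character per base.
    bytes_per_read = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    return n_reads * bytes_per_read


genome_size_bp = 150_000_000  # assumed genome size, about 150 Mb
fastq_bytes = estimate_fastq_bytes(genome_size_bp, coverage=30, read_length=100)

# An assembled genome in FASTA needs roughly one byte per base before compression.
fasta_bytes = genome_size_bp

print(f"FASTQ (raw reads): {fastq_bytes / 1e9:.1f} GB")
print(f"FASTA (assembly):  {fasta_bytes / 1e6:.0f} MB uncompressed")
```

Under these assumptions the raw reads come to several gigabytes while the assembly is on the order of a hundred megabytes before zipping, which is the same order-of-magnitude gap described in the question. Deeper coverage or longer headers push the FASTQ size up further.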