Question

Is There A Website With FASTQ-Encoded DNA Data Of Various Organisms (Animals)?

2

13.3 years ago

Martyix ▴ 120

I have found the DNA corpus, which contains a few DNA sequences (chromosomes) of yeast, mouse, A. thaliana and human. It is in FASTA format (only the letters a, c, t, g), and as far as I know I cannot convert it to FASTQ because I don't have quality scores (I tried a few tools but none of them worked for me).

My motivation: I would like to create a phylogenetic tree with DSRC, which accepts only FASTQ. This work is for my introductory bioinformatics course.

Thanks!

dna fastq • 5.3k views

ADD COMMENT • link updated 11.8 years ago by Biostar 20 • written 13.3 years ago by Martyix ▴ 120

3

I've skimmed that PDF and still don't understand why FASTQ is required. The paper uses compressed sequences as input to an algorithm to determine sequence relatedness. Surely any format would do. I'm also wary of this kind of publication. It looks like computer scientists with little grasp of biology trying to solve a problem that doesn't exist.

ADD REPLY • link 13.3 years ago by Neilfws 49k

2

You may be confused about what DSRC can do. From its description: "DSRC is able to: compress files from DNA sequencing in FASTQ format, decompress whole file, decompress only a single record without decompressing the complete file." Nothing about phylogenetic trees.

ADD REPLY • link 13.3 years ago by Michael 55k

2

OK, so you want to use DSRC for compression and DSRC works only with FASTQ. I think you should reconsider the approach. You won't find long sequences such as chromosomes in FASTQ format, because FASTQ is typically used to represent sequencing reads - very often, short reads. So I'd use FASTA instead and use another tool which can compress that.

ADD REPLY • link 13.3 years ago by Neilfws 49k

1

I know, but phylogenetic trees can be constructed with the help of compression programs. The results may be only approximate, but that suffices in my case. Details are here: http://www1.spms.ntu.edu.sg/~chenxin/paper/GIW99.pdf
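To make the idea concrete, here is a minimal sketch of a compression-based distance. It uses the normalized compression distance (NCD) with Python's general-purpose `zlib`, which is a stand-in and not necessarily the distance or compressor used in the linked paper. The function names are my own. Note that zlib only looks back 32 KB, so this is only meaningful for short sequences; for whole chromosomes a compressor with a larger window (for example `lzma`) would be needed.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs, close to 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs):
    """Pairwise NCD for a dict mapping sequence name -> sequence bytes."""
    names = sorted(seqs)
    return {
        (a, b): ncd(seqs[a], seqs[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

The resulting pairwise distances could then be passed to any distance-based tree builder such as neighbor joining or UPGMA.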

ADD REPLY • link 13.3 years ago by Martyix ▴ 120

0

@neilfws: FASTQ is required since DSRC is able to compress only this format and it fails on any other. I'm completely aware of the fact that I'm doing a task that I don't completely understand but we all started somehow :-)
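If the only obstacle is the missing quality line, one workaround is to write a placeholder quality string. The sketch below is an assumption-laden illustration, not something tested against DSRC: the function names are hypothetical, every base gets the same made-up quality character (`I`, which is Phred 40 in the common Phred+33 encoding), and I have not checked whether DSRC accepts chromosome-length records or lowercase bases.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_fastq(lines: Iterable[str], placeholder: str = "I") -> Iterator[str]:
    """Yield four-line FASTQ records whose quality string is a repeated placeholder."""
    for name, seq in read_fasta(lines):
        yield f"@{name}\n{seq}\n+\n{placeholder * len(seq)}\n"
```

Usage would look like `out.writelines(fasta_to_fastq(open("yeast.fa")))`, where the file name is just an example. Keep in mind that the fake quality line is constant and therefore compresses to almost nothing, but it is still part of what the compressor sees.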

ADD REPLY • link 13.3 years ago by Martyix ▴ 120

0

neilfws: Oh, thank you for the information! I know I can use general-purpose compression algorithms on FASTA, but I would really like to find a specialized algorithm created for this purpose (I was silly to presume there are many of these, but it is a nontrivial task for me to find a paper together with an implementation). Could you please tell me if you know of any such specialized algorithm, other than gencompress and biocompress?

ADD REPLY • link 13.3 years ago by Martyix ▴ 120

score 3 · Answer 1 · 2012-01-09

Ignoring the details in the comments and answering the title of your question literally, the largest source for publicly available sequence in FASTQ format is probably NCBI SRA. You'll need to convert from the SRA format to FASTQ using the SRA SDK. See the help for the site for details.
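As a rough sketch of the conversion step: current releases of the SRA Toolkit ship a `fasterq-dump` command (older ones used `fastq-dump`). The wrapper below only builds and runs that command; the function names and the default output directory are my own, `SRR000001` is just an example run accession, and the toolkit has to be installed separately.

```python
import shutil
import subprocess
from pathlib import Path


def fasterq_dump_command(run_accession: str, outdir: str = "fastq") -> list:
    """Build the fasterq-dump command line for one SRA run accession."""
    return ["fasterq-dump", run_accession, "--outdir", outdir, "--split-files"]


def download_fastq(run_accession: str, outdir: str = "fastq") -> None:
    """Run fasterq-dump for one accession; requires the SRA Toolkit on PATH."""
    if shutil.which("fasterq-dump") is None:
        raise RuntimeError("fasterq-dump not found on PATH; install the SRA Toolkit first")
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(fasterq_dump_command(run_accession, outdir), check=True)
```

Calling `download_fastq("SRR000001")` should then leave one or more `.fastq` files in the `fastq` directory.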

As for querying what information is available in SRA, you might take a look at our SRAdb package. Note that the SQLite database underlying that package can be used from any language with SQLite bindings (nearly all of them).
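For example, the database can be queried from Python with the standard `sqlite3` module. This is only a sketch: the table name `sra` and the columns `run_accession`, `scientific_name` and `taxon_id` are assumptions about the schema, so check them first (for instance with `PRAGMA table_info(sra)`), and the helper name is hypothetical.

```python
import sqlite3


def runs_for_taxon(conn: sqlite3.Connection, taxon_id: int, limit: int = 10) -> list:
    """Return (run_accession, scientific_name) rows for one NCBI taxonomy id.

    Assumes a table named `sra` with these three columns; verify against
    the actual database before relying on it.
    """
    query = (
        "SELECT run_accession, scientific_name FROM sra "
        "WHERE taxon_id = ? ORDER BY run_accession LIMIT ?"
    )
    return conn.execute(query, (taxon_id, limit)).fetchall()
```

With a downloaded copy of the database, usage would be along the lines of `runs_for_taxon(sqlite3.connect("SRAmetadb.sqlite"), 10090)`, where 10090 is the NCBI taxonomy id for mouse.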

score 1 · Answer 2 · 2012-01-09

There might be a better way to do this, but NCBI SRA allows one to query for NCBI taxonomy ids like this -- here the example is for "Mammalia"[Organism] NOT "Homo sapiens"[Organism]:

http://www.ncbi.nlm.nih.gov/sra?term="Mammalia"[Organism] NOT "Homo sapiens"[Organism]

Then, in the dropdown menu on the right-hand side, go to Find related data -> Database, choose "Links to Taxonomy" and click "Find Items".

Trying it right now, I get Results: 1 to 20 of 50, so 50 different species that have FASTQ files from next-generation sequencing runs.
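The same search term can also be sent to NCBI's E-utilities `esearch` endpoint instead of the web form. The sketch below only builds the properly encoded URL; the helper name is my own and the `retmax` default is arbitrary.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def sra_search_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL that searches the SRA database for the given term."""
    params = {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


url = sra_search_url('"Mammalia"[Organism] NOT "Homo sapiens"[Organism]')
```

Fetching `url` (for example with `urllib.request.urlopen`) should return a JSON document listing matching record ids; please respect NCBI's usage guidelines on request rates.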