I have found a DNA corpus that contains a few DNA sequences (chromosomes) of yeast, mouse, A. thaliana and human. It is in FASTA format (sequence letters a, c, t, g only), and as far as I know it cannot be converted to FASTQ because I don't have quality scores (I tried a few tools but none of them worked for me).
My motivation: I would like to create a phylogenetic tree with DSRC, which accepts only FASTQ format. This work is for my introductory bioinformatics course.
Thanks!
I've skimmed that PDF and still don't understand why FASTQ is required. The paper uses compressed sequences as input to an algorithm to determine sequence relatedness. Surely any format would do. I'm also wary of this kind of publication. It looks like computer scientists with little grasp of biology trying to solve a problem that doesn't exist.
You may be confused about what DSRC can do. From its description: "DSRC is able to: compress files from DNA sequencing in FASTQ format, decompress whole file, decompress only a single record without decompressing the complete file." Nothing about phylogenetic trees.
OK, so you want to use DSRC for compression and DSRC works only with FASTQ. I think you should reconsider the approach. You won't find long sequences such as chromosomes in FASTQ format, because FASTQ is typically used to represent sequencing reads, very often short reads. So I'd use FASTA instead, along with another tool that can compress it.
I know, but phylogenetic trees can be constructed with the help of compression programs. The results may be only approximate, but that suffices in my case. Details are here: http://www1.spms.ntu.edu.sg/~chenxin/paper/GIW99.pdf
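To make the idea concrete, here is a minimal sketch of one commonly used compression-based distance, the normalized compression distance (NCD). It is not necessarily the exact measure from the linked paper, and it uses Python's general-purpose `lzma` module as a stand-in for a DNA-specific compressor. The function names `compressed_size`, `ncd` and `pairwise_distances` are made up for this illustration.

```python
import lzma
from itertools import combinations
from typing import Dict, Tuple


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed input (stand-in compressor)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest the sequences share a lot of structure;
    values near 1 suggest they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pairwise_distances(seqs: Dict[str, bytes]) -> Dict[Tuple[str, str], float]:
    """NCD for every unordered pair of named sequences."""
    return {
        (a, b): ncd(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }
```

The resulting pairwise distances could then be passed to any distance-based tree-building method such as neighbor joining. Note that a general-purpose compressor only sees repeats that fit in its dictionary, so results on whole chromosomes may be cruder than with a dedicated DNA compressor.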
@neilfws: FASTQ is required since DSRC is able to compress only this format and it fails on any other. I'm completely aware of the fact that I'm doing a task that I don't completely understand but we all started somehow :-)
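For what it's worth, a FASTA record can be wrapped as a FASTQ record by attaching a placeholder quality string of the same length as the sequence. The sketch below does that in plain Python. The names `read_fasta` and `fasta_to_fastq` are hypothetical, the constant quality character `I` (Phred 40 in the Phred+33 encoding) is an arbitrary choice, and whether DSRC copes with chromosome-length records produced this way is an open assumption.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_fastq(fasta: TextIO, fastq: TextIO, quality_char: str = "I") -> None:
    """Write each FASTA record as a four-line FASTQ record.

    The quality line is a placeholder: one constant character per base.
    """
    for name, seq in read_fasta(fasta):
        fastq.write(f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n")
```

The fake qualities carry no information, so any compression-based comparison would have to account for the fact that they add the same kind of redundant payload to every file.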
@neilfws: Oh, thank you for the information! I know I can use general-purpose compression algorithms on FASTA files, but I would really like to find a specialized algorithm created for this purpose (I was silly enough to presume there would be many of these, but finding a paper together with an implementation is a nontrivial task for me). Could you please tell me whether you know of any such specialized algorithm (other than GenCompress and Biocompress)?