Question

Fastq Format Redundancy

11.8 years ago

Irsan ★ 7.8k

Each read in a fastq file takes up 4 consecutive lines (read id, read sequence, qual id and qual string). What do you need the qual id for? Isn't the read id enough for identification? Besides, in most fastq files (if not all?) the qual id is just "+".
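
For reference, the four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes every record occupies exactly four unwrapped lines, and the names `FastqRecord` and `parse_fastq` are made up for illustration, not taken from any library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str   # text after '@' on line 1
    sequence: str  # line 2
    qual_id: str   # text after '+' on line 3 (usually empty)
    quality: str   # line 4


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines (assumes unwrapped records)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        plus = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


example = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "GGCA\n", "+read2\n", "!!#I\n",
]
records = list(parse_fastq(example))
```

Both forms of the third line are accepted here: a bare "+" gives an empty `qual_id`, while "+read2" repeats the read id.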

fastq • 3.0k views

updated 11.8 years ago by SES 8.6k • written 11.8 years ago by Irsan ★ 7.8k

Yup, it would be better to have something like:

Read_ID \t read_sequence \t qual_sequence
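
A rough sketch of converting standard four-line records into that three-column layout, assuming unwrapped records and read ids that contain no tab characters (`fastq_to_tsv` is a hypothetical helper name):

```python
def fastq_to_tsv(lines):
    """Collapse each 4-line FASTQ record into 'id<TAB>sequence<TAB>quality'."""
    rows = []
    record = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            header, sequence, _plus, quality = record
            # Drop the leading '@' and discard the '+' line entirely.
            rows.append("\t".join((header[1:], sequence, quality)))
            record = []
    return rows


rows = fastq_to_tsv(["@r1\n", "ACGT\n", "+\n", "IIII\n"])
```

The "+" line is simply dropped, so nothing is lost as long as it carried no information beyond the read id.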

11.8 years ago by Ashutosh Pandey 12k

I personally like that there are 4 lines, as I prefer dividing by 4 to dividing by 3. If I grab the first 100/1000/10,000 lines, I immediately know how many sequences there are. Or maybe I am just so used to 4 lines that I cannot step back and admit that 3 would be better :))
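
The arithmetic is just an integer division by 4, sketched below under the assumption that every record is exactly four lines (`reads_in_head` is an illustrative name):

```python
def reads_in_head(line_count):
    """Return (complete reads, leftover lines) for the first `line_count` lines."""
    return divmod(line_count, 4)


for n in (100, 1000, 10_000):
    complete, leftover = reads_in_head(n)
    print(f"{n} lines -> {complete} reads, {leftover} leftover lines")
```

So 100, 1000 and 10,000 lines hold 25, 250 and 2500 complete reads respectively.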

11.8 years ago by Biomonika (Noolean) 3.2k

There's no reason I know of offhand that fastqs can't be completely supplanted by bam files. Put all your reads in there (marked unmapped), and you've got everything you need. Many aligners can use bam as input these days.

11.8 years ago by Chris Miller 22k

Alas, there is always another catch.

The BAM SEQ column displays sequences as they align on the forward strand, so sequences aligned on the reverse strand need to be reverse complemented to obtain the actual data. In addition, hard clipping is a valid alignment representation, but it also means that some of the original information is lost. And if we consider spliced alignments, getting back the original data is probably even more convoluted, especially since there is also read pairing to keep track of.
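
To illustrate the strand issue: in SAM/BAM, FLAG bit 0x10 marks a read aligned to the reverse strand, and for such reads the stored SEQ is the reverse complement of what the sequencer reported, with QUAL reversed. A minimal sketch of undoing that, in plain Python with no BAM library (`original_read` is a hypothetical helper, and it cannot restore hard-clipped bases):

```python
# Translation table for complementing bases; anything else is left unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented


def original_read(seq, qual, flag):
    """Return (sequence, quality) in the orientation the sequencer produced."""
    if flag & REVERSE_STRAND:
        return seq.translate(COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


forward = original_read("AACG", "ABCD", 0)
reverse = original_read("AACG", "ABCD", 16)
```

Here `forward` comes back unchanged, while `reverse` is `("CGTT", "DCBA")`.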

11.8 years ago by Istvan Albert 103k

Picard SamToFastq will take care of the strandedness problem if you need to recreate the original fastqs. If all you're doing is storing raw reads, the hard clipping info won't be an issue either (some would argue that you shouldn't be doing hard clipping anyway). Yes, I agree that the spliced alignments would be a pain, but not intractable - just need one brave soul to write the tool so the rest of us can use it :)

11.8 years ago by Chris Miller 22k

score 1 · Answer 1 · 2013-11-09

As described in this paper:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010)

"The FASTQ format was invented at the turn of the century at the Wellcome Trust Sanger Institute by Jim Mullikin, gradually disseminated, but never formally documented (Antony V. Cox, Sanger Institute, personal communication 2009)."

So we can't be all that surprised that the format has some unspecified characteristics.

This prompted me to look up more information on Jim Mullikin; it turns out he is a Director at the NIH Intramural Sequencing Center.

score 0 · Answer 2 · 2013-11-10

I think it's there to serve as a delimiter. One of the most common issues we have to deal with is line endings getting mangled when files move between different operating systems. When Fasta files get messed up because of this, you can at least tell where a sequence ends because the next header starts with the greater-than sign. It would be chaotic if not for that delimiter. Likewise, I think it would be more difficult to find where the sequence ends and the quality line starts without the '+' delimiter.
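
A sketch of how a parser might lean on the '+' line when sequences are wrapped over several lines: sequence lines are collected until a line starting with '+', then quality lines are collected until the quality string is as long as the sequence. The length check is needed because '@' and '+' are themselves valid quality characters. `parse_wrapped_fastq` is an illustrative name, and the code assumes the whole input fits in memory.

```python
def parse_wrapped_fastq(lines):
    """Parse FASTQ whose sequence/quality may span several lines."""
    # Strip CR/LF so Windows and Unix line endings are handled alike.
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    records = []
    i = 0
    while i < len(stripped):
        header = stripped[i]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        i += 1
        # Sequence runs until the '+' delimiter line.
        seq_parts = []
        while i < len(stripped) and not stripped[i].startswith("+"):
            seq_parts.append(stripped[i])
            i += 1
        if i == len(stripped):
            raise ValueError("missing '+' separator line")
        i += 1  # skip the '+' line itself
        sequence = "".join(seq_parts)
        # Quality runs until it matches the sequence length.
        qual_parts = []
        qual_len = 0
        while i < len(stripped) and qual_len < len(sequence):
            qual_parts.append(stripped[i])
            qual_len += len(stripped[i])
            i += 1
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError(f"quality length mismatch for {header!r}")
        records.append((header[1:], sequence, quality))
    return records


wrapped = [
    "@r1\r\n", "ACGT\r\n", "AC\r\n", "+\r\n", "@III\r\n", "+I\r\n",
    "@r2\n", "GG\n", "+\n", "II\n",
]
parsed = parse_wrapped_fastq(wrapped)
```

In the example, the quality lines of `r1` begin with '@' and '+', yet the record boundaries are still found because the '+' line marks the end of the sequence and the sequence length bounds the quality string.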