Hi @ll!
I have a question about how the Illumina pipeline sets the quality check (filter) status in qseq files (the 11th column, according to the information here: http://jumpgate.caltech.edu/wiki/QSeq).
Please take a look at this (representative) example (I've removed the machine ID):
1st paired-end read: HWUSI-XXXXXX 11 7 120 19847 19200 0 1 .AATGATATAGAATGGAATTGAATGGAATGTGCGTGAATGGAATG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1
2nd paired-end read: HWUSI-XXXXXX 11 7 120 19847 19200 0 3 TCTATTCCTTTCTAATCCATTCAATTCCATTTCATTCGATTCCAT hfhghhcghhhghhhfff]ffhgchhhcghfheehdfdfafffff 1
According to my interpretation of the qseq data format, the 1st paired-end read has passed Illumina QC ("1" in the last column of the line), even though every base has the quality character "B" (Phred score 2), which would suggest that the whole read should be disregarded. How is it then possible that this read passed QC? It is one mate of a paired-end read, and the matching read from the second file has also passed QC and has much better Phred scores overall (see above). Could this be the reason? I.e., does the Illumina pipeline consider the "overall" quality of both mates when the read is paired-end?
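For reference, here is a minimal sketch of how the quality strings above translate into Phred scores. It assumes the qseq quality column uses the Illumina Phred+64 ASCII offset; the function name `decode_quality` is just an illustrative helper, not part of any Illumina tool.

```python
# Assumption: qseq quality characters are encoded as Phred score + 64.
PHRED_OFFSET = 64


def decode_quality(qual, offset=PHRED_OFFSET):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


# Quality strings copied from the two example records above.
read1_qual = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
read2_qual = "hfhghhcghhhghhhfff]ffhgchhhcghfheehdfdfafffff"

read1_scores = decode_quality(read1_qual)
read2_scores = decode_quality(read2_qual)

# Every base of read 1 decodes to the same score.
print(sorted(set(read1_scores)))             # [2]
# Lowest and highest score in read 2.
print(min(read2_scores), max(read2_scores))  # 29 40
```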
My issue is that nearly 10% of the reads fall into this category (QC passed, yet Bs for all positions). At this stage I am planning to remove these reads prior to alignment, but I would appreciate some comments/answers from people who have seen similar reads in their experiments.
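The removal step I have in mind could look roughly like the sketch below. It assumes qseq records are tab-delimited with 11 columns (quality string in column 10, filter flag in column 11); the helper names `is_all_b` and `filter_all_b` are made up for illustration.

```python
import io


def is_all_b(qual):
    """True if the quality string is non-empty and consists only of 'B'."""
    return len(qual) > 0 and set(qual) == {"B"}


def filter_all_b(in_handle, out_handle):
    """Copy qseq records to out_handle, skipping reads whose qualities are all 'B'.

    Assumes tab-delimited qseq lines with exactly 11 columns.
    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    for line in in_handle:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 11:
            raise ValueError("expected 11 qseq columns, got %d" % len(fields))
        if is_all_b(fields[9]):  # column 10 holds the quality string
            dropped += 1
        else:
            out_handle.write(line)
            kept += 1
    return kept, dropped


# Demo using the two example records from this post.
rec1 = ["HWUSI-XXXXXX", "11", "7", "120", "19847", "19200", "0", "1",
        ".AATGATATAGAATGGAATTGAATGGAATGTGCGTGAATGGAATG",
        "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", "1"]
rec2 = ["HWUSI-XXXXXX", "11", "7", "120", "19847", "19200", "0", "3",
        "TCTATTCCTTTCTAATCCATTCAATTCCATTTCATTCGATTCCAT",
        "hfhghhcghhhghhhfff]ffhgchhhcghfheehdfdfafffff", "1"]
demo = "\t".join(rec1) + "\n" + "\t".join(rec2) + "\n"

out = io.StringIO()
kept, dropped = filter_all_b(io.StringIO(demo), out)
print(kept, dropped)  # 1 1
```

Since the two mates of a pair sit in separate qseq files, dropping a record from one file would also require dropping its partner from the other file if the aligner expects the files to stay in sync.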
Thanks in advance!
Thanks, but I've read this paragraph several times now, and it does not help me understand how a read can pass QC if all bases have a Phred score of "2". If this is a read segment quality control indicator, then - naively - I would assume that a read with only "2"/"B" scores would not pass QC. Even more confusingly, there are reads that have only "2"/"B" Phred scores and in fact do NOT pass QC (0 in column 11).
I'm no expert on the criteria the Illumina pipeline applies within its QC, but I thought the important part of the passage I quoted was "This Q2 indicator [...] indicates that a specific final portion of the read should not be used in further analyses." I.e., you would be doing the right thing in removing those reads.
The other point of the quote is that all the Bs do NOT mean the quality was extremely low, just that it was "mostly Q15 or below". The QC filtering as it currently works wouldn't necessarily be expected to filter these reads out (although I agree that it might be helpful for them to modify it so that anything that "should not be used in further analyses" is filtered out).
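If you would rather trim than discard, the "final portion of the read should not be used" reading can be sketched as below. This is only an illustration under the assumption that the indicator shows up as a run of "B" characters at the end of the quality string; `trim_b_tail` is a hypothetical helper and the example reads are made up.

```python
def trim_b_tail(seq, qual, marker="B"):
    """Remove the trailing run of `marker` qualities and the matching bases.

    Returns the trimmed (seq, qual) pair. A read whose qualities are all
    `marker` comes back as two empty strings.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    keep = len(qual.rstrip(marker))  # number of bases before the B tail
    return seq[:keep], qual[:keep]


# Made-up example reads, not taken from real data.
print(trim_b_tail("ACGTACGT", "hhhgBBBB"))  # ('ACGT', 'hhhg')
print(trim_b_tail("ACGTACGT", "BBBBBBBB"))  # ('', '')
print(trim_b_tail("ACGTACGT", "hhBBhhhh"))  # ('ACGTACGT', 'hhBBhhhh')
```

Reads that come back empty (or shorter than whatever minimum length your aligner needs) can then be dropped, which for all-B reads gives the same result as removing them outright.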