Hello all,
I have observed an anomaly in all Illumina 1.5 pipeline data. I use Biopieces www.biopieces.org) for trimming my data - and trim_seq basically removes residues below a given threshold from the ends. When I plot the length distribution after trimming I observe peaks for every 5 residues.
readfastq -n 1000000 -i in.fastq | trimseq | plotlendist -k SEQLEN -x
Length Distribution
12000 ++----------------------------------------------------------------++
| |
| |
10000 ++ **+
| **|
| **|
8000 ++ **+
| **|
6000 ++ ** **+
| ** **|
| ** **|
4000 ++ * *** **+
| * *** ***|
| ** *** ***** ***|
2000 ++ ** *** **** ***********+
| ** *** ***********************|
|** ***** **** ******************************************|
0 +******************************************************************+
+ + + + + + + + + + +
0 5 10 15 20 25 30 35 40 45 50
Forensics indicate that the presence of B in the quality scores is the problem. If I remove all records containing any B then the anomaly disappears:
readfastq -n 1000000 -i in.fastq | grab -p B -k SCORES -i | trimseq | plotlendist -k SEQLEN -x
Length Distribution
4000 ++-----------------------------------------------------------------++
| **|
3500 ++ **+
| **|
3000 ++ **+
| **|
2500 ++ **+
| **|
2000 ++ **+
| **|
| **|
1500 ++ **+
| **|
1000 ++ **+
| **|
500 ++ **+
| ***|
0 ++------+------+-----+------+------+-----+------+------+-----+******+
+ + + + + + + + + + +
0 5 10 15 20 25 30 35 40 45 50
And for all records with B's:
readfastq -n 1000000 -i in.fastq | grab -p B -k SCORES | trimseq | plotlendist -k SEQLEN -x
Length Distribution
7000 ++-----------------------------------------------------------------++
| ** |
6000 ++ ** **+
| ** **|
| ** **|
5000 ++ ** **+
| ** ** **|
4000 ++ ** *** **+
| ** *** ***|
| ** *** ***|
3000 ++ * *** ***** ***+
| * *** ***** ****|
2000 ++ ** *** *** ***********+
| ** *** *** ****************|
| *** *** ***********************|
1000 +** ** ** *** ************************************+
|** ********************************************************|
0 +*******************************************************************+
+ + + + + + + + + + +
0 5 10 15 20 25 30 35 40 45 50
Now, according to Wikipedia:
http://en.wikipedia.org/wiki/FASTQ_format#Encoding
and the docs I have been able to find (page 32):
http://www.scribd.com/doc/48889532/CASAVA1-7-User-Guide-15011196-A
B or Q2 is used as an indicator that a sequence residue quality is substandard, but don't really have a quality score. trim_seq will regard B as Q2 and discard the residue - and to the best of my understanding - that is OK.
But I don't understand the cyclic behaviour I observer. 10-20% of all records contain a B, so I will loose a lot of data by filtering those reads.
Anyone?
(and why do Illumina keep changing FASTQ encoding?)
Cheers,
Martin
To be safe the correct thing would be to discard reads with B in the quality scores. But I would like to understand the problem - I am baffled about the cyclic anomaly. Fully understanding this may allow me to simple trim these reads instead of discarding them saving up to 20% of my data!
Have you tried FastQc? May be it gives you a little bit more info about your problem.
On the other hand, if it's bad data it will only spoil you analysis, so you'd better filter it out. You might just have a bad run from the sequencer (there is no way to recover from that). In an extreme case you might just need to do the experiment again.
I contacted a friend. He says that B does not necessarily indicates a bad residue read. And that the insertions of B's can be skipped using some flag in GERALD or the bcl converter. However, that does not clarify the meaning of the Bs - or especially the fact that they mostly occur every 5 nucleotides. I will test FastQc to see what they do.
Tested FastQc - doesn't tell me anything.