Hi everyone. I have some questions about how FastQC represents data. I have two cleaned Illumina files, 1.fq and 2.fq. When I analyzed them with FastQC, the first file showed no sequences with a Phred score below 20. In contrast, the second file showed many sequences with a Phred score below 20. However, when I merged both files into one, the merged file showed no Phred scores below 20. I have attached the images. Can somebody explain why this happens? Can the result be trusted?
Regards.
File 1
File 2
Merged file
Thanks. So, do you think it is useful to remove these lower-quality sequences? I did, and the count dropped from 50M sequences to 35M.
There's no general rule for this; it really depends on the dataset. Still, losing 30% of the reads is usually a bit too much, and it doesn't seem that you need such strict quality filtering. What program did you use for quality control? I would suggest relaxing its options a bit in order to keep more reads. Remember that a Phred quality score of 20 means the estimated probability of a wrong base call is 1%, so the call is still very likely to be correct.
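To make the numbers concrete, here is a minimal sketch of the standard Phred relation P = 10^(-Q/10), together with the fraction of reads lost in the filtering step. The function names are just illustrative and are not part of FastQC or of any trimming tool.

```python
def phred_to_error_prob(q):
    """Estimated probability of a wrong base call for Phred score q: 10^(-q/10)."""
    return 10 ** (-q / 10)


def fraction_lost(reads_before, reads_after):
    """Fraction of reads removed by a filtering step."""
    return (reads_before - reads_after) / reads_before


if __name__ == "__main__":
    # Error probability for a few common Phred thresholds.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability = {phred_to_error_prob(q):.4%}")

    # Read counts quoted in this thread: 50M before filtering, 35M after.
    lost = fraction_lost(50_000_000, 35_000_000)
    print(f"Reads lost by filtering: {lost:.0%}")
```

A base at Q20 is expected to be right 99 times out of 100, which is why discarding everything below that threshold can be stricter than necessary.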
It would also be useful to look at the "Per sequence quality scores" plot, which FastQC also produces, to see whether you have a lot of reads with very bad overall quality.
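If you want to check the same thing outside FastQC, the rough sketch below computes the mean quality of each read. It assumes a plain four-line-per-record FASTQ with Phred+33 encoding (the usual case for recent Illumina data, but check your files), and the tiny in-memory FASTQ is made up purely for illustration. It is not how FastQC itself is implemented.

```python
import io


def mean_read_quality(quality_string, offset=33):
    """Mean Phred score of one read, assuming ASCII offset 33 (Phred+33)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def per_read_mean_qualities(handle, offset=33):
    """Yield the mean quality of each read from a 4-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the fourth line of every record holds the qualities
            qual = line.rstrip("\n")
            if qual:  # skip empty quality lines
                yield mean_read_quality(qual, offset)


def fraction_below(handle, threshold=20, offset=33):
    """Fraction of reads whose mean quality is below the threshold."""
    means = list(per_read_mean_qualities(handle, offset))
    return sum(m < threshold for m in means) / len(means) if means else 0.0


# Made-up toy data: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
TOY_FASTQ = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n5555\n@r3\nACGT\n+\n####\n"

if __name__ == "__main__":
    print(list(per_read_mean_qualities(io.StringIO(TOY_FASTQ))))
    print(fraction_below(io.StringIO(TOY_FASTQ), threshold=20))
```

For a real file you would pass `open("1.fq")` instead of the `io.StringIO` object. If only a small fraction of reads has a low mean quality, a milder filter should be enough.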