Question

Why is the quality of contigs better with a merged file than with separate files? Using Illumina and FastQC

1

10.1 years ago

margxenscienculo ▴ 50

Hi everyone. I have a question about how FastQC represents data. I have two cleaned Illumina files, 1.fq and 2.fq. When I analyzed them with FastQC, I saw that the first file does not have any contigs with a Phred score lower than 20. In contrast, the second file shows a lot of contigs with a Phred score lower than 20. However, when I merged both files into one, the merged file does not show any Phred score lower than 20. I have attached the images. Can somebody explain why this happens? Is it trustworthy?

Regards.

File 1

file 1 image

File 2

file 2 image

File All merged

fastqc illumina quality phred-score • 3.2k views

updated 2.8 years ago by Ram 45k • written 10.1 years ago by margxenscienculo ▴ 50

2

10.1 years ago

Antonio R. Franco ★ 5.2k

It is very likely that, with that overall quality of data, you can remove those bad sequences without any problem. Depending on the program you use and how you decide to do the trimming, it is possible that those bad sequences simply end up a little shorter.
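To illustrate how trimming can leave a read shorter instead of discarding it, here is a minimal sketch of a naive 3' trim. It assumes Phred+33 encoded quality strings and a cutoff of 20. The function name `trim_3prime` and the example read are hypothetical, and real trimming programs typically use more elaborate rules (for example, sliding windows).

```python
def trim_3prime(seq, qual, threshold=20, offset=33):
    """Drop trailing bases whose Phred score is below `threshold`.

    Assumes `qual` is a Phred+33 encoded string of the same length as `seq`.
    """
    end = len(seq)
    # Walk back from the 3' end while the last kept base is low quality.
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: eight good bases (Q40) followed by two poor ones (Q2, Q11).
seq = "ACGTACGTAC"
qual = "IIIIIIII#,"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, len(trimmed_seq))
```

The read is kept, only two bases shorter, so it still counts toward the total number of sequences.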

0

All files merged

all file merged image

All files after removing reads with a Phred score below 20 (I did this with a Biopython script)

all file without -20 phred image

I will assemble the reads with SOAPdenovo on an online supercomputer (database of Japan), because I do not have a supercomputer of my own. My computer, with only 4 GB of RAM, is not enough :). I will test the two files.

updated 2.8 years ago by Ram 45k • written 10.1 years ago by margxenscienculo ▴ 50

Ram · Accepted Answer · 2015-04-17

2

10.1 years ago

Martombo ★ 3.2k

Simply put, in these boxplots the lower whisker represents the 10th percentile, which means that 10% of the points are lower than that and are not displayed. When you merge the two datasets, the low-quality reads from file 2 are diluted by the good reads from file 1, so the 10th percentile of the merged file is higher than that of file 2, exactly as you see in your data. There is still a considerable number of sequences with bases of quality lower than 20 at the end.
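A small worked example may make the percentile effect concrete. The numbers below are invented for illustration (they are not taken from the files in the question), and the nearest-rank percentile used here is a simple definition that may differ from how FastQC computes its whiskers.

```python
def percentile(values, pct):
    """Nearest-rank percentile using integer arithmetic (simple definition)."""
    ordered = sorted(values)
    # Ceiling division: smallest rank covering at least pct% of the values.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical Phred scores at one late read position, 100 reads per file.
file1 = [15] * 2 + [35] * 98    # 2% of reads below Q20
file2 = [15] * 15 + [35] * 85   # 15% of reads below Q20
merged = file1 + file2          # 17 of 200 reads (8.5%) below Q20

p10 = {name: percentile(quals, 10)
       for name, quals in [("file 1", file1), ("file 2", file2), ("merged", merged)]}
low_in_merged = sum(1 for q in merged if q < 20)
print(p10, low_in_merged)
```

In this toy case the 10th percentile of file 2 sits below 20, while the merged file's 10th percentile sits above 20, even though the 17 low-quality reads are all still present in the merged data.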

1

Thanks. Do you think it is useful to remove these lower-quality sequences? I did it, and the number of sequences went down from 50M to 35M.

updated 2.8 years ago by Ram 45k • written 10.1 years ago by margxenscienculo ▴ 50

1

There's no general rule for this; it really depends on the dataset. Anyway, losing 30% of the reads is usually a bit too much, and it doesn't seem that you need such a strong quality selection. What program did you use for quality control? I would suggest relaxing its options a bit in order to save some more reads. Remember that a Phred quality score of 20 means that the estimated probability of a wrong call is 1%, so the base is still very likely to be correct.
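For reference, the relation between a Phred score Q and the estimated error probability is P = 10^(-Q/10). A minimal sketch of that conversion, together with the read loss mentioned above (the function name `phred_to_error` is just illustrative):

```python
import math


def phred_to_error(q):
    """Estimated probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error(q):.3f}")

# Fraction of reads lost when going from 50M to 35M sequences.
reads_before, reads_after = 50_000_000, 35_000_000
fraction_lost = (reads_before - reads_after) / reads_before
print(f"fraction of reads lost: {fraction_lost:.0%}")
```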

It would be useful to have a look at the "per sequence quality score" plot, just to know whether you have a lot of reads with very bad overall quality. It is also produced by FastQC.
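If you want to check this outside FastQC, a rough sketch along these lines computes the mean quality of each read. It assumes a plain four-lines-per-record FASTQ with Phred+33 encoding; the function name `mean_qualities` and the three example reads are made up, and in practice you would pass an open file handle (for example for 2.fq) instead of the in-memory example.

```python
import io


def mean_qualities(handle, offset=33):
    """Yield the mean Phred score of each read in a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        handle.readline()              # sequence line (not needed here)
        handle.readline()              # '+' separator line
        qual = handle.readline().strip()
        if qual:
            yield sum(ord(c) - offset for c in qual) / len(qual)


# Hypothetical three-read FASTQ used only to show the idea.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n####\n"
    "@read3\nACGT\n+\nII55\n"
)
means = list(mean_qualities(example))
low = sum(1 for m in means if m < 20)
print(means, f"{low} of {len(means)} reads have a mean quality below 20")
```

If only a small share of reads have a low mean quality, dropping just those (or trimming their ends) should cost far fewer reads than removing every read that contains any base below 20.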

10.1 years ago by Martombo ★ 3.2k