When quality trimming raw reads to a Phred score with one of the many tools out there (I prefer bbduk), there is often conflicting advice about what Q score one should trim to.
I have read many of Brian Bushnell's (the mind behind BBTools) posts and replies, and from what I gather there really is no right answer, as it depends on what you want to do.
Say, for example, I have some Illumina short reads and some PacBio long reads that I want to do the following with:
- Hybrid assembly
- Illumina-only de novo assembly
- PacBio-only de novo assembly
I have previously been told that Q30 is a very typical threshold to use; however, I have read that anything above 27 is unnecessary and potentially damaging to generating good assemblies. On the bbduk help pages on SEQanswers, Q10 is used in every example.
So is Q10 a good general starting point for quality trimming to get decent assemblies, or is this also too high?
How would one work out the above on their own, without needing to come here for answers?
Secondly, should we trim PacBio reads in either a hybrid assembly or a PacBio-only de novo assembly?
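One way to get a feel for these thresholds is to translate them into error probabilities. A Phred score Q corresponds to a per-base error probability of 10^(-Q/10), so Q10 means a 1 in 10 chance that the base call is wrong, and Q30 means 1 in 1000. The snippet below is only a minimal sketch of that arithmetic; the function names are made up for illustration, and it assumes the usual Phred+33 quality encoding of current Illumina FASTQ files.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def expected_errors(quals):
    """Sum of per-base error probabilities: the expected number of wrong calls."""
    return sum(phred_to_error_prob(q) for q in quals)


if __name__ == "__main__":
    # Thresholds mentioned in this thread.
    for q in (10, 27, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")

    # A made-up 150 bp read where every base is Q30.
    print(f"expected errors in 150 bases at Q30: {expected_errors([30] * 150):.2f}")
```

Seen this way, the choice of threshold is a trade-off between how many errors you are willing to pass on to the assembler and how much sequence you are willing to throw away.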
@genomax Thank you very much for the response. It makes sense to me that for de novo assembly one should trim to a higher Phred score. Would 30 be an acceptable middle ground?
For PacBio reads, should you also trim based on quality?
Q30 should be a good compromise for de novo work. If your data has sequencing anomalies (e.g. regions of low scores in the middle of reads, perhaps due to a bubble in the lane), then using a windowed Q score average or filterbytile.sh may be safer. Ideally you should discard such data but ...
I have not worked with PacBio data recently, but for the latest Sequel II data the following is noted.
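Coming back to the windowed Q score average mentioned earlier in this reply: the idea is to judge a short stretch of bases together rather than each base alone, so a single bad call does not end the read but a run of poor calls does. The code below is a generic sliding-window sketch and not how bbduk or any other specific tool implements it. The function name, the window size of 4 and the threshold of 30 are all assumptions chosen for illustration.

```python
def window_trim(seq, quals, window=4, min_mean_q=30):
    """Cut a read at the start of the first window whose mean quality is too low.

    seq   -- the base sequence as a string
    quals -- per-base Phred scores as a list of ints, same length as seq
    Returns the trimmed (seq, quals) pair.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")

    n = len(quals)
    if n < window:
        # Read shorter than one window: judge it as a whole.
        if n > 0 and sum(quals) / n >= min_mean_q:
            return seq, quals
        return "", []

    # Slide the window from the 5' end towards the 3' end.
    for start in range(n - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q < min_mean_q:
            return seq[:start], quals[:start]

    # No window fell below the threshold, keep the read as it is.
    return seq, quals


if __name__ == "__main__":
    # Made-up read whose quality collapses towards the 3' end.
    read = "ACGTACGTAC"
    scores = [38, 38, 38, 38, 36, 35, 12, 10, 8, 5]
    print(window_trim(read, scores))
```

Note that a read trimmed this way comes out shorter than it went in, so the length distribution of the data set changes after trimming.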
Thank you very much. I have just tried a Q30 filtering run and got back some quite weird results when looking at FastQC.
Pre filtering
Post filtering
How so? Before filtering all of your reads were 150 bp (FastQC just plots them like that). After filtering you have a range of sizes. You will want to implement some length filtering to eliminate short reads since they may not be very useful for the assembly (which is the next step?).
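As a rough illustration of the length filtering suggested above, here is a small sketch that drops reads shorter than a cutoff after trimming. Trimming tools usually offer this as a built-in option (bbduk has a minimum length setting, for example), so you would not normally write it yourself. The helper names and the cutoff of 50 bp are assumptions for the example, and the reader only handles simple four-line FASTQ records.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual


def filter_by_length(records, min_len=50):
    """Keep only records whose sequence is at least min_len bases long."""
    for header, seq, qual in records:
        if len(seq) >= min_len:
            yield header, seq, qual


if __name__ == "__main__":
    # Two made-up reads: one untouched 150 bp read, one trimmed down to 20 bp.
    demo = io.StringIO(
        "@read1\n" + "A" * 150 + "\n+\n" + "I" * 150 + "\n"
        "@read2\n" + "ACGT" * 5 + "\n+\n" + "I" * 20 + "\n"
    )
    for header, seq, qual in filter_by_length(read_fastq(demo), min_len=50):
        print(header, len(seq))
```

What counts as too short depends on the assembler and the k-mer sizes it uses, so the cutoff is something to choose with the next step in mind.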
Ahh okay, so that is normal. I was not aware that was the case, thank you! I just assumed it binned the whole read if it had low-quality bases in it.
Is it unusual to get a warning on per base sequence content? I have approximately a 20% difference between AT and CT.