Question

Quality trimming by Phred score: how far is too far?

4.1 years ago

robert.murphy ▴ 90

When quality trimming raw reads to a Phred score with any of the numerous tools out there (I prefer bbduk), there is often conflicting advice about what Q score one should filter to.

I have read many of Brian Bushnell's (the mind behind bbtools) posts and replies and from what I gather there really is no right answer as it depends on what you want to do.

Say, for example, I have some Illumina short reads and some PacBio long reads that I want to use for the following:

Hybrid assembly
Illumina-only de novo assembly
PacBio-only de novo assembly

I have previously been told that Q30 is a typical threshold; however, I have read that anything above Q27 is unnecessary and potentially damaging to generating good assemblies. On the bbduk help pages on SEQanswers, Q10 is used in every example.

So is Q10 a good general starting point for quality trimming to get decent assemblies, or is this also too high?

How would one work this out on their own without needing to come here for answers?

Secondly, should we trim PacBio reads in either a hybrid assembly or PacBio de novo assembly?

Assembly sequencing • 2.0k views

ADD COMMENT • link updated 4.1 years ago by GenoMax 147k • written 4.1 years ago by robert.murphy ▴ 90

score 2 · Accepted Answer · 2020-10-29

4.1 years ago

GenoMax 147k

I gather there really is no right answer as it depends on what you want to do.

That is correct. Every dataset is going to have overall characteristics that are dependent on the quality of the source DNA and libraries (and to some extent sequencing). I am not sure if those can be completely captured in a QC report. We have accumulated enough expertise over the years that quality of sequencing should be more or less out of the equation, if the libraries made were of good quality.

If you are working on a de novo genome then it is appropriate to trim at a higher threshold, Q25 or above. If you have a reference genome available to compare against, then relaxing that threshold may be fine.
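For context on what these thresholds mean in practice: a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below converts a few of the thresholds mentioned in this thread and estimates the expected number of wrong bases in a 150 bp read, under the simplifying assumption that every base sits exactly at the threshold. The function names are made up for illustration.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def expected_errors(read_length: int, q: float) -> float:
    """Expected number of wrong bases if every base had exactly quality q (an assumption)."""
    return read_length * phred_to_error_prob(q)


if __name__ == "__main__":
    for q in (10, 20, 25, 27, 30):
        p = phred_to_error_prob(q)
        errs = expected_errors(150, q)
        print(f"Q{q}: error probability {p:.4f}, ~{errs:.2f} expected errors per 150 bp read")
```

Real reads have a mix of qualities, so this is only a rough way to compare thresholds, not a prediction for any particular dataset.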

we trim pacbio reads in either a hybrid assembly or pacbio de novo assembly?

Definitely trim out barcodes or any other extraneous sequence. Someone else will have to chime in on specifics.

ADD COMMENT • link 4.1 years ago by GenoMax 147k

@genomax Thank you very much for the response. It makes sense to me that for de novo assembly one should trim to a higher Phred score. Would 30 be an acceptable middle ground?

For PacBio reads, should you also trim based on quality?

ADD REPLY • link 4.1 years ago by robert.murphy ▴ 90

Q30 should be a good compromise for de novo work. If your data has sequencing anomalies (e.g. regions of low scores in the middle of reads, perhaps due to a bubble in the lane) then using a windowed Q score average or filterbytile.sh may be safer. Ideally you should discard such data but ...

I have not worked with PacBio data recently, but for the latest Sequel II data the following is noted:
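To illustrate what a windowed Q score average means, here is a minimal sketch of a generic sliding-window trim. It is not bbduk's implementation; the function names, the window size of 4 and the Phred+33 encoding are all assumptions made for the example. The read is cut at the start of the first window whose mean quality falls below the threshold.

```python
def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq: str, qual: str, window: int = 4, min_mean_q: float = 30.0):
    """Cut the read at the start of the first window whose mean quality is below min_mean_q.

    Generic illustration only; window size and threshold are arbitrary choices.
    """
    scores = decode_phred33(qual)
    if not scores:
        return seq, qual
    # Reads shorter than the window are evaluated as a single, shorter window.
    last_start = max(len(scores) - window, 0)
    for start in range(last_start + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    trimmed = sliding_window_trim("ACGTACGTACGT", "IIIIIIII####")
    print(trimmed)  # ('ACGTAC', 'IIIIII')
```

Because the cut happens at the start of the failing window, a couple of good bases just before a low-quality stretch can be lost; that is the trade-off for being less sensitive to a single bad base.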

Please note that raw data quality scores are the same for all bases of the Sequel raw data (PHRED 0 — ASCII !). PacBio came to the conclusion that computing the quality scores for the raw data was a waste of time. Apparently the quality scores for the raw data cannot be reliably computed (and consequently these were also ignored for RSII data pipelines).
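As a small illustration of the "PHRED 0, ASCII !" remark: under the Phred+33 encoding, the character `!` (ASCII 33) decodes to a score of 0. The sketch below decodes a quality string and checks whether it carries more than one distinct value; the helper names are hypothetical. A read whose quality string is all `!` gives a quality trimmer nothing to distinguish good bases from bad ones.

```python
def decode_phred33(quality_string: str) -> list[int]:
    """Decode a quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def has_informative_qualities(quality_string: str) -> bool:
    """True if the read carries more than one distinct quality value."""
    return len(set(quality_string)) > 1


if __name__ == "__main__":
    print(decode_phred33("!!!!!"))             # [0, 0, 0, 0, 0]
    print(decode_phred33("5?I"))               # [20, 30, 40]
    print(has_informative_qualities("!!!!!"))  # False
```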

ADD REPLY • link 4.1 years ago by GenoMax 147k

Thank you very much. I have just tried a Q30 filtering run and get back some quite weird results when looking at fastQC.

[Images: FastQC plots, pre-filtering and post-filtering]

ADD REPLY • link 4.1 years ago by robert.murphy ▴ 90

How so? Before filtering all of your reads were 150 bp (FastQC just plots them like that). After filtering you have a range of sizes. You will want to implement some length filtering to eliminate short reads since they may not be very useful for the assembly (which is the next step?).
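A minimal sketch of the length-filtering idea, assuming reads are held as (name, sequence, quality) tuples after trimming. The 50 bp cutoff and the function names are arbitrary choices for illustration, not a recommendation; in practice the trimming tool itself would normally apply a minimum-length setting.

```python
from collections import Counter


def length_histogram(reads):
    """Count how many reads there are of each length (what a length distribution plot summarizes)."""
    return Counter(len(seq) for _name, seq, _qual in reads)


def filter_by_length(reads, min_length=50):
    """Keep only reads whose sequence is at least min_length bases long.

    min_length=50 is an arbitrary example value.
    """
    return [read for read in reads if len(read[1]) >= min_length]


if __name__ == "__main__":
    # Toy reads: one untrimmed 150 bp read and two that were shortened by trimming.
    reads = [
        ("read1", "A" * 150, "I" * 150),
        ("read2", "C" * 120, "I" * 120),
        ("read3", "G" * 30, "I" * 30),
    ]
    print(sorted(length_histogram(reads).items()))
    kept = filter_by_length(reads, min_length=50)
    print([name for name, _seq, _qual in kept])
```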

ADD REPLY • link 4.1 years ago by GenoMax 147k

Ahh okay, so that is normal; I was not aware that was the case. Thank you! I just assumed it binned the whole read if it had low-quality bases in it.

ADD REPLY • link 4.1 years ago by robert.murphy ▴ 90

Is it unusual to get a warning on per base sequence content? I have approx. a 20% difference between AT and CT.

ADD REPLY • link 4.1 years ago by robert.murphy ▴ 90