Filtering Read Quality - Cant remove all reads that have a single base below a threshold
0
0
Entering edit mode
6.0 years ago
tbb21 ▴ 10

Hi Biostars, I am new to handling sequencing data so this is my first time posting here!

So I am trying to remove all reads with any of their base pairs below a threshold quality score, here 32.

I have tried using the FASTQ quality_filter using fastq_quality_filter -q 32 -p 100 -o output.fastq.gz but when I plot this with FastQC my plot has boxplots with terrible quality score values. When I lower the -p to something like 95 then the quality scores are very good.

Here is the -p 100 plot:

p100

Here is the -p 95 plot:

p95

Why is this?

I then got tired of the FASTQ package with its lacking documentation and downloaded prinseq (using Bioconda). I tried using this command prinseq-lite.pl -fastq stdin -min_qual_score 32 -out_good null -out_bad stdout | gzip > output.fastq.gz Which is meant to filter any read that has at least one base with a quality score below 32. (At first I tried saving only the -out_good but it seems that what is "good" is what passed the filter and is therefore all the reads with a min_qual_score below 32, so now I am taking the -out_bad).

Here is the "Good":

Good

Here is the "Bad":

Bad

But this still fails me and creates yet another boxplot that has some bases with whiskers that go below 32.

I have no idea why none of my attempts are working to achieve my goal and would love help to know why.

Moreover, I have no idea what the "gold standard" for quality of sequencing reads is. If my currently plan is really dumb I'd love to know what would be better for me to do.

Thanks again in advance for any help!

filter fastx quality-score sequencing • 2.7k views
ADD COMMENT
0
Entering edit mode

single read below a threshold or it is single base (in title)?

ADD REPLY
0
Entering edit mode

Sorry its a single base. I have corrected this.

ADD REPLY
0
Entering edit mode

Please use these instructions to add images to your original post from your FastQC scans so we can see what your plots look like: How to add images to a Biostars post

There is no gold standard for Q scores. It all depends on what you need to do. In general, if you are aligning to a good reference genome then you can get away without doing any Q-score filtering. Q scores down to Q10-15 will work in this case.

If you are planning on doing de novo assembly work then it would be justifiable to use Q-score filtering at Q25 and above.

ADD REPLY
0
Entering edit mode

Thanks re the gold standard and I have now saved and uploaded all of the relevant plots!

ADD REPLY
0
Entering edit mode

Have you scanned/trimmed this data for presence of Illumina adapter sequences? It is possible that you have a few short reads and the whiskers may represent those bases (which generally will get low Q scores). In general your data has median Q scores of >32 as expected. Is this a new dataset with Sanger encoding or are you using something downloaded from SRA?

I also recommend that you take a look at bbduk.sh program from BBMap suite. It is a scan/trim program among other things and has easily understandable options.

ADD REPLY

Login before adding your answer.

Traffic: 1715 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6