Hi Biostars, I am new to handling sequencing data so this is my first time posting here!
So I am trying to remove all reads with any of their base pairs below a threshold quality score, here 32.
I have tried using the FASTQ quality_filter using fastq_quality_filter -q 32 -p 100 -o output.fastq.gz
but when I plot this with FastQC my plot has boxplots with terrible quality score values. When I lower the -p
to something like 95 then the quality scores are very good.
Here is the -p 100
plot:
Here is the -p 95
plot:
Why is this?
I then got tired of the FASTQ package with its lacking documentation and downloaded prinseq (using Bioconda). I tried using this command prinseq-lite.pl -fastq stdin -min_qual_score 32 -out_good null -out_bad stdout | gzip > output.fastq.gz
Which is meant to filter any read that has at least one base with a quality score below 32. (At first I tried saving only the -out_good
but it seems that what is "good" is what passed the filter and is therefore all the reads with a min_qual_score below 32, so now I am taking the -out_bad
).
Here is the "Good":
Here is the "Bad":
But this still fails me and creates yet another boxplot that has some bases with whiskers that go below 32.
I have no idea why none of my attempts are working to achieve my goal and would love help to know why.
Moreover, I have no idea what the "gold standard" for quality of sequencing reads is. If my currently plan is really dumb I'd love to know what would be better for me to do.
Thanks again in advance for any help!
single read below a threshold or it is single base (in title)?
Sorry its a single base. I have corrected this.
Please use these instructions to add images to your original post from your FastQC scans so we can see what your plots look like: How to add images to a Biostars post
There is no gold standard for Q scores. It all depends on what you need to do. In general, if you are aligning to a good reference genome then you can get away without doing any Q-score filtering. Q scores down to Q10-15 will work in this case.
If you are planning on doing de novo assembly work then it would be justifiable to use Q-score filtering at Q25 and above.
Thanks re the gold standard and I have now saved and uploaded all of the relevant plots!
Have you scanned/trimmed this data for presence of Illumina adapter sequences? It is possible that you have a few short reads and the whiskers may represent those bases (which generally will get low Q scores). In general your data has median Q scores of >32 as expected. Is this a new dataset with Sanger encoding or are you using something downloaded from SRA?
I also recommend that you take a look at bbduk.sh program from BBMap suite. It is a scan/trim program among other things and has easily understandable options.