Dear all
Is there any way to remove entire reads from a FASTQ file if, and only if, the first "x" bases are of bad quality? I will run FastQC on the whole dataset later for a cumulative picture, but right now I am only interested in the barcode region.
Basically, I fear barcode contamination, and I do not wish to keep reads that might have low quality scores in the barcode region (the first 8-10 bp). I do not want to trim the reads; I want to remove them entirely.
I tried the FASTX barcode splitter, but it matches on the barcode sequence itself, whereas I want to filter on the quality of the barcode region.
Sorry if this sounds redundant, but I hope there is a tool that can do this so I can integrate it into a pipeline.
Thank you!
To me, a "barcode" is a marker used to identify species, like CO1; I don't think that is what you mean here. You should check the quality with FastQC first; I have never seen the beginning of a read be of bad quality. Also, do you know the sequences of your "barcodes"? If so, you can split on them and discard the reads where no barcode is found; I don't see why you would really need to look at the quality. And if you have a set of barcodes that you plan to reuse, you should design them so that a single base change can never turn one barcode into another (like the Illumina tags).
But back to the question: I don't know of such a tool, but if you really want to do this and know a threshold, it is pretty easy to write in Python with the help of the Biopython package.
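For example, a minimal sketch (untested; the file names, barcode length, and quality threshold are placeholders you would adapt to your data) could look like this:

    from Bio import SeqIO

    BARCODE_LEN = 8   # length of the barcode region at the read start (adjust)
    MIN_MEAN_Q = 28   # minimum mean Phred quality over that region (adjust)

    def barcode_ok(record):
        # Biopython exposes per-base Phred scores under this key
        quals = record.letter_annotations["phred_quality"][:BARCODE_LEN]
        return sum(quals) / len(quals) >= MIN_MEAN_Q

    # keep only reads whose barcode region passes the quality check
    kept = (rec for rec in SeqIO.parse("input.fastq", "fastq") if barcode_ok(rec))
    SeqIO.write(kept, "filtered.fastq", "fastq")

You could just as easily require every base in the region to pass the threshold instead of the mean; that is stricter and would discard more reads.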
Thanks for the reply @gb. Yes, I mean the same thing by "barcodes": they are used to separate different species/samples. I have a set of 60 barcodes that supposedly remain distinguishable with up to 2 mismatches, and I wanted to test that. So I ended up writing a Perl script to manually check the quality scores of the barcode sequence at the start of each read. And yes, I know the sequences, of course. Thank you for your time.
Seigfried: Out of curiosity, why do you expect low quality for the initial 8-10 bases? Can you also state what your definition of low quality is? I have generally never seen Q scores fall below 30 in the initial bases (they do start out somewhat lower before the basecaller has had a chance to calibrate things).
@genomax Hello. Usually I see the typical trend of Q scores being quite high at the start of a read and then slowly dipping towards the end. We run FastQC on the data as a standard quality-control measure. In my current sequencing samples, I see Q scores dropping below 28 and staying in the mid-quality range (20-28) over the first 10 bp of the read. This got me worried, which led me to investigate further.
If I understand correctly, a low Q score for a base means the signal received for that particular base wasn't good enough. So there is a chance that the base was called wrongly, which might lead to sample mixing if the barcodes aren't distinct enough. Therein lies my problem.
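To put rough numbers on that worry (my own back-of-the-envelope calculation, assuming the standard Phred definition Q = -10*log10(p) and independent errors per base):

    def p_error(q):
        # Phred scale: Q = -10 * log10(p)  =>  p = 10 ** (-q / 10)
        return 10 ** (-q / 10)

    for q in (20, 25, 28, 30):
        p = p_error(q)
        # probability of at least one miscalled base in an 8 bp barcode
        p_any = 1 - (1 - p) ** 8
        print(f"Q{q}: per-base error {p:.4f}, >=1 error in 8 bp: {p_any:.4f}")

Even at Q20 across all 8 bases, the chance of at least one miscall in the barcode is only about 8%, and most single miscalls would still not convert one barcode into another if the set is well separated.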
Do you know how much data you are throwing away with this strategy? Even at a cutoff of Q25, the probability of a basecall being incorrect is only about 1 in 300. Unless you are working with an assay that requires extreme sensitivity (e.g. tumor/normal), you may be throwing away data you could otherwise use. How many total indexes are you working with? Are they at least 1-2 bp distinct from each other at any given position?
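One quick way to check how distinct an index set actually is (a sketch assuming all barcodes have the same length; the three sequences are placeholders for the real set of 60) is to compute the minimum pairwise Hamming distance:

    from itertools import combinations

    def hamming(a, b):
        # number of positions at which two equal-length barcodes differ
        return sum(x != y for x, y in zip(a, b))

    barcodes = ["ACGTACGT", "TGCATGCA", "GGTTCCAA"]  # replace with your 60

    min_d = min(hamming(a, b) for a, b in combinations(barcodes, 2))
    print(f"minimum pairwise Hamming distance: {min_d}")

A minimum distance of d means up to d-1 mismatches can be detected and up to (d-1)//2 corrected unambiguously, so for the claimed tolerance of 2 mismatches you would want a minimum pairwise distance of at least 5.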
@genomax I realised that as well. For a good sequencing run, I found that I discarded around 15% of the raw reads. However, after mapping with Bowtie2, the mapping % differed by only about 2% from the original sample, while the number of mapped reads lost was around 13%.
So basically I'm throwing away 13% of my data, which is really bad. Since the mapping % barely decreased, the filter probably did discard many of the unwanted reads, but the losses outweigh the gains.
I am working with 60 indexes, and they allow up to 2 mismatches.
Thanks for your time; I think this approach isn't feasible.