Hello guys,
I have a sample.fastq file containing 20,000,000 reads. After doing a fastqc I notice an over-represented sequence "GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA" with a count of 50,000 according to fastqc.
For some reasons I am interested to persue an analysis without the reads containing this sequence.
However, grep -c show a higher count...
grep -c 'GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA' sample.fastq 167498
Anyway I have ran the following using fastx toolkit in the purpose to delete these reads (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) :
fastx_clipper -a GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA -C -i sample.fq -o clipped.fastq
grep -c 'GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA' clipped.fq give 0 which is good. BUT when I count the number of reads it s only about 12,000,000 meaning 8,000,000 reads have been discarded or I expected a maximum of 167498 reads being discarded. (maybie the tools also deleted reads containing subsequences of this sequence ?).
Does anyone know how to fix it or know another tool more adapted to my problem ?
Thank you genomax ! it deleted all the reads containing the sequence. However it has deleted 400,000 reads instead of 167,498. I am not sure bu I think it is relative to k-mer size I set it to maximum and it has deleted 300,000 reads. (which is better than fastx !)
But if it has deleted also reads with very close sequence too or with subsequences not too short it's fine to me !
Take a look at the inline options for
bbduk.sh
to adjust the parameters to suite you needs. You can sethdist=0
to allow only perfect matches. Also consider thereverse complement
situation I described above.grep
only looks at direct matches so it is not going to understand reverse complement of sequence of interest, if present in your reads. Sobbduk.sh
is doing the right thing.