Remove reads containing a specific sequence.
1
0
Entering edit mode
4.0 years ago
agtbeeman • 0

Hello guys,

I have a sample.fastq file containing 20,000,000 reads. After doing a fastqc I notice an over-represented sequence "GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA" with a count of 50,000 according to fastqc.

For some reasons I am interested to persue an analysis without the reads containing this sequence.

  • However, grep -c show a higher count...

    grep -c  'GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA' sample.fastq
    167498
    
  • Anyway I have ran the following using fastx toolkit in the purpose to delete these reads (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) :

    fastx_clipper -a GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA -C -i sample.fq -o clipped.fastq
    

grep -c 'GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA' clipped.fq give 0 which is good. BUT when I count the number of reads it s only about 12,000,000 meaning 8,000,000 reads have been discarded or I expected a maximum of 167498 reads being discarded. (maybie the tools also deleted reads containing subsequences of this sequence ?).

Does anyone know how to fix it or know another tool more adapted to my problem ?

rna-seq sequence software error • 1.4k views
ADD COMMENT
1
Entering edit mode
4.0 years ago
GenoMax 147k

You can use bbduk.sh from BBMap suite in filter mode to eliminate reads containing these sequences. A guide is available for bbduk.

Basically something like following will work:

bbduk.sh -Xmx2g in=your.fastq outu=clean.fastq literal=GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA k=15

Add rcomp=f if you want to remove only exact sequence and not its reverse complement.

ADD COMMENT
0
Entering edit mode

Thank you genomax ! it deleted all the reads containing the sequence. However it has deleted 400,000 reads instead of 167,498. I am not sure bu I think it is relative to k-mer size I set it to maximum and it has deleted 300,000 reads. (which is better than fastx !)

But if it has deleted also reads with very close sequence too or with subsequences not too short it's fine to me !

ADD REPLY
1
Entering edit mode

Take a look at the inline options for bbduk.sh to adjust the parameters to suite you needs. You can set hdist=0 to allow only perfect matches. Also consider the reverse complement situation I described above. grep only looks at direct matches so it is not going to understand reverse complement of sequence of interest, if present in your reads. So bbduk.sh is doing the right thing.

ADD REPLY

Login before adding your answer.

Traffic: 1664 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6