I'm mapping bacterial RNA-seq to the genome and found something very weird. A very abundant RNA has an AATAATAAT repeat somewhere in the middle, and I have a lot of reads that map to the gene (it's paired-end sequencing and the second mate maps), but in many of these reads the number of repeats is larger than in the reference (up to 8 AAT units). Since I have several million reads for this gene (it's a small RNA), I get thousands of such reads. I'm trying to figure out their source, i.e. whether it's biological or just an artifact of the RT/PCR/sequencing etc.
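For illustration, here is roughly how I tallied the repeat-length distribution in the reads (the input file name is just a placeholder for my FASTQ):

```python
# Rough sketch: count the longest run of consecutive AAT units per read.
import gzip
import re
from collections import Counter

def longest_aat_run(seq):
    """Return the largest number of consecutive AAT units in the read."""
    runs = re.findall(r"(?:AAT)+", seq)
    return max((len(r) // 3 for r in runs), default=0)

tally = Counter()
with gzip.open("reads_R1.fastq.gz", "rt") as fq:   # placeholder file name
    for i, line in enumerate(fq):
        if i % 4 == 1:                              # sequence lines in FASTQ
            tally[longest_aat_run(line.strip())] += 1

for n_repeats, n_reads in sorted(tally.items()):
    print(n_repeats, n_reads)
```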
To deal with this I started screening the reads for low complexity and removing them (using a DUST filter), which seems to work. My questions are:
- Is it common to remove low-complexity reads from the data?
- Why should they be removed? Is it because the mapping will be difficult or wrong, or because the reads are probably the result of an error?
Thanks
Are you using a mapping quality filter before generating counts? Using a MAPQ threshold of >10 will remove most low-complexity reads, because they map to many places in the genome and therefore get low mapping quality. HTSeq, BEDTools and featureCounts all have options to do this.
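As a minimal sketch of what I mean, this pysam snippet writes a filtered BAM keeping only reads above a MAPQ cutoff (the BAM file names and the cutoff of 10 are placeholders; you could equally do this with `samtools view -q` or the quality options of your counting tool):

```python
# Drop low-MAPQ alignments before counting.
import pysam

MIN_MAPQ = 10  # assumed cutoff; multi-mapping/low-complexity reads typically get MAPQ near 0

with pysam.AlignmentFile("aligned.bam", "rb") as bam, \
     pysam.AlignmentFile("aligned.mapq10.bam", "wb", template=bam) as out:
    kept = dropped = 0
    for read in bam:
        if read.mapping_quality > MIN_MAPQ:
            out.write(read)
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} reads, dropped {dropped} with MAPQ <= {MIN_MAPQ}")
```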
In that case, "Dusting" or "RepeatMasking" reads would seem appropriate.
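To make the idea concrete, below is a toy approximation of a DUST-style score (based on how skewed the trinucleotide composition of a read is); the actual DUST/sdust implementations used by prinseq and friends differ in the details, so treat this only as an illustration of why pure AAT repeats get flagged:

```python
# Toy DUST-like score: higher values mean lower sequence complexity.
from collections import Counter

def dust_like_score(seq):
    """Score a read by how often its trinucleotides repeat."""
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not triplets:
        return 0.0
    counts = Counter(triplets)
    # Sum of c*(c-1)/2 over observed triplets, normalised by the number of triplets.
    raw = sum(c * (c - 1) / 2 for c in counts.values())
    return raw / len(triplets)

print(dust_like_score("AAT" * 17))                                   # pure repeat -> high score
print(dust_like_score("ACGTGCTAGCTAGGATCCGATTACAGGCATGAGCCACCGCGCC"))  # mixed sequence -> low score
```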