I'm mapping bacterial RNA-seq to the genome and found something very weird. A very abundant RNA has an AATAATAAT repeat somewhere in the middle, and I have a lot of reads that map to the gene (it's paired-end sequencing and the second mate maps), but in many of these reads the number of repeats is larger than in the reference (up to 8 AAT units). Since I have several million reads for this gene (it's a small RNA), I get thousands of such reads. I'm trying to figure out their source, i.e. whether it's biological or just an artifact of the RT/PCR/sequencing etc.
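For illustration, here is roughly how I tallied the repeat-length distribution in the reads (the input file name is just a placeholder for my FASTQ):

```python
# Rough sketch: count the longest run of consecutive AAT units per read.
import gzip
import re
from collections import Counter

def longest_aat_run(seq):
    """Return the largest number of consecutive AAT units in the read."""
    runs = re.findall(r"(?:AAT)+", seq)
    return max((len(r) // 3 for r in runs), default=0)

tally = Counter()
with gzip.open("reads_R1.fastq.gz", "rt") as fq:   # placeholder file name
    for i, line in enumerate(fq):
        if i % 4 == 1:                              # sequence lines in FASTQ
            tally[longest_aat_run(line.strip())] += 1

for n_repeats, n_reads in sorted(tally.items()):
    print(n_repeats, n_reads)
```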
To deal with this I started screening the reads for low complexity and removing them (using a DUST filter), which seems to work. My questions are:
- Is it common to remove low-complexity reads from the data?
- Why should they be removed? Is it because the mapping will be difficult or wrong, or because the reads are probably the result of an error?
Thanks
Are you using a mapping quality filter before generating counts? Using a MAPQ threshold of >10 will remove most low-complexity reads, because they map to many places in the genome and therefore get low mapping quality. HTSeq, BEDTools and featureCounts all have options to do this.
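As a minimal sketch of what I mean, this pysam snippet writes a filtered BAM keeping only reads above a MAPQ cutoff (the BAM file names and the cutoff of 10 are placeholders; you could equally do this with `samtools view -q` or the quality options of your counting tool):

```python
# Drop low-MAPQ alignments before counting.
import pysam

MIN_MAPQ = 10  # assumed cutoff; multi-mapping/low-complexity reads typically get MAPQ near 0

with pysam.AlignmentFile("aligned.bam", "rb") as bam, \
     pysam.AlignmentFile("aligned.mapq10.bam", "wb", template=bam) as out:
    kept = dropped = 0
    for read in bam:
        if read.mapping_quality > MIN_MAPQ:
            out.write(read)
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} reads, dropped {dropped} with MAPQ <= {MIN_MAPQ}")
```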
In that case, "Dusting" or "RepeatMasking" reads would seem appropriate.
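To make the idea concrete, below is a toy approximation of a DUST-style score (based on how skewed the trinucleotide composition of a read is); the actual DUST/sdust implementations used by prinseq and friends differ in the details, so treat this only as an illustration of why pure AAT repeats get flagged:

```python
# Toy DUST-like score: higher values mean lower sequence complexity.
from collections import Counter

def dust_like_score(seq):
    """Score a read by how often its trinucleotides repeat."""
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not triplets:
        return 0.0
    counts = Counter(triplets)
    # Sum of c*(c-1)/2 over observed triplets, normalised by the number of triplets.
    raw = sum(c * (c - 1) / 2 for c in counts.values())
    return raw / len(triplets)

print(dust_like_score("AAT" * 17))                                   # pure repeat -> high score
print(dust_like_score("ACGTGCTAGCTAGGATCCGATTACAGGCATGAGCCACCGCGCC"))  # mixed sequence -> low score
```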