Hello everyone!
I have paired-end human RNA-seq data mapped with Tophat2 and I want to perform differential expression analysis. I know my data have quite high level of ribosomal RNA contamination, so I want to deal with it.
When using cuffidiff, there is this -M/--mask-file option that allows me to support cuffdiff with "filter" .gtf file. I downloaded such file (rRNA + tRNA + mtRNA) from UCSC genome browser and everything worked fine.
Now I want to perform the same analysis using htseq-count + deseq2 and I wander - is there such easy way to deal with rRNA contamination in this case? After lots of googling, I have 3 ideas:
a) Use htseq-count with .gtf file that simply does not contain rRNA genes
b) Take .fastq files with reads, map them to my "filter.gtf" file first, then take only unmapped reads and map them to reference, use resulting .bam for analysis
c) Subtract reads mapping to regions in my "filter.gtf" from my sample .bam file.
What I've already tried:
a) I don't have such such .gtf file. I use hg19 reference downloaded from here: http://tophat.cbcb.umd.edu/igenomes.shtml which seems to contain rRNA genes.
b) Tried this a while ago and had some problems... anyway, this would be rather long solution, I want to try something easier first.
c) This is my favorite. I used bedtools intersectBed with -v option to subtract reads mapping to regions in "filter.gtf" from my .bam file. But for some read pairs, this removes only one read from the pair and therefore causes htseq-count to raise the famous warning: "Read ... claims to have an aligned mate which could not be found. (is the SAM file properly sorter?)"
So finally, my questions:
1) Do you think c) is the right way of removing rRNA contamination?
2) If so, do you think I should just ignore htseq-count warnings? Or should I try to somehow remove these "orphan" reads without mate? (Because htseq-count treats them as reads with mate not mapped. But when one of the mates is mapped to rRNA region, don't I want to always remove both of them?)
Thanks for your ideas!
I like doing an initial alignment to just rRNA sequences using bowtie2 and then do downstream analysis with unmapped reads. You can nicely see what percentage of each file aligned to rRNA, for one thing. Maybe this is like b, but I'm not sure. I don't usually bother to do this unless there's a lot of rRNA contamination, though.
Have you ever looked what difference it makes wrt DEGs when you remove or not remove rRNA reads? We are using Ribozero with a rather large fraction of rRNA reads (~20-25%), so I am wondering if removing rRNA reads improves results.
Note: we are using DESeq2 to detect differentially expressed genes.