I downloaded data from the SRA database, and FastQC reports many overrepresented sequences with no hits. I BLASTed some of them and they match rRNA and mtDNA. The per-sequence GC content looks odd because of these contaminants. Should I trim them out before alignment, or should I just ignore them? I believe they will not align to the reference genome, will they?
The read quality is good throughout. My objective is to analyze gene expression, and I'm going to align the reads to gene regions. So, could those rRNA sequences interfere with my analysis? They are not supposed to align to genes, right?
They will align to the rRNA genes when you align to the genome. That will probably skew quantification, so remove the rRNA reads first and then map the remaining reads.
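If it helps, here is a minimal sketch of the "remove the rRNA reads, keep the rest" step in Python. It assumes you already have a set of read IDs that were flagged as rRNA (for example, from aligning against rRNA sequences beforehand); how you obtain that set is up to you. The names `parse_fastq`, `read_id`, `drop_rrna_reads`, and `rrna_ids` are made up for illustration, and the parser assumes plain 4-line FASTQ records.

```python
from typing import Iterable, Iterator, Set, Tuple

# (header, sequence, plus line, quality) for one read
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly 4 lines per read."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines in fours;
    # an incomplete trailing record is silently dropped.
    for header, seq, plus, qual in zip(it, it, it, it):
        yield header, seq, plus, qual


def read_id(header: str) -> str:
    """Return the read ID: drop the leading '@' and any description."""
    return header[1:].split()[0]


def drop_rrna_reads(
    records: Iterable[FastqRecord], rrna_ids: Set[str]
) -> Iterator[FastqRecord]:
    """Yield only the records whose ID is NOT in the rRNA ID set."""
    for rec in records:
        if read_id(rec[0]) not in rrna_ids:
            yield rec


# Tiny made-up example: read2 is assumed to have been flagged as rRNA.
example = """@read1 some description
ACGT
+
IIII
@read2
GGCC
+
IIII
@read3
TTAA
+
IIII
""".splitlines()

rrna_ids = {"read2"}  # hypothetical set of rRNA read IDs
kept = list(drop_rrna_reads(parse_fastq(example), rrna_ids))
kept_ids = [read_id(rec[0]) for rec in kept]
print(kept_ids)
```

For real data you would read from the (possibly gzipped) FASTQ file and write the kept records back out, four lines per record; for paired-end data, drop both mates of a flagged pair so the two files stay in sync.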
I think how well they align (for human) depends on whether you include the unassembled contigs in the reference. In reality, the rRNA genes are on several different chromosomes and on MT, but most of my rRNA reads align to one of those extra contigs. Not everyone uses a reference version that includes all the extra contigs at the end, so if you have a lot of rRNA reads you could see different alignment rates depending on which reference you use.
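To see where your own reads end up, you can tally mapped reads per reference sequence from the SAM output. This is only a rough sketch: `count_reads_per_reference` is a name I made up, the contig names in the example are invented, and it assumes uncompressed SAM text with the standard 11 mandatory tab-separated fields.

```python
from collections import Counter
from typing import Iterable


def count_reads_per_reference(sam_lines: Iterable[str]) -> Counter:
    """Count primary mapped alignments per reference name (SAM column 3)."""
    counts: Counter = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        flag = int(fields[1])
        rname = fields[2]
        # 0x4 = read unmapped; '*' means no reference name.
        if flag & 0x4 or rname == "*":
            continue
        # 0x100 = secondary, 0x800 = supplementary: count each read once.
        if flag & 0x900:
            continue
        counts[rname] += 1
    return counts


# Made-up alignments; 'extra_contig_1' stands in for an extra contig.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\textra_contig_1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\textra_contig_1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchrM\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",   # unmapped
    "r2\t256\tchr1\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",  # secondary
]

per_reference = count_reads_per_reference(example_sam)
for name, n in per_reference.most_common():
    print(name, n)
```

If one of the extra contigs sits near the top of that list for your data, that is a hint that a reference without those contigs would give you a different alignment rate.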