Hello,
I am doing some rna-seq analysis specific to transposable elements and repeats. Since we are interested in repeat elements, we are focusing on retaining multi-mapping reads in our data. I have tried using STAR and Bowtie2 for alignment but they give me very different percentages of multimapping reads.
With STAR I typically get around 5% of reads as multimapping. Even when I increase the parameter --outFilterMultiMapNmax to a high value such as 100 or 1000, this percentage doesn't change too much. With Bowtie2, on the other hand I get 20-30% reads labelled as aligning to multiple positions. I find this very confusing. Since the human genome is around 50% repetitive elements shouldn't the percentage of multimapping reads be much higher than what STAR is giving me?
Any one know what the typical percentage of multimapping reads should be when aligning rna-seq data to the human genome?
Thanks
How are you judging the multi-mapping? Are you looking at
NH
tag?No, I am looking at the percentage of multimapping reads reported in the log files produced by STAR/Bowtie2. Is this not the correct approach?
You know you're aligning RNAseq?
Oh, I should have mentioned I also have h3k9me3 chip-seq which is known to bind repeats. I see similar percentage of multimappers in that data. And even for RNA-Seq, I was expecting a higher percentage as literature suggests that most of the genome is transcribed.
Most of the genome is transcribed _at some point in some cell types_ - which absolutely should not be interpreted like "most of the genome is transcribed all of the time". A typical cell expresses about 6,000 genes, and the distribution of those is ~ negative binomial, meaning that 10% of the highly expressed genes make up about 90% of all RNA-seq reads.
To address your question, there are few points to make:
Bowtie2
is not the right tool to map RNA-seq because it would map spliced reads incorrectly. This is probably what leads to this high number of multimappers - it could only map a small portion of a read, and, like I mentioned before, shorter reads lead to more multimappers.I see ~8% multiple mappers with Ribo-depleted, PE150 RNAseq, with mouse. (STAR) I do see about 50% multi-mappers with K9me3 (bowtie2)
Double checking you also adjusted --winAnchorMultimapNmax ?
To answer the difference between STAR and bowtie2, have you compared the overall mapping percentage? Is STAR assigning the same reads as unique or as unmapped?
Most of the genome is transcribed, but the level of transcription of each region or gene is not the same. Many repetitive elements will be silenced or rapidly degraded after transcription, so you would expect to see a low percentage of these reads in your RNA-seq.
Thanks. It is good to know that you are getting a similar percentage of multimappers with STAR. I analyzed some published datasets and STAR is giving me 5-15% multimappers for both ChIP-Seq and RNA-Seq data. H3k9me3 ChIP-Seq is giving me ~12% in STAR and ~40% in Bowtie2. And yes I adjusted --winAnchorMultimapNmax as well. STAR is not giving me many unmapped reads so it is probably assigning them as unique but I am not sure which of the aligners is giving the "correct" percentage of multimappers.
STAR is splice aware where as bowtie2 is not. So some of the difference may be because of that. Techniques where you are looking for specific areas of sequence enrichment will likely show multi-mapping since the sequences being enriched may be shorter compared toRNAseq. If you want you could use
bbmap.sh
(the aligner from BBMap suite). It is splice aware and using the optionambig=all
will report every location that a read maps equally well to.While imperfect you may need to depend on MAPQ to verify multi=mapping (https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/ )