Question

Typical percentage of multimapping reads in human rna-seq?

1

Entering edit mode

2.2 years ago

srhic ▴ 70

Hello,

I am doing some rna-seq analysis specific to transposable elements and repeats. Since we are interested in repeat elements, we are focusing on retaining multi-mapping reads in our data. I have tried using STAR and Bowtie2 for alignment but they give me very different percentages of multimapping reads.

With STAR I typically get around 5% of reads as multimapping. Even when I increase the parameter --outFilterMultiMapNmax to a high value such as 100 or 1000, this percentage doesn't change too much. With Bowtie2, on the other hand I get 20-30% reads labelled as aligning to multiple positions. I find this very confusing. Since the human genome is around 50% repetitive elements shouldn't the percentage of multimapping reads be much higher than what STAR is giving me?

Any one know what the typical percentage of multimapping reads should be when aligning rna-seq data to the human genome?

Thanks

RNA-seq alignment Bowtie2 STAR • 3.3k views

ADD COMMENT • link updated 13 months ago by predeus ★ 2.1k • written 2.2 years ago by srhic ▴ 70

0

Entering edit mode

How are you judging the multi-mapping? Are you looking at NH tag?

ADD REPLY • link 2.2 years ago by GenoMax 153k

0

Entering edit mode

No, I am looking at the percentage of multimapping reads reported in the log files produced by STAR/Bowtie2. Is this not the correct approach?

ADD REPLY • link 2.2 years ago by srhic ▴ 70

0

Entering edit mode

Since the human genome is around 50% repetitive elements

You know you're aligning RNAseq?

ADD REPLY • link 2.2 years ago by Shred ★ 1.6k

0

Entering edit mode

Oh, I should have mentioned I also have h3k9me3 chip-seq which is known to bind repeats. I see similar percentage of multimappers in that data. And even for RNA-Seq, I was expecting a higher percentage as literature suggests that most of the genome is transcribed.

ADD REPLY • link 2.2 years ago by srhic ▴ 70

0

Entering edit mode

Most of the genome is transcribed _at some point in some cell types_ - which absolutely should not be interpreted like "most of the genome is transcribed all of the time". A typical cell expresses about 6,000 genes, and the distribution of those is ~ negative binomial, meaning that 10% of the highly expressed genes make up about 90% of all RNA-seq reads.

To address your question, there are few points to make:

Fraction of multimappers in genome reads is dependent on your fragment size (for paired-end reads) and your read size (for single-end reads). The shorter both are, the more multimappers you would get, because shorter reads generate more ambiguity.
Fraction of multimappers in RNA-seq depends on which genes are expressed, and how well did rRNA depletion worked. Most of the reads that are multimappers in RNA-seq are in fact reads that map to rRNA. Few protein coding genes are so similar that they could not be distinguished with a paired-end read - maybe about 1 to 2%.
Bowtie2 is not the right tool to map RNA-seq because it would map spliced reads incorrectly. This is probably what leads to this high number of multimappers - it could only map a small portion of a read, and, like I mentioned before, shorter reads lead to more multimappers.
Transposable elements are not strongly expressed, and would not influence the overall picture too much. I would recommend reading papers describing RNA-seq applications to find transposable elements and repeats, there are quite a few of those.

ADD REPLY • link 13 months ago by predeus ★ 2.1k

0

Entering edit mode

I see ~8% multiple mappers with Ribo-depleted, PE150 RNAseq, with mouse. (STAR) I do see about 50% multi-mappers with K9me3 (bowtie2)

Double checking you also adjusted --winAnchorMultimapNmax ?

To answer the difference between STAR and bowtie2, have you compared the overall mapping percentage? Is STAR assigning the same reads as unique or as unmapped?

Most of the genome is transcribed, but the level of transcription of each region or gene is not the same. Many repetitive elements will be silenced or rapidly degraded after transcription, so you would expect to see a low percentage of these reads in your RNA-seq.

ADD REPLY • link 2.2 years ago by rfran010 ★ 1.6k

0

Entering edit mode

Thanks. It is good to know that you are getting a similar percentage of multimappers with STAR. I analyzed some published datasets and STAR is giving me 5-15% multimappers for both ChIP-Seq and RNA-Seq data. H3k9me3 ChIP-Seq is giving me ~12% in STAR and ~40% in Bowtie2. And yes I adjusted --winAnchorMultimapNmax as well. STAR is not giving me many unmapped reads so it is probably assigning them as unique but I am not sure which of the aligners is giving the "correct" percentage of multimappers.

ADD REPLY • link 2.2 years ago by srhic ▴ 70

0

Entering edit mode

STAR is splice aware where as bowtie2 is not. So some of the difference may be because of that. Techniques where you are looking for specific areas of sequence enrichment will likely show multi-mapping since the sequences being enriched may be shorter compared toRNAseq. If you want you could use bbmap.sh (the aligner from BBMap suite). It is splice aware and using the option ambig=all will report every location that a read maps equally well to.

While imperfect you may need to depend on MAPQ to verify multi=mapping (https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/ )

ADD REPLY • link 2.2 years ago by GenoMax 153k