Question

how to identify where multi-mappers, or "non primary hits" are mapping to

0

Entering edit mode

4.7 years ago

bertb ▴ 20

Hello,

My RNAseq alignment QC is showing me that I have a fair number of "non primary hits", which, as I understand it, means the read has been mapped to multiple locations. (I created a very small subsample file of my .bam file in order to try and find where these reads are mapping.):

bam_stat.py -i file.bam

Load BAM file ...  Done

#==================================================
#All numbers are READ count
#==================================================

Total records:                          116

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        54
Unmapped reads:                         21
mapq < mapq_cut (non-unique):           22

mapq >= mapq_cut (unique):              19
Read-1:                                 9
Read-2:                                 10
Reads map to '+':                       9
Reads map to '-':                       10

This seems to also be resulting in a high fraction of "Unassigned_MultiMapping" fragments when I run featureCounts on the file.

Is there a way to investigate or identify what regions these "non primary hits" are mapping (or trying to) map to?

I've tried running samtools view file.bam | grep -v NH:i:1 | perl -pe 's/AS.+(NH:i:\d+)/\$1/' | cut -f1,10,12 | perl -pe 's/NH:i://' | sort -u -k3,3nr > Multi-mapped.sorted.txt just to find out which reads are multi-mapped and how many times, but the file produced only shows one fragment (which is says is mapped once) and so I suspect there isn't an NH tag on the alignment file. Does the hisat2 aligner, which I used, tag multi-mapped genes, or should I try realigning with something like STAR?

Thank you!

RNA-Seq alignment next-gen • 4.3k views

ADD COMMENT • link updated 4.6 years ago by Biostar 20 • written 4.7 years ago by bertb ▴ 20

0

Entering edit mode

Non-primary may mean secondary (i.e., a read mapping equally well to multiple locations), or may mean supplementary (i.e., part of a read that cannot be represented as a linear alignment).

I am not familiar with hisat, but you should have shown the command you used for alignment, as it may help in interpreting what yu are seeing. Also, due to splicing, I believe RNAseq reads will often map as chimeric reads, thus your observation is perfectly normal.

For some background, see How should supplementary alignments be flagged?

ADD REPLY • link 4.7 years ago by h.mon 35k

1

Entering edit mode

Spliced reads aren't mapped as chimeric - the just have Ns in their cigar strings.

ADD REPLY • link 4.7 years ago by i.sudbery 21k

0

Entering edit mode

Thank you for your response.

The command I used for alignment was:

hisat2 -p 8 --rg-id=UWN_t2 --rg SM:UWN_t2 --rg LB:UWN_t2 --rg PL:ILLUMINA --rg PU:CW9PNANXS.8 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/R1.fastq.gz -2 $RNA_DATA_DIR/R2.fastq.gz -S ./UWN_t2.sam

I understand that some degree of non primary alignments are expected, though I should note that I am working with yeast, which has very few introns and splicing events.

My concern is that when I attempt to perform a count with featureCounts, I have what seems like a very low assignment rate:

featureCounts -T 8 -p -a ~/path/sacCer3.ncbiRefSeq.gtf -t exon -g gene_id -o ~/path/counts.txt UWN_t2.bam UMN_t2.bam

|| Load annotation file sacCer3.ncbiRefSeq.gtf ...                            ||
||    Features : 6483                                                         ||
||    Meta-features : 6120                                                    ||
||    Chromosomes/contigs : 17                                                ||
||                                                                            ||
|| Process BAM file UWN_t2.bam...                                             ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 94593515                                             ||
||    Successfully assigned alignments : 6411675 (6.8%)                       ||
||    Running time : 56.70 minutes                                            ||
||                                                                            ||
|| Process BAM file UMN_t2.bam...                                             ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 85660192                                             ||
||    Successfully assigned alignments : 9047788 (10.6%)                      ||
||    Running time : 48.78 minutes  


Status  UWN_t2.bam  UMN_t2.bam
Assigned    6411675 9047788
Unassigned_Unmapped 11911171    8087631
Unassigned_Read_Type    0   0
Unassigned_Singleton    0   0
Unassigned_MappingQuality   0   0
Unassigned_Chimera  0   0
Unassigned_FragmentLength   0   0
Unassigned_Duplicate    0   0
Unassigned_MultiMapping 71367717    62764277
Unassigned_Secondary    0   0
Unassigned_NonSplit 0   0
Unassigned_NoFeatures   1562134 2266206
Unassigned_Overlapping_Length   0   0
Unassigned_Ambiguity    3340818 3494290

I also notice that when I allow multimapping, my assignment rate improves a little bit (still less than 20%), but a large amount of the Unassigned_MultiMapping reads appear to shift to Unassigned_Ambiguity category:

featureCounts -T 8 -p -M -a ~/path/sacCer3.ncbiRefSeq.gtf -t exon -g gene_id -o ~/path/counts_multi.txt UWN_t2.bam UMN_t2.bam

Status  UWN_t2.bam  UMN_t2.bam
Assigned    15023372    14980685
Unassigned_Unmapped 11911171    8087631
Unassigned_Read_Type    0   0
Unassigned_Singleton    0   0
Unassigned_MappingQuality   0   0
Unassigned_Chimera  0   0
Unassigned_FragmentLength   0   0
Unassigned_Duplicate    0   0
Unassigned_MultiMapping 0   0
Unassigned_Secondary    0   0
Unassigned_NonSplit 0   0
Unassigned_NoFeatures   3746435 4156863
Unassigned_Overlapping_Length   0   0
Unassigned_Ambiguity    63912537    58435013

I have also run featureCounts both with and without rrna filters in the .gtf file, and I get similarly poor assignment rates (less than 15%).

This still seems like an unreasonably low assignment rate. I'm open to any recommendations on either how to investigate my low assignment rate (and see the different locations the reads are attempting to be assigned to), or whether this is within the normal range of acceptability.

Thanks again,

ADD REPLY • link 4.7 years ago by bertb ▴ 20

0

Entering edit mode

Try getting hold of a GTF file of repeat annotations, and quantifying against that.

ADD REPLY • link 4.7 years ago by i.sudbery 21k

0

Entering edit mode

Thanks for your reply.

I apologize if I sound like I keep repeating myself, but is there any way to identify the multiple locations that the reads are trying to assign to?

ADD REPLY • link 4.7 years ago by bertb ▴ 20

1

Entering edit mode

So the short answer to this is no. As far as I'm aware, no one has written anything that can tell you what things your multi-mapping reads are mapping to. Really, the only way to do this is to look for high mapping rates to things that one would expect to generate multi-mappers (hence my suggestion to count over repeat annotation).

You can also configure featureCounts to not mark things that map to mulitple transcripts as "ambiguous". Maybe that might help you to work out where these reads are coming from, but my guess would be that they are coming from some sort of repetitive sequence.

ADD REPLY • link 4.7 years ago by i.sudbery 21k

0

Entering edit mode

Ok, good to know. Thanks for your response. 2 quick points that maybe you can help with: 1. When I ran featureCounts with option -M to include multimappers in the count (see above for code & output), it got rid of all the multimappers category, but instead of assigning them to multiple genes, as I would expect, they seem to all get moved into the Unassigned_Ambiguity category. Is there a reason it's doing this, or am I missing something? And is there a way to just include primary mappers, and exclude all non-primary mappers? 2. Can you explain what you mean by looking for mapping rates over repeat annotations? As far as I know, the .gtf file I'm working with does not contain any repeat features, even in the rrna region. For example, if a multimapper maps to both gene a and gene b, then shouldn't featureCounts just assign a count of 1 to both genes?

Thanks again, this forum is so helpful

ADD REPLY • link 4.7 years ago by bertb ▴ 20

1

Entering edit mode

it got rid of all the multimappers category, but instead of assigning them to multiple genes, as I would expect, they seem to all get moved into the Unassigned_Ambiguity category. Is there a reason it's doing this, or am I missing something?

AFAIK Unassigned_Ambiguity means not that the read aligned to more than one gene, but that a single alignment overlapped more than one gene. So iether you have more than one gene overlapping the same location on the genome, or the two genes are close enough that a single read can span them both.

As far as I know, the .gtf file I'm working with does not contain any repeat features, even in the rrna region.

Yes. You'll generally have to get hold of a specific GTF file of repeat annotations for the genome in question. I don't precisely know where to get this for cerevisiae, but if i were looking I'd start with the UCSC table browser.

And is there a way to just include primary mappers,

Filter the BAM file with samtools first. Flag 256 is "not primary alignment".

ADD REPLY • link 4.7 years ago by i.sudbery 21k

0

Entering edit mode

Ok. That's a lot of helpful information. Thanks again

ADD REPLY • link 4.7 years ago by bertb ▴ 20

0

Entering edit mode

If you are open to trying a different aligner then you can try using bbmap.sh from BBMap suite. It has an option to handle ambiguous alignments for reads . Set it to all to get all possible sites where a read maps well to. You should also set secondary=t to print those alignments.

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with
                        multiple top-scoring mapping locations).
                            best    (use the first best site)
                            toss    (consider unmapped)
                            random  (select one top-scoring site randomly)
                            all     (retain all top-scoring sites)

ADD REPLY • link 4.7 years ago by GenoMax 152k