Hello,
My RNAseq alignment QC is showing me that I have a fair number of "non primary hits", which, as I understand it, means the read has been mapped to multiple locations. (I created a very small subsample file of my .bam file in order to try and find where these reads are mapping.):
bam_stat.py -i file.bam
Load BAM file ... Done
#==================================================
#All numbers are READ count
#==================================================
Total records: 116
QC failed: 0
Optical/PCR duplicate: 0
Non primary hits 54
Unmapped reads: 21
mapq < mapq_cut (non-unique): 22
mapq >= mapq_cut (unique): 19
Read-1: 9
Read-2: 10
Reads map to '+': 9
Reads map to '-': 10
This seems to also be resulting in a high fraction of "Unassigned_MultiMapping" fragments when I run featureCounts on the file.
Is there a way to investigate or identify what regions these "non primary hits" are mapping (or trying to) map to?
I've tried running samtools view file.bam | grep -v NH:i:1 | perl -pe 's/AS.+(NH:i:\d+)/\$1/' | cut -f1,10,12 | perl -pe 's/NH:i://' | sort -u -k3,3nr > Multi-mapped.sorted.txt
just to find out which reads are multi-mapped and how many times, but the file produced only shows one fragment (which is says is mapped once) and so I suspect there isn't an NH tag on the alignment file. Does the hisat2 aligner, which I used, tag multi-mapped genes, or should I try realigning with something like STAR?
Thank you!
Non-primary may mean secondary (i.e., a read mapping equally well to multiple locations), or may mean supplementary (i.e., part of a read that cannot be represented as a linear alignment).
I am not familiar with hisat, but you should have shown the command you used for alignment, as it may help in interpreting what yu are seeing. Also, due to splicing, I believe RNAseq reads will often map as chimeric reads, thus your observation is perfectly normal.
For some background, see How should supplementary alignments be flagged?
Spliced reads aren't mapped as chimeric - the just have Ns in their cigar strings.
Thank you for your response.
The command I used for alignment was:
I understand that some degree of non primary alignments are expected, though I should note that I am working with yeast, which has very few introns and splicing events.
My concern is that when I attempt to perform a count with featureCounts, I have what seems like a very low assignment rate:
I also notice that when I allow multimapping, my assignment rate improves a little bit (still less than 20%), but a large amount of the Unassigned_MultiMapping reads appear to shift to Unassigned_Ambiguity category:
I have also run featureCounts both with and without rrna filters in the .gtf file, and I get similarly poor assignment rates (less than 15%).
This still seems like an unreasonably low assignment rate. I'm open to any recommendations on either how to investigate my low assignment rate (and see the different locations the reads are attempting to be assigned to), or whether this is within the normal range of acceptability.
Thanks again,
Try getting hold of a GTF file of repeat annotations, and quantifying against that.
Thanks for your reply.
I apologize if I sound like I keep repeating myself, but is there any way to identify the multiple locations that the reads are trying to assign to?
So the short answer to this is no. As far as I'm aware, no one has written anything that can tell you what things your multi-mapping reads are mapping to. Really, the only way to do this is to look for high mapping rates to things that one would expect to generate multi-mappers (hence my suggestion to count over repeat annotation).
You can also configure featureCounts to not mark things that map to mulitple transcripts as "ambiguous". Maybe that might help you to work out where these reads are coming from, but my guess would be that they are coming from some sort of repetitive sequence.
Ok, good to know. Thanks for your response. 2 quick points that maybe you can help with: 1. When I ran featureCounts with option -M to include multimappers in the count (see above for code & output), it got rid of all the multimappers category, but instead of assigning them to multiple genes, as I would expect, they seem to all get moved into the Unassigned_Ambiguity category. Is there a reason it's doing this, or am I missing something? And is there a way to just include primary mappers, and exclude all non-primary mappers? 2. Can you explain what you mean by looking for mapping rates over repeat annotations? As far as I know, the .gtf file I'm working with does not contain any repeat features, even in the rrna region. For example, if a multimapper maps to both gene a and gene b, then shouldn't featureCounts just assign a count of 1 to both genes?
Thanks again, this forum is so helpful
AFAIK
Unassigned_Ambiguity
means not that the read aligned to more than one gene, but that a single alignment overlapped more than one gene. So iether you have more than one gene overlapping the same location on the genome, or the two genes are close enough that a single read can span them both.Yes. You'll generally have to get hold of a specific GTF file of repeat annotations for the genome in question. I don't precisely know where to get this for cerevisiae, but if i were looking I'd start with the UCSC table browser.
Filter the BAM file with samtools first. Flag 256 is "not primary alignment".
Ok. That's a lot of helpful information. Thanks again
If you are open to trying a different aligner then you can try using
bbmap.sh
from BBMap suite. It has an option to handle ambiguous alignments for reads . Set it toall
to get all possible sites where a read maps well to. You should also setsecondary=t
to print those alignments.