Entering edit mode
5.8 years ago
O.rka
▴
740
I'm using pysam
to calculate the percent identity from each mapped read in a samfile. I tried to get some of the more complicated CIGAR strings so I did len(CIGAR) > 4
. I can't figure out how to calculate the percent identity from this. It would make sense to get the number of matches - the number of mismatches / alignment length but how can I get this from the CIGAR string?
How will indels affect the percent identity? Is there a method to do this straight from pysam
?
In [39]: import pysam
In [40]: samfile = pysam.AlignmentFile("mapped.sam")
In [41]: i = 0
...: for read in samfile.fetch():
...: if not read.is_unmapped:
...: cigar = read.cigarstring
...: if len(cigar) > 4:
...: print(read.cigarstring)
...: print(read.get_cigar_stats())
...: print(read.cigartuples)
...: print(read)
...: print()
...: i += 1
...: if i == 20:
...: break
5S55M
(array('I', [55, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
[(4, 5), (0, 55)]
HJGKNBGX5:1:11101:10032:11438 0 37 1411003 60 5S55M -1 -1 55 ACGTGAATTCCTGCACACTGTTGAAATCGTGGACAGTCAGACTGCCTGCTCAAATTGAAC array('B', [32, 32, 32, 32, 32, 36, 14, 27, 36, 14, 36, 32, 14, 36, 36, 14, 14, 36, 36, 32, 14, 32, 27, 36, 14, 36, 36, 36, 14, 36, 27, 32, 36, 36, 27, 14, 32, 27, 32, 32, 27, 36, 27, 14, 36, 36, 32, 27, 36, 36, 36, 36, 36, 36, 27, 36, 32, 36, 36, 32]) [('AS', -8), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '7C47'), ('YT', 'UU'), ('NH', 1)]
54M6S
(array('I', [54, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
[(0, 54), (4, 6)]
HJGKNBGX5:1:11101:10044:3447 16 204 64576 60 54M6S -1 -1 54 ATCGAGAGACTGTAGCCGAAAGGGAAAACAGTGGTCGAGTTGCATAACGTCTGCTCATTC array('B', [36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 14, 36, 36, 32, 14, 36, 36, 36, 36, 36, 14, 36, 36, 36, 32, 14, 36, 36, 36, 27, 14, 36, 36, 36, 27, 36, 36, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 32, 32, 32, 32, 32]) [('AS', -6), ('XN', 0), ('XM', 0), ('XO', 0), ('XG', 0), ('NM', 0), ('MD', '54'), ('YT', 'UU'), ('NH', 1)]
46M79N14M
(array('I', [60, 0, 0, 79, 0, 0, 0, 0, 0, 0, 3]), array('I', [2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))
[(0, 46), (3, 79), (0, 14)]
HJGKNBGX5:1:11101:10029:1903 0 144 1949847 60 46M79N14M -1 -1 60 GAATAATTCGGTTACGTTGGCCCAGTTGACATCACAAAGCTCAAATGGAGAAAAGAAGCC array('B', [14, 21, 14, 14, 32, 14, 32, 36, 36, 14, 14, 14, 14, 14, 14, 27, 32, 36, 36, 32, 14, 14, 14, 14, 36, 14, 27, 32, 14, 14, 14, 36, 32, 14, 32, 36, 14, 14, 36, 36, 36, 36, 36, 27, 14, 14, 36, 36, 36, 14, 32, 14, 32, 36, 32, 14, 32, 36, 36, 27]) [('AS', -9), ('XN', 0), ('XM', 3), ('XO', 0), ('XG', 0), ('NM', 3), ('MD', '12C0C15A30'), ('YT', 'UU'), ('XS', '+'), ('NH', 1)]
you cannot use such cigar strings: The M operator can be a match(=) or a mismatch(X). See also the sam attribute 'NM:i:' ( "Edit distance to the reference" )
The blog post The history the MD tag and the CIGAR X operator explains why the CIGAR shouldn't be used to calculate percent identity, but a TL;DR is:
I've changed the question to how to calculate from sam file. Is there a straight forward way to do this from the sam record?
maybe you could use a combination of the MD field and cigar string, explained here https://zombieprocess.wordpress.com/2013/05/21/calculating-percent-identity-from-sam-files/
How should indels be handled?
Just out of curiosity, why do you need this?
This pipeline I use relies on CLC mapper which has a percent identity parameter. It seems really useful and I want to make a public version that only uses open source tools so I can throw it on my GitHub. I want to make it so you can use any mapping program and then you can filter out the reads that don't match a certain percent identity post hoc. For example, with environmental metagenomics you might want a lower threshold but with single cell stuff maybe you want it to be higher percent identity (maybe those are poor examples).
That would not be very simple to do with large BAM files. It may just be easier/faster to run alignments at two (or more) thresholds and let people pick one they like. Not sure how much difference this will make on final number of alignments. If you go through that exercise post the results here.
With BBMap for example you can set it while aligning:
Yea BBMap is definitely my go for 99% of my mapping jobs mostly because of this feature (and it's super fast, easy to use, and reliable). One thing about BBMap is it can't take in GTF/GFF3 files for exon mapping like HISAT2 & STAR.
You could post-filter your BBMap alignments using
samtools view regions
with bed files for the exons.Nice, I didn't know you could do that! So you're suggesting for eukaryotic organisms with lots of splice sites to map directly to transcript, then filter with samtools view regions?