I need to create summaries for a set of SAM files produced by TopHat (RNA-Seq spliced mapping). Are there any tools or scripts out there that will give me stats for unique vs. non-unique matches and spliced vs. non-spliced mappings, and finally break the alignments down by number of mismatches?
EDIT: I collected all the flag values from my SAM file and fed them to explain_sam_flags.py (you can get it from the Picard source: http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/scripts/).
None of the flags in TopHat's accepted_hits.sam has the "not primary alignment" bit set. It looks like TopHat reports only unique matches, which is OK for me. Can somebody confirm this?
EDIT 2: "TopHat reports only unique matches" cannot be true. A sequence like the one below has a flag of "0": "CAACAACAGCAACAACAACAGCAACAGCAACAGCAACAGCAACAGCAACAACAA". Puzzling.
EDIT 3 (SAM example)
8_96_444_1622 73 scaffold00005 155754 255 54M * 0 0 ATGTAAAGTATTTCCATGGTACACAGCTTGGTCGTAATGTGATTGCTGAGCCAG BC@B5)5CBBCCBCCCBC@@7C>CBCCBCCC;57)8(@B@B>ABBCBC7BCC=> NM:i:0
8_80_1315_464 81 scaffold00005 155760 255 54M = 154948 0 AGTACCTCCCTGGTACACAGCTTGGTAAAAATGTGATTGCTGAGCCAGACCTTC B?@?BA=>@>>7;ABA?BB@BAA;@BBBBBBAABABBBCABAB?BABA?BBBAB NM:i:0
8_17_1222_1577 73 scaffold00005 155783 255 40M1116N10M * 0 0 GGTAAAAATGTGATTGCTGAGCCAGACCTTCATCATGCAGTGAGAGACGC BB@BA??>CCBA2AAABBBBBBB8A3@BABA;@A:>B=,;@B=A:BAAAA NM:i:0 XS:A:+ NS:i:0
8_43_1211_347 73 scaffold00005 155800 255 23M1116N27M * 0 0 TGAGCCAGACCTTCATCATGCAGTGAGAGACGCAAACATGCTGGTATTTG #>8<=<@6/:@9';@7A@@BAAA@BABBBABBB@=<A@BBBBBBBBCCBB NM:i:2 XS:A:+ NS:i:0
8_32_1091_284 161 scaffold00005 156946 255 54M = 157071 0 CGCAAACATGCTGGTAGCTGTGACACCACATCAACAGCTTGACTATGTTTGTAA BBBBB@AABACBCA8BBBBBABBBB@BBBBBBA@BBBBBBBBBA@:B@AA@=@@ NM:i:0
Two reads, 8_17_1222_1577 and 8_43_1211_347, are spliced (their CIGAR strings, 40M1116N10M and 23M1116N27M, contain an N operation).
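In case it helps anyone answering, here is a minimal Python sketch of the kind of summary I am after. It is not an existing tool, and the function name `summarize_sam` is made up for illustration. It rests on several assumptions: a read counts as "unique" if its (name, mate) pair appears on exactly one alignment line in the file, an alignment counts as spliced if its CIGAR contains an N operation, and the NM tag is used as the mismatch count (strictly speaking NM is an edit distance, so indels would be counted too). Fields are split on any whitespace, which would break on optional tags containing spaces.

```python
from collections import Counter


def summarize_sam(lines):
    """Tally reads and alignments from an iterable of SAM text lines."""
    hits = Counter()        # alignments seen per (read name, mate number)
    spliced = 0
    unspliced = 0
    mismatches = Counter()  # NM value -> number of alignments

    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split()
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        # 0x40 = first in pair, 0x80 = second in pair, otherwise unpaired
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        hits[(fields[0], mate)] += 1

        # Assumption: an N operation in the CIGAR marks a spliced alignment
        if "N" in fields[5]:
            spliced += 1
        else:
            unspliced += 1

        # Optional tags start at the 12th column
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                mismatches[int(tag[5:])] += 1

    unique = sum(1 for n in hits.values() if n == 1)
    return {
        "unique_reads": unique,
        "multi_reads": len(hits) - unique,
        "spliced": spliced,
        "unspliced": unspliced,
        "mismatches": dict(mismatches),
    }
```

Something like `summarize_sam(open("accepted_hits.sam"))` would then return one dictionary per file. Whether counting repeated read names is the right definition of "non-unique" for TopHat output is exactly the part I am unsure about.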
The flag values in the second column of my file are: 65 73 81 83 97 99 113 115 129 137 145 147 161 163 177
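For anyone who wants to check these values without the Picard script, the bits can be decoded with a few lines of Python. This is only a sketch: the names `FLAG_BITS` and `explain_flag` are made up here, and the labels are my own short wording of the bit meanings in the SAM specification, not the output of explain_sam_flags.py.

```python
# Bit values follow the SAM specification; the labels are informal paraphrases.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
]


def explain_flag(flag):
    """Return the labels of all bits that are set in a SAM flag value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    observed = [65, 73, 81, 83, 97, 99, 113, 115,
                129, 137, 145, 147, 161, 163, 177]
    for flag in observed:
        print(flag, ", ".join(explain_flag(flag)))
```

All of the values listed are below 256 (0x100), so the "not primary alignment" bit cannot be set in any of them. That alone does not tell me whether each read was reported only once.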
I just stumbled over the difference between BAM (binary) and SAM (text) formats. Can you give a shortened example of your file?