Hi, I have the following toy SAM file entry.
1_7906 0 chr1 4771271 60 146M1D37M * 0 0 AAAAGTTTTTCTGCTTGGGGAAGAAGTTGCCCAGTATGACGGTGCATACAAGGTTAGCAGAGGCCTGTGGAAGAAATACGGTGACAAGAGGATCATCGACACCCCCATCTCAGAGATGGGCTTTGCTGGAATTGCTGTTGGTGCAGCTATGGCTGGGTTGCGGCCCATCTGTGAATTTATGAC FFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:1785 NM:i:2 NH:i:1 XI:f:0.9891 X0:i:1 XE:i:53 XR:i:183 MD:Z:146^C2G34 TC:i:0 RA:Z:44,0,0,0,0,0,35,0,0,0,1,0,56,0,0,0,0,0,47,0,0,0,0,0,0, MP:Z:10:149:150
It has RA
and MP
tags, but I need a script to get these tags in files that lack them, but have MD:Z
tag, CIGAR string, and read sequence.
RA
encodes the type of matches/mismathed in the read, as follows:
# Read
# A C G T N
# A 0 1 2 3 4
# R C 5 6 7 8 9
# e G 10 11 12 13 14
# f T 15 16 17 18 19
# N 20 21 22 23 24
So, RA:Z:44,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
would correspond to 44bp read with only As, and 100% match.
Now MP:
MP:Z:type of mismatch in RA:position in the reference:position in the read
MP:Z:2:10:10
means a mismatch A->C at position 10, and no indels before this position in the alignment (10 and 10)
I thought of doing it in Awk, in the following order:
1 convert cigar to a string equal to the read length (e.g., 44M -> [MMM..M]44)
2 get the mismatch positions from MD:Z
tag (using the above toy SAM entry: MD:Z:146^C2G34
gives 150=146+1del+2
(# of bases before the mismatch) + 1 (mismatch position), and the mismatch is G -> A)
3 use the converted cigar to get the corresponding reference index # (in this case one deletion shifts by 1 to the left, so 149) and find the type of the mismatched pair (A to G, C to T, etc.)
4 write the mismatch code to the respective position (10 in the toy case) in RA
tag
5 finalize RA
construction by adding the "matches", calculating each by subtracting total counts of each base and the mismatches thereof
Before I dive deep into Awk struggle, is that a good idea to use Awk here, and if not, what would be the better way to handle it (would be nice if it could process 50M reads in less than a couple of hrs)