Adding information tags with types, positions and counts of matches and mismatches in every read in the SAM file
0
0
Entering edit mode
3.7 years ago
lechu ▴ 20

Hi, I have the following toy SAM file entry.

1_7906  0   chr1    4771271 60  146M1D37M   *   0   0   AAAAGTTTTTCTGCTTGGGGAAGAAGTTGCCCAGTATGACGGTGCATACAAGGTTAGCAGAGGCCTGTGGAAGAAATACGGTGACAAGAGGATCATCGACACCCCCATCTCAGAGATGGGCTTTGCTGGAATTGCTGTTGGTGCAGCTATGGCTGGGTTGCGGCCCATCTGTGAATTTATGAC FFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:1785   NM:i:2  NH:i:1  XI:f:0.9891 X0:i:1  XE:i:53 XR:i:183    MD:Z:146^C2G34  TC:i:0  RA:Z:44,0,0,0,0,0,35,0,0,0,1,0,56,0,0,0,0,0,47,0,0,0,0,0,0, MP:Z:10:149:150

It has RA and MP tags, but I need a script to get these tags in files that lack them, but have MD:Z tag, CIGAR string, and read sequence.

RA encodes the type of matches/mismathed in the read, as follows:

 #                   Read
 #             A     C     G     T     N
 #      A      0     1     2     3     4
 # R    C      5     6     7     8     9
 # e    G     10    11    12    13    14
 # f    T     15    16    17    18    19
 #      N     20    21    22    23    24

So, RA:Z:44,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, would correspond to 44bp read with only As, and 100% match.

Now MP:

MP:Z:type of mismatch in RA:position in the reference:position in the read

MP:Z:2:10:10 means a mismatch A->C at position 10, and no indels before this position in the alignment (10 and 10)

I thought of doing it in Awk, in the following order:

1 convert cigar to a string equal to the read length (e.g., 44M -> [MMM..M]44)

2 get the mismatch positions from MD:Z tag (using the above toy SAM entry: MD:Z:146^C2G34 gives 150=146+1del+2 (# of bases before the mismatch) + 1 (mismatch position), and the mismatch is G -> A)

3 use the converted cigar to get the corresponding reference index # (in this case one deletion shifts by 1 to the left, so 149) and find the type of the mismatched pair (A to G, C to T, etc.)

4 write the mismatch code to the respective position (10 in the toy case) in RA tag

5 finalize RA construction by adding the "matches", calculating each by subtracting total counts of each base and the mismatches thereof

Before I dive deep into Awk struggle, is that a good idea to use Awk here, and if not, what would be the better way to handle it (would be nice if it could process 50M reads in less than a couple of hrs)

RNA-Seq • 804 views
ADD COMMENT

Login before adding your answer.

Traffic: 1843 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6