Sam And Reverse Complements
2
2
Entering edit mode
12.6 years ago
ilonpilaaja ▴ 50

Assume a read aligner (such as e.g. BWA) is fed with a pattern P = GACT. Now, the text that BWA has preprocssed is NNNAGTCNNN. The aligner would not find the pattern, and then would fall back to reverse-complementing it to AGTC and searching again. The question is, what it would report in the SAM-file? Would it output POS = 3, CIGAR=4M and FLAG with a bit 0x10 set? EDIT: To make the question complete and unambiguous: what if the text indexed is T = NNNAGTTNNN, so that actually we have to align the reverse complement with error? Is it so that CIGAR = 3M1S?

sam • 8.9k views
ADD COMMENT
8
Entering edit mode
12.6 years ago
Vikas Bansal ★ 2.4k

The sam output would be POS=4 and cigar=4m. I think your example should give-

P    16     chrname     4       Mappingquality      4M   .    .    .    AGCT       .   .   .   .   .

So the first column is your query name, 16 will tell you that read mapped to reverse strand (0 is for forward strand) , your reference sequence name (eg chromosome), position, mapping quality, cigar (4 alignments) , other 3 columns for mate reference, position and Tlen and your query sequence mentioned here will not be GACT , it will be the reverse compliment of it i:e AGCT . All fields for sam are described here.

EDIT: I am not getting your question. But I tried.

Ok, so I just ran BWA (default parameters) with genome file-

>vik 
NNNNNAGTTNNNNN

Reads -

>p 
GACT
>p2
AGTC

SAM out put :

@SQ     SN:vik  LN:14
p       16      vik     6       25      4M      *       0       0       AGTC    *       XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:3T0
p2      0       vik     6       25      4M      *       0       0       AGTC    *       XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:3T0

But why would you expect CIGAR = 3M1S . I think S is for soft clipping, like if you have aaaaaGCATCATG read, then some aligners will ignore them and CIGAR will start with 5S...

ADD COMMENT
0
Entering edit mode

Although this is an Q&A place rather than a forum, still I've updated my question to be more comprehensive and complete. Thank you.

ADD REPLY
0
Entering edit mode

I tried. See my edit.

ADD REPLY
0
Entering edit mode

Ah, something has short-circuited in my mind and I was assuming S is for "substitution". Now fixed.

ADD REPLY
2
Entering edit mode
12.6 years ago
ilonpilaaja ▴ 50

[Inspired by Vik's reply above] You may find these examples useful: Reference:

>vik
NNNNNNGTCCATGCAATTTTAAGACTTGAACCTGTGATCTGAAACCCCCTTGACTGATTACAGTCAGTNNNNN`

Query file:

>p
CATTACA
>q
TTAATC
>r
TGCTAATC
>s
GCATTACA
>t
CAATTTTTAAGACTTGAA
>u
GCAATTTAGACTTGAACCTGTGATCTG
>v
TCAATTTAGACTTGAACCTGTGATCTG
>w
AAGTCAAAATTGC

and run script ./bwa index ref.fa\\ ./bwa aln -n 100 -O 1 -E 1 ref.fa reads1.fq.gz > suff.sai\\ ./bwa samse ref.fa suff.sai reads1.fq.gz > aln.sam The output is (of which only the POS, SEQ, FLAG and CIGAR fields are of interest )

@SQ SN:vik  LN:73
@PG ID:bwa  PN:bwa  VN:0.6.1-r104
p   0   vik 56  20  7M  *   0   0   CATTACA *   XT:A:U  NM:i:1  X0:i:1  X1:i:2  XM:i:1  XO:i:0  XG:i:0  MD:Z:0G6    XA:Z:vik,-12,7M,2;vik,-33,7M,2;
q   16  vik 56  14  6M  *   0   0   GATTAA  *   XT:A:U  NM:i:1  X0:i:1  X1:i:8  XM:i:1  XO:i:0  XG:i:0  MD:Z:5C0
r   0   vik 12  0   8M  *   0   0   TGCTAATC    *   XT:A:R  NM:i:3  X0:i:4  X1:i:14 XM:i:3  XO:i:0  XG:i:0  MD:Z:3A1T1T0
s   0   vik 55  16  8M  *   0   0   GCATTACA    *   XT:A:U  NM:i:2  X0:i:1  X1:i:5  XM:i:2  XO:i:0  XG:i:0  MD:Z:0T0G6
t   0   vik 14  37  3M1I14M *   0   0   CAATTTTTAAGACTTGAA  *   XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:0  XO:i:1  XG:i:1  MD:Z:17
u   0   vik 13  37  7M2D20M *   0   0   GCAATTTAGACTTGAACCTGTGATCTG *   XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:0  XO:i:1  XG:i:2  MD:Z:7^TA20
v   0   vik 13  37  7M2D20M *   0   0   TCAATTTAGACTTGAACCTGTGATCTG *   XT:A:U  NM:i:3  X0:i:1  X1:i:0  XM:i:1  XO:i:1  XG:i:2  MD:Z:0G6^TA20
w   16  vik 13  37  8M2D5M  *   0   0   GCAATTTTGACTT   *   XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:0  XO:i:1  XG:i:2  MD:Z:8^AA5

To sum up: for reverse patterns only the FLAG field is affected. In the SEQ fields one reports the actual pattern (which is the pattern queried for for the "direct" case, and the pattern's reverse complement (rc) for the rc case)

ADD COMMENT

Login before adding your answer.

Traffic: 1569 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6