How does a skipped region from a CIGAR string (N) look in the alignment?
1
3
Entering edit mode
9.5 years ago
Niek De Klein ★ 2.6k

I want to know how a skipped region in the reference, or N in the CIGAR string, looks in the alignment. To try and explain what I mean I use the example provided from the SAM format specification (http://genome.sph.umich.edu/wiki/SAM), which does not include an N example:

Ref + read
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT

Alignment
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

Cigar:
POS: 5
CIGAR: 3M1I3M1D5M

Now, in position 11 there is an insertion in the reference sequence. However, I would think that you can't distinguish between a skipped region or an insertion in the reference. Therefore the CIGAR string could also have been 3M1I3M1N5M

So how is it the alignment of a skipped region or an insertion in the reference sequence different? Is it only a skipped region if the C in position 11 is an N?

CIGAR • 5.7k views
ADD COMMENT
4
Entering edit mode
9.5 years ago

There's no a priori way to always distinguish between deletions (D CIGAR operations) and splicing (N CIGAR operations). In practice, most RNAseq aligners (e.g., tophat2 and STAR) have parameters with semi-arbitrary thresholds for the minimum intron size or maximum deletion size. In the case of STAR, any gap less than alignIntronMin (21 bases last I looked) is considered a deletion. I can't recall exactly what the tophat2 option for this is off-hand, but it's in there somewhere.

It's probably worth pointing out that the default values for these might be worth changing in some cases. I suspect that if someone were interested in splicing changes in cancer cells where there are a bunch of deletions that these parameters might need some tweaking (though presumably one would do WGS or WES alongside).

ADD COMMENT
0
Entering edit mode

It was not clear to me that the N CIGAR operations are supposed to represent introns, this makes sense now. Would it make sense to mask known introns in the reference sequence to make the D/N assignment less arbitrary? I just quickly looked for a paper on indel sizes and this one: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557762/ mentions indels of up to 9989 in size, so if I understand you correctly, in case of the STAR default value any of the indels above 21 bases would wrongly be considered an intron?

ADD REPLY
1
Entering edit mode

That's a reasonable approach. It should be noted that things are likely to function differently if one supplies a GTF file than if not. I would presume that if an annotation is available that STAR will look at that to determine possible splicing first, though you'd still be correct that any deletion (or Indel, as you pointed out) >= 21 bases should still be getting classified as a splicing event by default. One possible way around that would be to somehow specify that only annotated exon boundaries are allowed (STAR probably has an option for that already). Realistically speaking though, if people are really interested in finding indels they should probably just sequence the DNA rather than RNA. Then any apparent even like this will be a deletion and splicing events wouldn't ever occur.

ADD REPLY

Login before adding your answer.

Traffic: 1654 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6