Question

How does a skipped region from a CIGAR string (N) look in the alignment?

3

9.5 years ago

Niek De Klein ★ 2.6k

I want to know how a skipped region in the reference, or N in the CIGAR string, looks in the alignment. To try and explain what I mean I use the example provided from the SAM format specification (http://genome.sph.umich.edu/wiki/SAM), which does not include an N example:

Ref + read
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT

Alignment
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

Cigar:
POS: 5
CIGAR: 3M1I3M1D5M

Now, at position 11 the reference contains a base (C) that is absent from the read, i.e. an insertion in the reference sequence, which the CIGAR string records as 1D. However, I would think that you can't distinguish between a skipped region and an insertion in the reference. Therefore the CIGAR string could also have been 3M1I3M1N5M.

So how does the alignment of a skipped region differ from that of an insertion in the reference sequence? Is it only a skipped region if the C in position 11 is an N?
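To make the comparison concrete, here is a minimal Python sketch that expands a CIGAR string into two padded rows for the example above. The function name `render_alignment` and the display symbols (`*` for padding in the reference row, `-` for D, `.` for N) are illustrative assumptions, not part of any tool. It only handles the M, =, X, I, D and N operations.

```python
import re

# One CIGAR operation is a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

REF = "CCATACTGAACTGACTAAC"   # reference from the example
READ = "ACTAGAATGGCT"         # read from the example


def render_alignment(reference, read, pos, cigar):
    """Return (ref_row, read_row) for a read placed at 1-based `pos`.

    Display symbols are an assumption made for this sketch:
    '*' pads the reference at an insertion, '-' marks D, '.' marks N.
    """
    ref_row, read_row = [], []
    ref_i = pos - 1   # 0-based index into the reference
    read_i = 0        # 0-based index into the read
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":      # consumes reference and read
            ref_row.append(reference[ref_i:ref_i + n])
            read_row.append(read[read_i:read_i + n])
            ref_i += n
            read_i += n
        elif op == "I":      # consumes read only
            ref_row.append("*" * n)
            read_row.append(read[read_i:read_i + n])
            read_i += n
        elif op in "DN":     # consumes reference only
            ref_row.append(reference[ref_i:ref_i + n])
            read_row.append(("-" if op == "D" else ".") * n)
            ref_i += n
        else:
            raise ValueError(f"operation {op!r} is not handled in this sketch")
    return "".join(ref_row), "".join(read_row)


for cigar in ("3M1I3M1D5M", "3M1I3M1N5M"):
    ref_row, read_row = render_alignment(REF, READ, 5, cigar)
    print(cigar)
    print("  ref :", ref_row)
    print("  read:", read_row)
```

D and N go through the same branch: both advance along the reference without consuming read bases, so the two CIGAR strings place every base in the same column and differ only in the symbol chosen for the gap.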

CIGAR • 5.7k views

updated 22 months ago by Ram 44k • written 9.5 years ago by Niek De Klein ★ 2.6k

Ram · Accepted Answer · 2015-05-28

4

9.5 years ago

Devon Ryan 104k

There's no a priori way to always distinguish between deletions (D CIGAR operations) and splicing (N CIGAR operations). In practice, most RNAseq aligners (e.g., tophat2 and STAR) have parameters with semi-arbitrary thresholds for the minimum intron size or maximum deletion size. In the case of STAR, any gap less than alignIntronMin (21 bases last I looked) is considered a deletion. I can't recall exactly what the tophat2 option for this is off-hand, but it's in there somewhere.

It's probably worth pointing out that the default values for these may need changing in some cases. I suspect that anyone interested in splicing changes in cancer cells, where there are a bunch of deletions, might need to tweak these parameters (though presumably one would do WGS or WES alongside).
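The length-threshold rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not STAR's actual code: the function name `classify_gap` is hypothetical, the default of 21 is the alignIntronMin value recalled above, and real aligners also weigh things such as annotations that are ignored here.

```python
def classify_gap(gap_length, align_intron_min=21):
    """Label a reference gap by length alone.

    Gaps shorter than the threshold are called deletions (D),
    everything else is called a skipped region / intron (N).
    The default of 21 is the value quoted in the answer above.
    """
    if gap_length < 1:
        raise ValueError("gap_length must be a positive integer")
    return "N" if gap_length >= align_intron_min else "D"


for gap in (1, 20, 21, 9989):
    print(gap, classify_gap(gap))

# Raising the threshold reclassifies mid-sized gaps as deletions.
print(500, classify_gap(500, align_intron_min=1000))
```

The last line shows the kind of tweak mentioned for deletion-rich samples: with a higher threshold, a 500-base gap is reported as D instead of N.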

updated 22 months ago by Ram 44k • written 9.5 years ago by Devon Ryan 104k

0

It was not clear to me that the N CIGAR operations are supposed to represent introns; this makes sense now. Would it make sense to mask known introns in the reference sequence to make the D/N assignment less arbitrary? I just quickly looked for a paper on indel sizes and this one: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557762/ mentions indels of up to 9989 in size. So if I understand you correctly, with the STAR default value any of the indels above 21 bases would wrongly be considered an intron?

updated 22 months ago by Ram 44k • written 9.5 years ago by Niek De Klein ★ 2.6k

1

That's a reasonable approach. It should be noted that things are likely to function differently if one supplies a GTF file than if not. I would presume that if an annotation is available, STAR will look at that first to determine possible splicing, though you'd still be correct that any deletion (or indel, as you pointed out) >= 21 bases should still be getting classified as a splicing event by default. One possible way around that would be to somehow specify that only annotated exon boundaries are allowed (STAR probably has an option for that already). Realistically speaking though, if people are really interested in finding indels they should probably just sequence the DNA rather than RNA. Then any apparent event like this will be a deletion and splicing events wouldn't ever occur.
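As a rough illustration of the "only annotated boundaries" idea, the Python sketch below calls a gap N only when it coincides exactly with a known intron and calls it D otherwise. Everything here is an assumption for illustration: the function name, the 1-based inclusive coordinates, and the made-up intron set. It is not a description of how STAR uses a GTF file.

```python
def classify_gap_with_annotation(gap_start, gap_length, annotated_introns):
    """Call a gap N only if it matches an annotated intron exactly.

    gap_start         -- 1-based reference position of the first skipped base
    gap_length        -- number of reference bases skipped
    annotated_introns -- set of (start, end) tuples, 1-based and inclusive
                         (a hypothetical structure built from an annotation)
    """
    gap_end = gap_start + gap_length - 1
    return "N" if (gap_start, gap_end) in annotated_introns else "D"


# Made-up annotation with a single intron spanning positions 101..600.
introns = {(101, 600)}

print(classify_gap_with_annotation(101, 500, introns))  # matches the intron
print(classify_gap_with_annotation(101, 499, introns))  # one base short
print(classify_gap_with_annotation(200, 30, introns))   # inside the intron, not a match
```

Requiring an exact match is the strictest possible choice; it would label every novel splice junction as a deletion, which is the trade-off of restricting calls to the annotation.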

updated 22 months ago by Ram 44k • written 9.5 years ago by Devon Ryan 104k