I want to know how a skipped region in the reference, or N in the CIGAR string, looks in the alignment. To try and explain what I mean I use the example provided from the SAM format specification (http://genome.sph.umich.edu/wiki/SAM), which does not include an N example:
Ref + read
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: ACTAGAATGGCT
Alignment
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G G C T
Cigar:
POS: 5
CIGAR: 3M1I3M1D5M
Now, in position 11 there is an insertion in the reference sequence. However, I would think that you can't distinguish between a skipped region or an insertion in the reference. Therefore the CIGAR string could also have been 3M1I3M1N5M
So how is it the alignment of a skipped region or an insertion in the reference sequence different? Is it only a skipped region if the C in position 11 is an N?
It was not clear to me that the N CIGAR operations are supposed to represent introns, this makes sense now. Would it make sense to mask known introns in the reference sequence to make the D/N assignment less arbitrary? I just quickly looked for a paper on indel sizes and this one: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557762/ mentions indels of up to 9989 in size, so if I understand you correctly, in case of the STAR default value any of the indels above 21 bases would wrongly be considered an intron?
That's a reasonable approach. It should be noted that things are likely to function differently if one supplies a GTF file than if not. I would presume that if an annotation is available that STAR will look at that to determine possible splicing first, though you'd still be correct that any deletion (or Indel, as you pointed out) >= 21 bases should still be getting classified as a splicing event by default. One possible way around that would be to somehow specify that only annotated exon boundaries are allowed (STAR probably has an option for that already). Realistically speaking though, if people are really interested in finding indels they should probably just sequence the DNA rather than RNA. Then any apparent even like this will be a deletion and splicing events wouldn't ever occur.