The SV position you report depends on the format you report in. I strongly recommend VCF as other formats such as BED and BEDPE are ambiguous for exactly the reason you raise. The encoding of variants into VCF records is covered in the Variant Call Format (VCF) specifications document at https://samtools.github.io/hts-specs/VCFv4.3.pdf. Section 5 is the section you will be interested in.
In your particular example, the variant supported by your split read depends on which, if any, of your split read alignments was aligned to the negative strand, and on the relative read position of the split alignments (i.e. which end is soft/hard clipped). Your read may well support a tandem duplication of bases chr1:100-500 inclusive if the CIGAR of your primary alignment is 200S100M and your supplementary is 200M100H.
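To make that concrete, here is a rough Python sketch of the reasoning, not a real SV caller. It assumes both alignments are on the same chromosome and the same strand, ignores indels near the breakpoint, microhomology and inserted sequence, and does not handle alignments that abut exactly. The names `aligned_segment` and `classify_same_strand` and the POS values 100 and 301 are made up for illustration (301 is simply what makes a 200M alignment end at 500).

```python
import re
from typing import NamedTuple, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class Segment(NamedTuple):
    read_start: int  # 0-based offset of the first aligned base, in alignment orientation
    read_end: int    # exclusive end offset
    ref_start: int   # 1-based first reference base covered
    ref_end: int     # 1-based last reference base covered (inclusive)


def aligned_segment(pos: int, cigar: str) -> Segment:
    """Locate an alignment on both the read and the reference."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    lead = 0
    for n, op in ops:  # soft or hard clips before the first aligned base
        if op not in "SH":
            break
        lead += n
    query_len = sum(n for n, op in ops if op in "MI=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return Segment(lead, lead + query_len, pos, pos + ref_len - 1)


def classify_same_strand(a: Segment, b: Segment) -> Tuple[str, Tuple[int, int]]:
    """Offsets are only comparable when both alignments share a strand."""
    first, second = sorted([a, b])  # order the two pieces by position in the read
    if second.ref_start > first.ref_end:
        # The read skips forward on the reference: the gap looks deleted.
        return "deletion-like", (first.ref_end + 1, second.ref_start - 1)
    # The read jumps backwards on the reference: the spanned interval looks duplicated.
    return "duplication-like", (second.ref_start, first.ref_end)


primary = aligned_segment(100, "200S100M")        # last 100 bases of the read
supplementary = aligned_segment(301, "200M100H")  # first 200 bases of the read
print(classify_same_strand(primary, supplementary))
```

With those assumed positions the first 200 bases of the read end at 500 and the remaining 100 bases start again at 100, which is why the read is consistent with a tandem duplication of 100-500.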
I should have been more explicit with the example. I was referring to a deletion on the positive strand for simplicity.
I get the tandem dup organization, but do you mind clarifying a bit on:
Your read may well support a tandem duplication of base chr1:100-500 inclusive if the CIGAR of your primary alignment is 200S100M and your supplementary is 200M100H.
Why would a tandem dup have 200S100M for the primary and the reciprocal for the supplementary? I thought CIGAR strings were for the alignment, not the entire read (primary+supplementary).
Edit: I think I figured it out. Here's an example of some split reads that support a tandem dup:
104M46S SA:Z:8,2680602,-,100S50M,60,0;
83M67S SA:Z:8,2680602,-,79S71M,60,0;
72M78H SA:Z:8,2680602,+,68S82M,60,0;
The first two have soft clips in both alignments, so the clipped bases are included in the SEQ.
In the third example the primary has a hard clip (so those bases are not in the SEQ), but the supplementary is soft clipped. For tandem dups you should always expect the event to run from the min left position to the max right position of the split-read alignments. For deletions it's the max left and min right.
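If it helps anyone reading along, the SA values above can be pulled apart with a few lines of Python. This is only a sketch: the field order (rname, pos, strand, CIGAR, mapQ, NM, with each entry terminated by a semicolon) follows the SAM optional fields specification, but the `SupplementaryAlignment` and `parse_sa_tag` names are my own, and it assumes reference names contain no semicolons.

```python
from typing import List, NamedTuple


class SupplementaryAlignment(NamedTuple):
    rname: str
    pos: int      # 1-based leftmost position
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int       # edit distance


def parse_sa_tag(value: str) -> List[SupplementaryAlignment]:
    """Split an SA tag into one record per listed alignment."""
    if value.startswith("SA:Z:"):
        value = value[len("SA:Z:"):]
    records = []
    for entry in value.split(";"):
        if not entry:  # the trailing ';' leaves an empty final piece
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(
            SupplementaryAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return records


for tag in ("SA:Z:8,2680602,-,100S50M,60,0;",
            "SA:Z:8,2680602,-,79S71M,60,0;",
            "SA:Z:8,2680602,+,68S82M,60,0;"):
    print(parse_sa_tag(tag))
```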
Thanks!
Edit edit: The answer for a deletion given the example above would be chr1-201-299
I was referring to a deletion on the positive strand for simplicity.
Terminology can be problematic for SVs. A deletion is not 'on the positive strand' - both strands are retained or deleted.
I thought CIGAR strings were for the alignment not the entire read (primary+supplementary)
I used bwa-style split read alignments. The primary alignments are soft clipped, and the supplementary alignments are hard clipped. Hard clipped bases are not included in the SEQ field of the record but are still counted in the CIGAR. Technically, the aligner is free to do what it wants, which is really annoying when performing downstream analysis.
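A quick way to see the soft/hard clip difference is to compute both lengths from the CIGAR. The sketch below is only illustrative (`cigar_lengths` is a made-up helper name): SEQ length counts M, I, S, = and X, while the original read length also counts H.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar: str):
    """Return (length of SEQ in this record, length of the original read)."""
    seq_len = 0
    read_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":    # bases present in SEQ
            seq_len += n
            read_len += n
        elif op == "H":      # bases of the read that were dropped from SEQ
            read_len += n
    return seq_len, read_len


# CIGARs taken from the split-read examples earlier in the thread
for cigar in ("104M46S", "100S50M", "72M78H", "68S82M"):
    print(cigar, cigar_lengths(cigar))
```

Every CIGAR in those examples adds up to a 150 bp read, but the hard-clipped record (72M78H) only carries 72 bases in SEQ.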