How to Define Structural Variation Breakpoint Positions?
7.5 years ago

I want to limit the discussion to split-reads.

Say I have a split-read with the primary alignment as chr1:100-200 and the supplementary alignment as chr1:300-500.

Is the SV position I report chr1:200-300 or chr1:201-299?

Or something else? I know there is a convention for this, and I wouldn't be asking if I hadn't tried googling it.

Thanks!

Structural Variation SV Breakpoint • 2.2k views
7.5 years ago
d-cameron ★ 2.9k

The SV position you report depends on the format you report in. I strongly recommend VCF as other formats such as BED and BEDPE are ambiguous for exactly the reason you raise. The encoding of variants into VCF records is covered in the Variant Call Format (VCF) specifications document at https://samtools.github.io/hts-specs/VCFv4.3.pdf. Section 5 is the section you will be interested in.

In your particular example, the variant supported by your split read depends on which, if any, of your split read alignments were aligned to the negative strand, and on the relative read position of the split alignments (i.e. which end is soft/hard clipped). Your read may well support a tandem duplication of bases chr1:100-500 inclusive if the CIGAR of your primary alignment is 200S100M and your supplementary is 200M100H.
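To make the "which end is clipped" point concrete, here is a minimal Python sketch that works out which part of the original read each alignment covers. It assumes both alignments are on the forward strand (for a reverse-strand alignment the CIGAR runs along the reverse-complemented read, so the offsets would need flipping), and the function name `read_interval` is just a hypothetical helper, not part of any library.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def read_interval(cigar):
    """Return the 0-based, half-open interval of the original read that is
    covered by the aligned (non-clipped) part of a forward-strand alignment."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Leading soft/hard clips are read bases that come before the aligned part.
    start = 0
    for n, op in ops:
        if op in "SH":
            start += n
        else:
            break
    # M, I, = and X consume read bases inside the aligned part.
    aligned = sum(n for n, op in ops if op in "MI=X")
    return start, start + aligned

primary = read_interval("200S100M")        # aligned at chr1:100-200
supplementary = read_interval("200M100H")  # aligned at chr1:300-500
print(primary, supplementary)
```

Here the supplementary covers the start of the read but maps downstream (chr1:300-500), while the primary covers the end of the read but maps upstream (chr1:100-200). Read order being the reverse of reference order is what makes this consistent with a tandem duplication rather than a deletion.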

I should have been more explicit with the example. I was referring to a deletion on the positive strand for simplicity.

I get the tandem dup organization but do you mind clarifying a bit on

Your read may well support a tandem duplication of bases chr1:100-500 inclusive if the CIGAR of your primary alignment is 200S100M and your supplementary is 200M100H.

Why would a tandem dup have 200S100M for the primary and the reciprocal for the supplementary? I thought CIGAR strings were for the alignment, not the entire read (primary+supplementary).

Edit: I think I figured it out. Here's an example of some split-reads that support a tandem dup

  • 104M46S SA:Z:8,2680602,-,100S50M,60,0;
  • 83M67S SA:Z:8,2680602,-,79S71M,60,0;
  • 72M78H SA:Z:8,2680602,+,68S82M,60,0;

The first two are soft clipped in both the primary and the supplementary alignment, so the clipped bases are included in SEQ. In the third example the primary is hard clipped (so those bases are not in its SEQ), while the supplementary is soft clipped. For tandem dups, you should always expect the matched stretch of sequence to run from the min left position to the max right position of the split-read alignments. For deletions, it's the max left and the min right.
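For anyone wanting to pull these fields out programmatically, here is a rough Python sketch of splitting an SA tag into its parts. The field order (rname, pos, strand, CIGAR, mapQ, NM, with entries separated by semicolons) follows the SAM optional-fields specification; the `SuppAlignment` type and `parse_sa` function are hypothetical names made up for this illustration.

```python
from typing import List, NamedTuple

class SuppAlignment(NamedTuple):
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance

def parse_sa(tag: str) -> List[SuppAlignment]:
    """Parse an SA tag (with or without the 'SA:Z:' prefix) into records."""
    value = tag[len("SA:Z:"):] if tag.startswith("SA:Z:") else tag
    records = []
    for entry in value.split(";"):
        if not entry:  # the tag ends with ';', leaving an empty last piece
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(
            SuppAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return records

for sa in parse_sa("SA:Z:8,2680602,-,100S50M,60,0;"):
    print(sa.rname, sa.pos, sa.strand, sa.cigar)
```

Note that the strand field matters for interpretation: a "-" entry means the CIGAR describes the reverse-complemented read.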

Thanks!

Edit edit: The answer for a deletion given the example above would be chr1:201-299.
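As a small worked version of that arithmetic, here is a Python sketch that turns the two alignment coordinates into the deleted interval and the corresponding VCF-style fields. It assumes 1-based inclusive coordinates, a clean junction with no microhomology or inserted sequence, and my reading of the VCF v4.3 convention that POS is the base immediately before the deleted bases; `deletion_from_split` is a hypothetical helper name.

```python
def deletion_from_split(upstream_end, downstream_start):
    """upstream_end: last reference base aligned by the upstream segment.
    downstream_start: first reference base aligned by the downstream segment.
    Both are 1-based and inclusive."""
    if downstream_start <= upstream_end + 1:
        raise ValueError("no gap between the alignments; not a simple deletion")
    first_deleted = upstream_end + 1
    last_deleted = downstream_start - 1
    return {
        "deleted": (first_deleted, last_deleted),
        "POS": upstream_end,  # base before the event (assumed VCF convention)
        "END": last_deleted,
        "SVLEN": -(last_deleted - first_deleted + 1),  # negative for deletions
    }

call = deletion_from_split(200, 300)
print(call)
```

For the example in the question this gives deleted bases 201-299, which in VCF terms would be POS=200, END=299, SVLEN=-99.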

I was referring to a deletion on the positive strand for simplicity.

Terminology can be problematic for SVs. A deletion is not 'on the positive strand' - both strands are retained or deleted.

I thought CIGAR strings were for the alignment, not the entire read (primary+supplementary)

I used bwa-style split read alignments. The primary alignments are soft clipped, and the supplementary alignments are hard clipped. Hard clipped bases are not included in the record's SEQ but are included in the CIGAR. Technically, the aligner is free to do what it wants, which is really annoying when performing downstream analysis.
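A quick way to see the soft/hard clip difference is to compute, from the CIGAR alone, how many bases the record's SEQ should hold versus how long the original read was. This is only a sketch under the SAM rule that M, I, S, = and X consume SEQ bases while H does not; `seq_and_read_length` is a hypothetical helper name.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def seq_and_read_length(cigar):
    """Return (bases stored in SEQ, length of the original read)."""
    seq_len = 0
    read_len = 0
    for n, op in CIGAR_RE.findall(cigar):
        n = int(n)
        if op in "MIS=X":   # present in SEQ and in the original read
            seq_len += n
            read_len += n
        elif op == "H":     # in the original read, but trimmed from SEQ
            read_len += n
    return seq_len, read_len

print(seq_and_read_length("72M78H"))  # hard-clipped record
print(seq_and_read_length("68S82M"))  # soft-clipped record
```

Both CIGARs describe the same 150-base read, but only the soft-clipped record carries all 150 bases in SEQ.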
