BEDPE format explanation.

3.3 years ago

milesandersonmn ▴ 20

I'm working with BEDPE files from 10x Genomics reads and I can't wrap my head around the meaning of the BEDPE format columns, specifically start1/end1 and start2/end2. Most of the structural variants that are called have start and end coordinates that differ by only 1 bp, which makes sense intuitively. But some have a start2/end2 separated by several dozen or even a hundred bp, which I don't understand. If an inversion takes place, for example, then the "feature" coordinates described in these columns would be the breakpoint locations, but shouldn't the breakpoints occur in only 2 places relative to the reference genome? Why are there 4 values for breakpoint locations?

structural variants • 1.8k views

3.2 years ago by milesandersonmn ▴ 20

I can't say for certain whether this applies to your BEDPE data, but it reminded me of how VCF represents structural variants using the "breakend specification", which also has 4 points on the reference genome for an inversion: https://samtools.github.io/hts-specs/VCFv4.3.pdf

3.2 years ago by cmdcolin ★ 4.3k

See section 5.4.7

3.2 years ago by cmdcolin ★ 4.3k

Yes, thank you. My assumption was that the BEDPE coordinates represent a region of "likelihood" for a breakpoint, or something similar. I saw one answer in a thread from years ago stating that it was a confidence interval, but I couldn't find any documentation to confirm that. I guess I'll continue to operate on that assumption.
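For anyone checking the same thing in their own files, here is a minimal Python sketch that reports how wide each of the two intervals is per record. It assumes the first seven tab-separated columns follow the usual bedtools-style BEDPE layout (chrom1, start1, end1, chrom2, start2, end2, name) with 0-based, half-open coordinates; I have not confirmed that every 10x output matches this exactly. The helper names and the two example records are made up for illustration, and reading a wide interval as breakpoint uncertainty is still only the assumption discussed above, not something the code verifies.

```python
from typing import NamedTuple, Tuple


class BedpeRecord(NamedTuple):
    """First seven BEDPE columns (assumed bedtools-style layout)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str


def parse_bedpe_line(line: str) -> BedpeRecord:
    """Parse one tab-separated BEDPE line; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    name = fields[6] if len(fields) > 6 else "."
    return BedpeRecord(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        name,
    )


def interval_widths(rec: BedpeRecord) -> Tuple[int, int]:
    """Return (end1 - start1, end2 - start2) in bp for half-open intervals."""
    return rec.end1 - rec.start1, rec.end2 - rec.start2


# Hypothetical records: one with two 1 bp intervals, one with a wide second interval.
example_lines = [
    "chr1\t1000\t1001\tchr1\t5000\t5001\tcall_1",
    "chr1\t20000\t20001\tchr1\t34950\t35050\tcall_2",
]

for line in example_lines:
    rec = parse_bedpe_line(line)
    width1, width2 = interval_widths(rec)
    print(rec.name, width1, width2)
```

Records where either width is greater than 1 are the ones whose second (or first) breakend is reported as a range rather than a single position.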

3.2 years ago by milesandersonmn ▴ 20
