I'm working with BEDPE files from 10x Genomics reads and I can't wrap my head around the meaning of the BEDPE format columns, specifically start1/end1 and start2/end2. Most of the structural variant calls have start and end coordinates that differ by only 1 bp, which makes sense intuitively. But some have a start2/end2 separated by several dozen or even a hundred bp, which I don't really understand. If an inversion takes place, for example, then the "feature" coordinates described in these columns would be the breakpoint locations, but shouldn't the breakpoints only occur in 2 places relative to the reference genome? Why are there 4 values for breakpoint locations?
I can't say for certain whether your BEDPE data works the same way, but it reminded me of how VCF represents structural variants using the "breakend specification", which also has 4 points on the reference genome for an inversion: https://samtools.github.io/hts-specs/VCFv4.3.pdf
See section 5.4.7
Yes, thank you. My assumption was that the BEDPE coordinates represent a region of "likelihood" for a breakpoint, or something similar. I saw an answer in a thread from years ago stating that it was a confidence interval, but I couldn't find any documentation to confirm that. I guess I'll just continue to operate on that assumption.
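For anyone wanting to check their own file, here is a minimal Python sketch that reads the first six BEDPE columns (chrom1, start1, end1, chrom2, start2, end2) and reports how wide each breakend interval is. The record below is made up for illustration, and the names `Bedpe`, `parse_bedpe_line` and `interval_widths` are hypothetical helpers, not part of any 10x tool. Reading a wide interval as positional uncertainty is the unconfirmed assumption discussed above, not documented behavior.

```python
from typing import NamedTuple, Tuple


class Bedpe(NamedTuple):
    """First six columns of a BEDPE record (0-based, half-open intervals)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def parse_bedpe_line(line: str) -> Bedpe:
    """Parse one tab-separated BEDPE line, ignoring any columns after the sixth."""
    f = line.rstrip("\n").split("\t")
    return Bedpe(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), int(f[5]))


def interval_widths(rec: Bedpe) -> Tuple[int, int]:
    """Return the width in bp of the first and second breakend intervals."""
    return rec.end1 - rec.start1, rec.end2 - rec.start2


# Made-up record: a 1 bp first interval and a 120 bp second interval.
example = "chr1\t1000\t1001\tchr1\t5000\t5120\tcall_1\t30\t+\t-"
rec = parse_bedpe_line(example)
w1, w2 = interval_widths(rec)

# ASSUMPTION (unconfirmed): a width above 1 bp means the caller could only
# place that breakend somewhere inside the interval.
print(f"breakend 1 width: {w1} bp, breakend 2 width: {w2} bp")
```

Running this over a whole file and tabulating the widths should show how many calls are placed to a single base versus spread over a wider interval.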