Question

Create a new bed file with all pairwise combinations between two other bed files, based on bp distance

0

Entering edit mode

7 months ago

J ▴ 10

Hello,

I would greatly appreciate some help with this issue I'm having. Thank you in advance!

Summarizing the question in a post title was tricky, so I apologize if the title is unclear. I'll explain in detail below. To start, I have two bed files, one with non-overlapping regions of varying sizes, and one with SNP positions:

$ cat A.bed
chr1 65564 70008
chr1 944693 959240
chr1 4245331 4266330
chr1 17146107 17167757

$ cat B.bed
chr1 15715
chr1 1412732
chr1 1612000
chr1 16524041

The first bed file is ~2000 rows long, while the SNP file is almost 300,000 rows long. What I need to obtain are all of the pairwise comparisons/combinations between each row of the first bed file and each row of the second bed file, and only print these combinations to a new bed file if the SNP is +/- 250,000 bp from the start of the region in A.bed. So, the third bed file, C.bed, would contain all of these SNP-region pairs, as long as the SNP is within the specified distance of that region (250,000 bp upstream of the start of the region and 250,000 bp downstream of the start). If the SNP isn't close to any region, it is skipped.

I will be carrying this out separately for each chromosome, so the contents of the first column don't matter for the comparison. The output, C.bed, should have 4 columns.

It would look like this:

$ cat C.bed
chr1 15715 65564 70008
chr1 1412732 944693 959240
chr1 1612000 944693 959240
chr1 16524041 17146107 17167757

I tried finding the intersection of these bed files using bedtools intersect, but I couldn't figure out the right combination of parameters to use! Would this be easier in R? Thank you.

-J

SNPs BED eqtl bedtools • 1.4k views

ADD COMMENT • link 6 months ago by J ▴ 10

0

Entering edit mode

using bcftools

Show us what you tried

ADD REPLY • link 7 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you for replying Pierre.

I first added a third column to the SNP bed file, so that I would get a range of 1 for each SNP

awk 'BEGIN {OFS="\t"}{$3=$2+1; print $0}' B.bed > B2.bed

Next, I tried using bedtools window to find overlapping features between A.bed and B2.bed, but couldn't figure out how to restrict this to +/- from the start of A.bed regions only, and not both start and end (which is the default behaviour).

bedtools window -a A.bed -b B2.bed -w 250000

ADD REPLY • link 7 months ago by J ▴ 10

0

Entering edit mode

Next, I tried using bedtools window

but in your original post you said

tried finding the intersection of these bed files using bedtools intersect

write the full command lines, unfortunately we don't read in minds.

ADD REPLY • link 7 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

The command lines I wrote out for you are the ones I actually used. My original reference to bedtools intersect was because I assumed the bedtools window command was a component of bedtools intersect. It was simply a mistake; I'm new to this. I don't expect anyone to read my mind, hence my effort to make my question as detailed as possible. Not sure why I'm being met with such a snarky remark...

ADD REPLY • link 7 months ago by J ▴ 10

1

Entering edit mode

7 months ago

Pierre Lindenbaum 164k

bedtools intersect \
          -a <(sort -t $'\t' -k1,1 -k2,2n A.bed) \
          -b <(awk '{X=250000;P=int($2);printf("%s\t%d\t%d\t%d\n",$1,(P<X?0:P-X),P+X,P);}' B.bed  |   sort -t $'\t' -k1,1 -k2,2n) \
          -wa -wb |\
        cut -f 1,2,3,7

chr1    65564   70008   15715

in your example the lines

chr1 1412732 944693 959240
chr1 1612000 944693 959240
chr1 16524041 17146107 17167757

have a distance > 250000

ADD COMMENT • link 7 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you Pierre! I'll try this out with my full data set.

ADD REPLY • link 7 months ago by J ▴ 10

score 2 · Accepted Answer · 2024-04-26

I think Pierre's answer might be correct for your test input, but it doesn't account for the start position of the reference element and so might not be correct for all test or real-world inputs.

It may also be necessary to think about the case where both an upstream and downstream SNP are within your distance threshold for a given reference element. I don't think bedtools knows which one to pick?

Here is a BEDOPS-based answer that I think covers the case where the SNP and its distance is thresholded as measured from the start position, and the closest SNP is retrieved in the case where there are two hits.

But, first, this is not a BED file:

$ cat B.bed
chr1 15715
chr1 1412732
chr1 1612000
chr1 16524041

It is probably best to convert it to one, in order to work with it as a zero-based, half-indexed BED file:

$ awk -vFS="\t" -vOFS="\t" '{ print $1, ($2 - 1), $2 }' B.txt | sort-bed - > B.bed

This is not being pedantic. Formats and sort order are important when using BEDOPS or other BED set operation tools. What you are trying to do with these tools will not work if you start with bad inputs.

Once you do this step, you can use BEDOPS closest-features with A.bed and B.bed.

To start, here's what typical output looks like:

$ closest-features --dist A.bed B.bed
chr1    65564   70008|chr1  15714   15715|-49850|chr1   1412731 1412732|1342724
chr1    944693  959240|chr1 15714   15715|-928979|chr1  1412731 1412732|453492
chr1    4245331 4266330|chr1    1611999 1612000|-2633332|chr1   16524040    16524041|12257711
chr1    17146107    17167757|chr1   16524040    16524041|-622067|NA|NA

This tool reports the reference element (elements in A.bed) and each of the nearest SNPs from B.bed upstream and downstream of the bounds of the reference element (not the start position, but the start and end position), along with their respective signed distances from bound edges.

So there's enough information here to answer your question as written, but some additional awk work is necessary to process the start position test correctly, and deal with the case of multiple hits:

$ closest-features --dist A.bed B.bed \
    | awk -vFS="|" -vOFS="\t" -vTHRESHOLD=250000 \
        '{ \
            ref = $1; \
            split(ref, refarr, "\t"); \
            reflen = refarr[3] - refarr[2]; \
            lmap = $2; \
            ldist = -$3; \
            rmap = $4; \
            rdist = (rmap == "NA") ? 0 : $5 - reflen; \
            if (lmap == "NA") { \
                mindist = rdist; \
                minelem = rmap; \
            } \
            else if (rmap == "NA") { \
                mindist = ldist; \
                minelem = lmap; \
            } \
            else { \
                mindist = (ldist < rdist) ? ldist : rdist; \
                minelem = (ldist < rdist) ? lmap : rmap; \
            } \
            if (mindist <= THRESHOLD) { \
                print ref, minelem, mindist; \
            } \
        }' > answer.bed

Here's what my answer looks like:

$ cat answer.bed
chr1    65564   70008   chr1    15714   15715   49850

This can be munged into a text file with your format (not BED, though):

$ awk -vFS="\t" -vOFS="\t" '{ print $1, $6, $2, $3 }' answer.bed > answer.reformatted.txt
$ cat answer.reformatted.txt
chr1    15715   65564   70008

Note again that when two SNPs are candidates for a given region, this awk approach picks the closer of the two. You might have instead wanted the SNP further away (though still within the threshold), or you might have instead wanted to pick a SNP at random, in case of statistical analyses downstream. Something to think about before you use these results, perhaps.