Is It Possible To Filter Only Bookend Reads From A Bed File?
2
0
Entering edit mode
10.9 years ago
Duarte Molha ▴ 240

I have a bed file with many fragments, some overlapping, some on their own and some adjacent to each other (book-ended) features.

I know can group overlapping and book-ended features using bedtools like

bedtools cluster -i fragments.bed

However I was wondering if anyone knew of a way of obtaining from the input file only the fragments that contain book-ended adjacent fragments.

Any ideas?

Best regards

bedtools filtering scripting • 3.7k views
ADD COMMENT
3
Entering edit mode
10.9 years ago

You can use the closest tool in bedtools, ignore overlapping features (-io) and require the distance (-d) to be 1. I would recommend using verion 2.18.2 or later.

$ cat i.bed
chr1    10    20    a
chr1    22    24    b
chr1    24    26    c
chr1    26    28    d
chr1    40    50    e
chr1    60    70    f
chr1    65    75    g
chr1    75    80    h

$ bedtools closest -a i.bed -b i.bed -io -d | awk '$9==1'
chr1    22    24    b    chr1    24    26    c    1
chr1    24    26    c    chr1    22    24    b    1
chr1    24    26    c    chr1    26    28    d    1
chr1    26    28    d    chr1    24    26    c    1
chr1    65    75    g    chr1    75    80    h    1
chr1    75    80    h    chr1    65    75    g    1

If you want a non-redundant set, just do the following:

$ bedtools closest -a i.bed -b i.bed -io -d | awk '$9==1' | awk '$2<$6'
chr1    22    24    b    chr1    24    26    c    1
chr1    24    26    c    chr1    26    28    d    1
chr1    65    75    g    chr1    75    80    h    1
ADD COMMENT
1
Entering edit mode
10.9 years ago
sjneph ▴ 690

This is pretty straightforward with BEDOPS:

sort-bed file.bed > file.sort.bed
closest-features --no-overlaps --dist --closest file.sort.bed file.sort.bed | awk -F'|' '$3*$3==1'

You basically ask what elements are closest 'to the left' and 'to the right' of every element in the file using an edge-2-edge measure (by running closest-features on the file against itself). If the distance is 1 or -1, then you have found an element that you're looking for. The output is still sorted per sort-bed so you can use immediately in further downstream BEDOPS operations.

Perhaps the only slightly tricky part is in using --no-overlaps. This ensures that elements declared closest 'to the left/right' are disjoint (which includes immediately adjacent elements) from the reference element. The awk math trick is needed because there is no built-in absolute value function in awk and closest-features returns the signed distance to the nearest element.

$ cat file.sort.bed
chr1    10  20  h
chr1    10  25  i
chr1    20  25  j
chr1    25  50  k
chr1    30  35  l
chr1    31  45  m
chr1    32  46  n

$ closest-features --no-overlaps --dist --closest file.sort.bed file.sort.bed | awk -F'|' '$3*$3==1'
chr1    10  20  h|chr1  20  25  j|1
chr1    10  25  i|chr1  25  50  k|1
chr1    20  25  j|chr1  10  20  h|-1
chr1    25  50  k|chr1  20  25  j|-1

since we used --closest, only the single nearest element is chosen, with ties 'going to the left'. It's also worth noting that closest-features has essentially no memory overhead and it's wicked fast. Use it with any size inputs without issue.

ADD COMMENT

Login before adding your answer.

Traffic: 1726 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6