Question

Efficiently performing intersections for multiple subsets of a BED file

0

Entering edit mode

6.2 years ago

hersh ▴ 60

My goal is to find an efficient way to do multiple intersections very quickly. Here is my problem:

I am looking to do the equivalent operation to intersectBed -a file1.bed -b file2.bed except that I would like to do this for ~4000 different genes in file2.bed, and then store the results in a dictionary.

Currently I am trying do this using pybedtools, but am open to any solutions. In the pybedtools documentation, it says "In general, try to create as few BedTool objects as you can. Every time you create a BedTool object, you create a new open file. There is usually a BEDTools program that already does what you want, and will do it faster." (documentation). So instead of making bedtools inside a for loop, I made the follow type of loop:

file1 = pybedtools.BedTool('file1.bed')
file2 = pybedtools.BedTool('file2.bed')

for gene in genes:

    file2.filter(lambda bedtool : bedtool[3] == gene)

    dict[gene] = file1.intersect(file2)

Where I'm filtering for the coordinates with the gene ID, and then doing the intersection, and storing the results in a dictionary with the gene as a key, and the intersections as a value. This is super slow, and I think that there is probably a much better way to do this same task.

What is the recommended way for performing many intersections in a for loop, or getting an equivalent result without a loop?

Much appreciated!

sequence • 2.7k views

ADD COMMENT • link updated 6.2 years ago by Alex Reynolds 36k • written 6.2 years ago by hersh ▴ 60

Alex Reynolds · Answer 1 · 2018-09-11

0

Entering edit mode

6.2 years ago

Alex Reynolds 36k

It should be reasonably fast to do this on sort-bed-sorted inputs A.bed and B.bed:

$ for gene in `cat genes.txt`; do bedops --intersect A.bed <(awk -vgene=${gene} -vOFS="\t" -vFS="\t" '$4==gene' B.bed) | awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ print $0, gene }'  >> intersect.unsorted.bed; done
$ sort-bed intersect.unsorted.bed > intersect.sorted.bed

Gene IDs will be in the fourth column of intersect.sorted.bed.

You can read this file into a Python dictionary object, and then do stuff with that object, whatever that is:

#!/usr/bin/env python

import sys

"""
$ ./parseBed4.py < intersect.sorted.bed
"""

genes = {}
for line in sys.stdin:
  (chr, start, stop, id) = line.rstrip().split('\t')
  if id not in genes:
    genes[id] = []
  genes[id].append({'chr' : chr, 'start' : start, 'stop' : stop})

# do stuff with genes

Because intersect.sorted.bed is sorted, the records in genes[id] will be in sort order, also. This could be useful for binary searches, etc.

ADD COMMENT • link 6.2 years ago by Alex Reynolds 36k

0

Entering edit mode

Hey Alex thanks for your answer. I dowloaded bedops, and tried running the bash commands in your answer, but unfortunately I am getting the following error "awk: can't open file -vgene=cat source line number 1 awk: invalid -v option". I put the genes in a text file with one ID per line. Thanks, once I get this to work I hope that using the sorted option will make it faster.

ADD REPLY • link 6.2 years ago by hersh ▴ 60

0

Entering edit mode

Are you using the built-in awk on OS X? If so, you might use Homebrew to install GNU awk: brew install awk.

ADD REPLY • link 6.2 years ago by Alex Reynolds 36k

0

Entering edit mode

I installed awk using brew and now get "awk: can't open file -vgene=cat source line number 1 awk: syntax error at source line 1 context is >>> genes. <<< txt"

ADD REPLY • link 6.2 years ago by hersh ▴ 60

0

Entering edit mode

What happens if you run:

$ for gene in `cat genes.txt`; do echo ${gene}; done

Also, do you have any weird line endings in genes.txt? Perhaps copy and paste the output from running this so we can take a look:

$ cat -te genes.txt | head

ADD REPLY • link 6.2 years ago by Alex Reynolds 36k

0

Entering edit mode

XR_001781769.1$
XM_006520483.3$
NM_010601.3$
XM_006520484.2$
NM_016766.3$
NM_001164156.1$
XR_001781768.1$
XR_384436.3$
XR_384437.3$
XR_384438.3$

ADD REPLY • link updated 6.2 years ago by Alex Reynolds 36k • written 6.2 years ago by hersh ▴ 60

0

Entering edit mode

It prints the gene IDs without any weird endings

ADD REPLY • link 6.2 years ago by hersh ▴ 60

0

Entering edit mode

Can you do the same for A.bed and B.bed:

$ cat -te A.bed | head
...
$ cat -te B.bed | head
...

I'm reviewing things and I'm not sure what's wrong.

ADD REPLY • link 6.2 years ago by Alex Reynolds 36k

0

Entering edit mode

for A.bed:

chr15^I3057295^I3057329^IGGGCACTGAAACCCCATGATGCTGCTGTCAGCTTG^I46.85$ chr15^I3089214^I3089250^ITTTTCTCTGTGAGACCTGATAGGGATCCAAAGGAGGA^I42.31$ chr15^I3105599^I3105635^IGGACCTTAGCATGACTGTGCTACTCTCTGTTTAGTGC^I42.41$ chr15^I3130755^I3130794^ICATGTTTTGCAGTGTGAGGATGTGTTTAAGGTTACCTGTG^I42.19$ chr15^I3131618^I3131652^ICCCCACCTCAGCCTAATGAACTGGTCCTGAAGACT^I44.72$ chr15^I3132245^I3132279^IAAAATTGTGAGGCCATGTGGATGCACATGCACATG^I43.34$ chr15^I3133991^I3134026^ITGGCTGCTGTACTTCAGTTATGGCATCTCATCTCTG^I42.33$ chr15^I3136454^I3136492^IGAGTTGTGCTATGGTACAAGGGAAGAAAGTGAGCATTGT^I42.61$ chr15^I3138597^I3138632^ITTTTCTATATCGTGTTTGCTATGGAGGCACTGGGCT^I42.41$ chr15^I3139001^I3139035^ITGAAGGTAGTCCTTCCGGCGGTCTTCACGATACAG^I44.58$

for B.bed:

chr15^I99213319^I99213652^IXR_001781769.1_exon_0_0_chr15_99213320_r^I0^I-$ chr15^I99216165^I99216355^IXR_001781769.1_exon_1_0_chr15_99216166_r^I0^I-$ chr15^I99225146^I99225327^IXR_001781769.1_exon_2_0_chr15_99225147_r^I0^I-$ chr15^I99224359^I99225051^IXM_006520483.3_exon_0_0_chr15_99224360_f^I0^I+$ chr15^I99226307^I99226541^IXM_006520483.3_exon_1_0_chr15_99226308_f^I0^I+$ chr15^I99226980^I99227115^IXM_006520483.3_exon_2_0_chr15_99226981_f^I0^I+$ chr15^I99227890^I99228042^IXM_006520483.3_exon_3_0_chr15_99227891_f^I0^I+$ chr15^I99228472^I99228716^IXM_006520483.3_exon_4_0_chr15_99228473_f^I0^I+$ chr15^I99229058^I99229216^IXM_006520483.3_exon_5_0_chr15_99229059_f^I0^I+$ chr15^I99229305^I99229513^IXM_006520483.3_exon_6_0_chr15_99229306_f^I0^I+$

ADD REPLY • link 6.2 years ago by hersh ▴ 60

0

Entering edit mode

Okay, your input is a little different than I assumed.

Perhaps try a regular expression, instead:

$ for gene in `cat genes.txt`; do bedops --intersect A.bed <(awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ regex = "^"gene; if ($4 ~ regex) { print $0; } }' B.bed) | awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ print $0, gene }'  >> intersect.unsorted.bed; done

I tested the awk commands and they worked for me. Hope this helps in some way.

ADD REPLY • link 6.2 years ago by Alex Reynolds 36k