My goal is to find an efficient way to do multiple intersections very quickly. Here is my problem:
I am looking to do the equivalent operation to intersectBed -a file1.bed -b file2.bed
except that I would like to do this for ~4000 different genes in file2.bed, and then store the results in a dictionary.
Currently I am trying do this using pybedtools, but am open to any solutions. In the pybedtools documentation, it says "In general, try to create as few BedTool objects as you can. Every time you create a BedTool object, you create a new open file. There is usually a BEDTools program that already does what you want, and will do it faster." (documentation). So instead of making bedtools inside a for loop, I made the follow type of loop:
file1 = pybedtools.BedTool('file1.bed')
file2 = pybedtools.BedTool('file2.bed')
for gene in genes:
file2.filter(lambda bedtool : bedtool[3] == gene)
dict[gene] = file1.intersect(file2)
Where I'm filtering for the coordinates with the gene ID, and then doing the intersection, and storing the results in a dictionary with the gene as a key, and the intersections as a value. This is super slow, and I think that there is probably a much better way to do this same task.
What is the recommended way for performing many intersections in a for loop, or getting an equivalent result without a loop?
Much appreciated!
Hey Alex thanks for your answer. I dowloaded bedops, and tried running the bash commands in your answer, but unfortunately I am getting the following error "awk: can't open file -vgene=cat source line number 1 awk: invalid -v option". I put the genes in a text file with one ID per line. Thanks, once I get this to work I hope that using the sorted option will make it faster.
Are you using the built-in
awk
on OS X? If so, you might use Homebrew to install GNU awk:brew install awk
.I installed awk using brew and now get "awk: can't open file -vgene=cat source line number 1 awk: syntax error at source line 1 context is >>> genes. <<< txt"
What happens if you run:
Also, do you have any weird line endings in
genes.txt
? Perhaps copy and paste the output from running this so we can take a look:It prints the gene IDs without any weird endings
Can you do the same for
A.bed
andB.bed
:I'm reviewing things and I'm not sure what's wrong.
for A.bed:
chr15^I3057295^I3057329^IGGGCACTGAAACCCCATGATGCTGCTGTCAGCTTG^I46.85$ chr15^I3089214^I3089250^ITTTTCTCTGTGAGACCTGATAGGGATCCAAAGGAGGA^I42.31$ chr15^I3105599^I3105635^IGGACCTTAGCATGACTGTGCTACTCTCTGTTTAGTGC^I42.41$ chr15^I3130755^I3130794^ICATGTTTTGCAGTGTGAGGATGTGTTTAAGGTTACCTGTG^I42.19$ chr15^I3131618^I3131652^ICCCCACCTCAGCCTAATGAACTGGTCCTGAAGACT^I44.72$ chr15^I3132245^I3132279^IAAAATTGTGAGGCCATGTGGATGCACATGCACATG^I43.34$ chr15^I3133991^I3134026^ITGGCTGCTGTACTTCAGTTATGGCATCTCATCTCTG^I42.33$ chr15^I3136454^I3136492^IGAGTTGTGCTATGGTACAAGGGAAGAAAGTGAGCATTGT^I42.61$ chr15^I3138597^I3138632^ITTTTCTATATCGTGTTTGCTATGGAGGCACTGGGCT^I42.41$ chr15^I3139001^I3139035^ITGAAGGTAGTCCTTCCGGCGGTCTTCACGATACAG^I44.58$
for B.bed:
chr15^I99213319^I99213652^IXR_001781769.1_exon_0_0_chr15_99213320_r^I0^I-$ chr15^I99216165^I99216355^IXR_001781769.1_exon_1_0_chr15_99216166_r^I0^I-$ chr15^I99225146^I99225327^IXR_001781769.1_exon_2_0_chr15_99225147_r^I0^I-$ chr15^I99224359^I99225051^IXM_006520483.3_exon_0_0_chr15_99224360_f^I0^I+$ chr15^I99226307^I99226541^IXM_006520483.3_exon_1_0_chr15_99226308_f^I0^I+$ chr15^I99226980^I99227115^IXM_006520483.3_exon_2_0_chr15_99226981_f^I0^I+$ chr15^I99227890^I99228042^IXM_006520483.3_exon_3_0_chr15_99227891_f^I0^I+$ chr15^I99228472^I99228716^IXM_006520483.3_exon_4_0_chr15_99228473_f^I0^I+$ chr15^I99229058^I99229216^IXM_006520483.3_exon_5_0_chr15_99229059_f^I0^I+$ chr15^I99229305^I99229513^IXM_006520483.3_exon_6_0_chr15_99229306_f^I0^I+$
Okay, your input is a little different than I assumed.
Perhaps try a regular expression, instead:
I tested the
awk
commands and they worked for me. Hope this helps in some way.