separating a BED file into separate files for every 1,000,000 base pairs
7.2 years ago

I have a really large BED file (24 GB) and I want to separate it into different files for every million base pairs. I have done it the pure programming way, but this is really slow.

python

import sys

# read the whole BED file into memory (slow for a 24 GB file)
with open(sys.argv[1]) as f:
    lines = f.readlines()

chrom = []   # 'chr' shadows a Python builtin, so renamed
start = []
end = []
for a in lines:
    a = a.rstrip().split()
    chrom.append(a[0])
    start.append(int(a[1]))
    end.append(int(a[2]))

counter = 0
i_s = 0
while i_s < len(start):
    i_e = i_s
    while i_e < len(end):
        # flush the current block once it spans more than 1 Mb
        if end[i_e] - start[i_s] > 1000000:
            counter += 1
            stop = max(i_e, i_s + 1)  # write at least one line so the loop always advances
            with open(str(counter) + 'OI.txt', 'w') as regions:
                for i in range(i_s, stop):
                    regions.write(chrom[i] + '\t' + str(start[i]) + '\t' + str(end[i]) + '\n')
            i_s = stop - 1  # next block begins at `stop` after the increment below
            break
        i_e += 1
    i_s += 1
# note: a final block spanning less than 1 Mb is never written out

R

args <- commandArgs(TRUE)
library(readr)

df <- read_delim(args[1], delim = '\t', col_names = FALSE)

i_s <- 1
counter <- 1
while (i_s < nrow(df)) {
  i_e <- i_s
  while (i_e < nrow(df)) {
    # flush the current block once it spans more than 1 Mb
    if (df[[3]][i_e] - df[[2]][i_s] > 1000000) {
      stop_row <- max(i_e, i_s + 1)  # write at least one row so the loop always advances
      write.table(df[i_s:(stop_row - 1), ], paste0(counter, 'OI.txt'),
                  sep = '\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
      counter <- counter + 1
      i_s <- stop_row - 1  # next block begins at stop_row after the increment below
      break
    }
    i_e <- i_e + 1
  }
  i_s <- i_s + 1
}

Both of these approaches are extremely slow. Does anyone know of a faster way to achieve this?

Assembly

In the Python script, you are opening the output file over and over until the while loop ends. Put the open-file statement above the second while loop; it will increase speed. Also, the input file is 24 GB, so it will take some time.

To increase speed further, try to reduce the nested while loops or use a parallel approach.
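
For illustration, here is a minimal single-pass sketch of that idea. This is hypothetical code, not your script: it assumes the three-column chrom/start/end layout from the question, streams the file line by line, and rolls to a new output file whenever the current block spans more than 1 Mb.

import sys

# hypothetical sketch: stream the BED file once instead of loading it all,
# keeping a single output handle open and rolling to a new file whenever
# the current block spans more than 1 Mb
counter = 0
block_start = None
out = None
with open(sys.argv[1]) as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if out is None or end - block_start > 1000000:
            if out is not None:
                out.close()
            counter += 1
            out = open(str(counter) + 'OI.txt', 'w')
            block_start = start
        out.write(chrom + '\t' + str(start) + '\t' + str(end) + '\n')
if out is not None:
    out.close()

Because it never holds more than one line in memory, it should get through the 24 GB file in a single pass.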

7.2 years ago

This is my own example, which may or may not be of use to you:

awk 'BEGIN {
  chr = "chr1"; count = 1
  file = chr "Block" count ".bed"
}
{
  # accumulate region lengths; start a new output file whenever the running
  # total reaches 1 Mb, the chromosome changes, or a 1 Mb gap is crossed
  sum += (($3-$2)+1)
  if (sum < 1000000 && $1 == chr && (($3-previouspos)+1) < 1000000) { print >> file }
  if (sum >= 1000000 || $1 != chr || (($3-previouspos)+1) >= 1000000) {
    chr = $1; count += 1
    file = chr "Block" count ".bed"
    print >> file
    sum = (($3-$2)+1)
  }
  previouspos = $3
}
{ if ($1 != chr) count = 1 }' MyBed.bed

It breaks the file up into blocks of 1,000,000 bp, further split by chromosome, and outputs these to separate files.

The only thing you have to ensure is that the first entry begins with 'chr1'.


I see that my answer was accepted - thanks! However, the awk command above makes some critical assumptions.

Here is a much more robust version that does the following:

  • Sorts the BED file
  • Splits individual regions greater than 1 megabase into multiple regions
  • Breaks up the BED file into multiple BED files, each containing regions totalling 1 megabase or less in length


sort -k1,1 -k2,2n MyBed.bed |
awk '{
  if ((($3-$2)+1) > 1000000) {
    # split an individual region longer than 1 Mb into 1 Mb pieces
    numintervals = int(((($3-$2)+1)/1000000)+1)
    for (i = 1; i <= numintervals; ++i) {
      if (i == 1) {
        print $1"\t"$2"\t"($2+999999)
      } else if (i == numintervals) {
        print $1"\t"$2+((999999+1)*(i-1))"\t"$3
      } else {
        print $1"\t"$2+((999999+1)*(i-1))"\t"($2+((999999+1)*(i-1)))+999999
      }
    }
  } else {
    print
  }
}' |
awk 'BEGIN {
  chr = "chr1"; count = 1
  file = chr "Block" count ".bed"
}
{
  # accumulate region lengths; start a new output file whenever the running
  # total reaches 1 Mb, the chromosome changes, or a 1 Mb gap is crossed
  sum += (($3-$2)+1)
  if (sum < 1000000 && $1 == chr && (($3-previouspos)+1) < 1000000) { print >> file }
  if (sum >= 1000000 || $1 != chr || (($3-previouspos)+1) >= 1000000) {
    chr = $1; count += 1
    file = chr "Block" count ".bed"
    print >> file
    sum = (($3-$2)+1)
  }
  previouspos = $3
}
{ if ($1 != chr) count = 1 }'
7.2 years ago

Use BEDOPS bedextract to quickly split the BED file into separate chromosomes:

$ for chr in `bedextract --list-chr monolithic.bed`; do bedextract ${chr} monolithic.bed > ${chr}.bed; done

Once you have per-chromosome files, you can parallelize the work very easily.

Here is a Python script that applies a first-fit algorithm for binning elements up to some maximum size (1e8 bases in the example in the Gist, but you can specify this).
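
As a rough, hypothetical sketch of the first-fit idea (simplified; not the Gist script itself): each interval goes into the first bin whose running total stays under the maximum, and a new bin is opened when none has room.

# simplified, hypothetical first-fit sketch (not the Gist script):
# an interval goes into the first bin with enough remaining capacity;
# if no bin has room, a new bin is opened
max_size = int(1e8)
bins = []  # each bin is [total_bases, list_of_lines]

with open('chr1.bed') as bed:
    for line in bed:
        fields = line.split()
        length = int(fields[2]) - int(fields[1])
        for b in bins:
            if b[0] + length <= max_size:  # first bin with enough room
                b[0] += length
                b[1].append(line)
                break
        else:  # no existing bin fits: open a new one
            bins.append([length, [line]])

# write each bin's elements to its own numbered file: 1.bed, 2.bed, ...
for n, (total, lines) in enumerate(bins, start=1):
    with open('%d.bed' % n, 'w') as out:
        out.writelines(lines)

The last loop also shows one way to dump each bin's elements to a separate numbered BED file.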

Ideally you would put the work of binning each chromosome onto separate compute nodes.
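
For example, a hypothetical way to fan the per-chromosome files out across local cores with Python's multiprocessing (on a cluster, one scheduler submission per chromosome would replace this; the first_fit.py invocation is an assumption that mirrors the Gist usage shown below):

import glob
import subprocess
from multiprocessing import Pool

def bin_chromosome(path):
    # assumed invocation, with the same positional arguments
    # as the Gist example shown below
    subprocess.run(['python', 'first_fit.py', path, '100', '1e8'], check=True)

if __name__ == '__main__':
    # one worker per CPU core; each per-chromosome file is independent work
    with Pool() as pool:
        pool.map(bin_chromosome, sorted(glob.glob('chr*.bed')))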


Hi, thank you for this. In your GitHub example you input:

first_fit.py hg19.gapped_extents.bed 100 1e8

I have tried your example on my own data and it looks encouraging, but I am looking to write the lines within a certain base-pair range to separate files, and this script does not seem to do that. The 'bin' objects seem to be dictionaries with the chromosome as a key and the range as a value. Could you show me where in the code I could modify it so that the elements in each bin are written to a separate file? I.e., if bin1 has 20 elements, could each of those be written to a '1.bed', and the same for bin2, bin3, and so on?

Also, what difference does the input of 100 make? If I were to increase the bin size to 1000, would it make it faster?

