I have a really large BED file (24 GB) and I want to split it into separate files, one for every million base pairs. I have done it the pure programming way, but this is really slow.
Python:
import sys

# read the whole BED file into memory (this alone is costly for a 24 GB file)
with open(sys.argv[1]) as f:
    lines = f.readlines()

start = []
end = []
chrom = []
for a in lines:
    a = a.rstrip().split()
    chrom.append(a[0])
    start.append(int(a[1]))
    end.append(int(a[2]))

counter = 0
i_s = 0
while i_s < len(start):
    i_e = i_s
    while i_e < len(end):
        # once the window spans more than 1 Mb, write it to its own file
        if end[i_e] - start[i_s] > 1000000:
            counter += 1
            regions = open(str(counter) + 'OI.txt', 'a')
            for i in range(i_s, i_e):
                regions.write(chrom[i] + '\t' + str(start[i]) + '\t' + str(end[i]) + '\n')
            regions.close()
            i_s = i_e
            break
        i_e += 1
    i_s += 1
R:
args <- commandArgs(TRUE)
library(readr)

df <- read_delim(args[1], delim = '\t', col_names = FALSE)

i_s <- 1
counter <- 1
while (i_s < nrow(df)) {
    i_e <- i_s
    while (i_e < nrow(df)) {
        # once the window spans more than 1 Mb, write it to its own file
        if (df[[3]][i_e] - df[[2]][i_s] > 1000000) {
            write.table(df[i_s:(i_e - 1), ], file = paste0(counter, 'OI.txt'),
                        sep = '\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
            counter <- counter + 1
            i_s <- i_e
            break
        }
        i_e <- i_e + 1
    }
    i_s <- i_s + 1
}
Both of these approaches are extremely slow. Does anyone know of a faster way to achieve this?
In the Python script, you are opening the output file over and over until the while loop ends. Put the open-file statement above the second while loop; that will increase the speed. Also, since the input file is 24 GB, it will take some time regardless.
To increase speed further, try to reduce the number of while loops or use a parallel approach.
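For a file this size, one option is a single streaming pass: read the BED file line by line instead of loading everything with readlines(), and start a new output file whenever the current window grows past 1 Mb. The sketch below is a minimal illustration of that idea, not a drop-in replacement; it assumes the input is sorted by chromosome and start coordinate, and the 1OI.txt, 2OI.txt, ... names simply mirror the naming scheme in the question.

import sys

# Single-pass sketch: stream the BED file instead of loading 24 GB into memory,
# and open a new output file whenever the current interval ends more than 1 Mb
# after the start of the window (or the chromosome changes).
WINDOW = 1000000

counter = 0
out = None
window_chrom = None
window_start = None

with open(sys.argv[1]) as bed:
    for line in bed:
        if not line.strip():
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # open a fresh output file at the first interval, at a chromosome
        # change, or once the window exceeds 1 Mb
        if out is None or chrom != window_chrom or end - window_start > WINDOW:
            if out is not None:
                out.close()
            counter += 1
            out = open(str(counter) + 'OI.txt', 'w')
            window_chrom = chrom
            window_start = start
        out.write(line)

if out is not None:
    out.close()

This makes only one pass over the input, so the run time is dominated by disk I/O rather than by the nested loops. If bedtools is available, bedtools makewindows followed by bedtools intersect can produce a similar windowed split without custom code.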