Hi,
I have ~40 million short-read mapping locations for which I want to retrieve the reference genome sequence (human), so I have a table of chromosome, start, end and strand. At the moment I am using the BSgenome getSeq function on 500k chunks at a time, but it is taking a few hours to get through all 40 million. Does anyone have a faster solution? (Linux command line friendly.)
Thanks, Aaron
EDIT - Here's my current R code, which reads in a csv of co-ordinates 500k lines at a time and writes out the same csv with two extra columns containing the forward and reverse sequences. It took about 6 hours to finish an 18 million line csv on an 8-core 64-bit Linux box with 16 GB of RAM.
infile    <- commandArgs(TRUE)[1]               # input csv of mapping locations
outfile   <- commandArgs(TRUE)[2]               # output csv with sequences appended
width     <- as.integer(commandArgs(TRUE)[3])   # length of sequence to extract
chunkSize <- 500000                             # lines to process per iteration

require(BSgenome.Hsapiens.UCSC.hg18)

f1   <- file(infile, open = "rt")
fOut <- file(outfile, open = "wt")

repeat {
    # Read the next chunk and split the comma-separated fields into a data frame
    temp <- as.data.frame(do.call(rbind, strsplit(readLines(f1, n = chunkSize), split = ",")),
                          stringsAsFactors = FALSE)
    if (nrow(temp) == 0) break

    colnames(temp) <- c("id", "chr", "fw", "rv", "strand", "MM")
    temp$fw <- as.integer(temp$fw)
    temp$rv <- as.integer(temp$rv)

    # Forward sequence on the read's strand, reverse sequence on the opposite strand;
    # as.character() keeps the columns as plain character vectors whatever getSeq returns
    temp$fwSeq <- as.character(getSeq(Hsapiens, temp$chr, start = temp$fw,
                                      width = width, strand = temp$strand))
    temp$rvSeq <- as.character(getSeq(Hsapiens, temp$chr, start = temp$rv,
                                      width = width,
                                      strand = ifelse(temp$strand == "+", "-", "+")))

    # Write the chunk back out as csv
    writeLines(gsub(" ", "", apply(temp, 1, paste, collapse = ",")), fOut)
}

close(f1)
close(fOut)
Hi Aaron, can you give us some example code to optimize? Also, what hardware are you using? For me, getting 1 million sequences out of C. elegans takes about 3 seconds, but that might depend on genome size.
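For reference, a minimal timing sketch of what that kind of benchmark might look like; the package, read width and random coordinates are my assumptions, not the actual test that was run:

library(BSgenome.Celegans.UCSC.ce2)

n     <- 1e6                      # number of sequences to pull
width <- 36                       # assumed read length
lens  <- seqlengths(Celegans)     # chromosome lengths, used to keep starts in range
chrs  <- sample(names(lens), n, replace = TRUE)
starts <- floor(runif(n, min = 1, max = lens[chrs] - width))

# Time a single vectorised getSeq call over all 1 million coordinates
system.time(
    seqs <- getSeq(Celegans, chrs, start = starts, width = width)
)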
Hmm, hardware is definitely not an issue. However, this is code I wrote a while ago, so I'm not too sure why I only have it sucking in 500k lines at a time; R should handle an 18M line csv with 16 GB of RAM. Still, I guess my awk solution posted below will probably beat the pants off BSgenome any day.
I just tried reading the whole csv in, running it and writing out the csv. It took about half an hour, so not much longer than the awk solution I posted below, but it used ~14 GB of RAM, so I wouldn't have any wiggle room for larger files (which I will need in the future). I'll stick with awk, even though it's ugly and hard to debug.
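For what it's worth, a rough sketch of that whole-file variant, assuming the same column layout and command-line arguments as the chunked script above (my reconstruction, not the exact code that was run):

library(BSgenome.Hsapiens.UCSC.hg18)

infile  <- commandArgs(TRUE)[1]
outfile <- commandArgs(TRUE)[2]
width   <- as.integer(commandArgs(TRUE)[3])

# Pull the entire csv into memory in one go
temp <- read.csv(infile, header = FALSE, stringsAsFactors = FALSE,
                 col.names = c("id", "chr", "fw", "rv", "strand", "MM"))

# One vectorised getSeq call per column, exactly as in the chunked loop
temp$fwSeq <- as.character(getSeq(Hsapiens, temp$chr, start = temp$fw,
                                  width = width, strand = temp$strand))
temp$rvSeq <- as.character(getSeq(Hsapiens, temp$chr, start = temp$rv,
                                  width = width,
                                  strand = ifelse(temp$strand == "+", "-", "+")))

write.table(temp, file = outfile, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

Holding the full data frame plus both sequence columns in memory at once is presumably where the ~14 GB figure comes from.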
You don't have to stick with it; it's just a matter of the I/O. Let's beat awk...
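One place the I/O could plausibly be trimmed (my guess, not something measured in this thread) is the parsing rather than getSeq itself: readLines() followed by strsplit() and rbind() builds each chunk row by row, whereas scan() can read typed columns straight from the open connection. A sketch of the chunked read rewritten that way:

f1 <- file(infile, open = "rt")
repeat {
    # Read one chunk of typed columns directly from the connection
    cols <- scan(f1, sep = ",", nlines = chunkSize, quiet = TRUE,
                 what = list(id = character(), chr = character(), fw = integer(),
                             rv = integer(), strand = character(), MM = character()))
    temp <- as.data.frame(cols, stringsAsFactors = FALSE)
    if (nrow(temp) == 0) break

    # ... getSeq calls and csv output exactly as in the script above ...
}
close(f1)

On the output side, a single write.table() per chunk would also avoid the per-row apply()/paste() in the original loop.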