Fastest Way Of Extracting Millions Of Short Sequences From The Human Genome
14.6 years ago
Aaron Statham ★ 1.1k

Hi,

I have ~40 million short read mapping locations for which I want to retrieve the reference genome sequence (human) - so I have a table of chromosome, start, end and strand. At the moment I am using the BSgenome getSeq function on chunks of 500k at a time, but it is taking a few hours to get through all 40 million - does anyone have a faster solution? (Linux command line friendly)

Thanks, Aaron

EDIT - Here's my current R code, which reads in a csv with coordinates 500k lines at a time and writes out the same csv with two extra columns containing the forward and reverse sequences - it took about 6 hrs to finish an 18 million line csv on an 8-core 64-bit linux box with 16GB RAM

infile <- commandArgs(TRUE)[1]
outfile <- commandArgs(TRUE)[2]
width <- as.integer(commandArgs(TRUE)[3])
chunkSize <- 500000

require(BSgenome.Hsapiens.UCSC.hg18)
f1 <- file(infile, open="rt")
fOut <- file(outfile, open="wt")
# process the csv in chunks of chunkSize lines to keep memory usage down
repeat {
    # read the next chunk and split the comma-separated fields into a data frame
    temp <- as.data.frame(do.call(rbind, strsplit(readLines(f1, n=chunkSize), split=",")), stringsAsFactors=FALSE)
    if (nrow(temp)==0) break
    colnames(temp) <- c("id","chr","fw","rv","strand", "MM")
    temp$fw <- as.integer(temp$fw)
    temp$rv <- as.integer(temp$rv)
    # fetch 'width' bases at the forward position on the read's strand,
    # and at the reverse position on the opposite strand
    temp$fwSeq <- getSeq(Hsapiens, temp$chr, temp$fw, width=width, strand=temp$strand)
    temp$rvSeq <- getSeq(Hsapiens, temp$chr, temp$rv, width=width, strand=ifelse(temp$strand=="+", "-", "+"))

    # append the chunk to the output csv with the two sequence columns added
    writeLines(gsub(" ","",apply(temp, 1, paste, collapse=",")), fOut)
}

close(f1)
close(fOut)
sequence bioconductor r • 9.8k views

Hi Aaron, can you give us a code example to optimize? Also, what hardware are you using? For me, getting 1 million sequences out of C. elegans takes about 3 sec, but that might depend on genome size.


Hmm, hardware is definitely not an issue - however, this is code I wrote a while ago, so I'm not too sure why I only have it sucking in 500k lines at a time; R should handle an 18M line csv with 16GB of RAM. However, the awk solution I posted below will probably beat the pants off BSgenome any day, I guess.


I just tried reading the whole csv in, running the extraction, and writing out the csv - it took about half an hour, so not much longer than the awk solution I posted below, but it used ~14GB of RAM, so I wouldn't have any wiggle room for larger files (which I will need in the future). So I'll stick with awk (even though it's ugly and hard to debug).


You don't have to stick with it, it's just a matter of the I/O - let's beat awk...

14.6 years ago

Have you tried bedtools? I've never used that specific feature, but from what I hear, they're all pretty fast. Specifically, I think you're looking for the fastaFromBed command.
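
Something along these lines should do it (the file names here are just placeholders; -s makes the extraction strand-aware, so minus-strand features come back reverse-complemented):

fastaFromBed -fi hg18.fa -bed reads.bed -fo reads_seqs.fa -s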


Other suggestions: if you have access to a cluster, this problem is really easy to split, either along chromosomes or into smaller chunks if you have more nodes. I'll also mention that BSgenome seemed pretty slow the last time I tried to use it. Got a machine with a decent amount of memory? If so, you might be better off just rolling your own - read a whole chromosome's sequence into memory, then slice out the chunks you need (using Perl|Python|Ruby).


Yeah, I ended up writing an awk script to grab the sequences (one chromosome at a time) and format them exactly how I wanted them - it took about 20 mins to do the 40M, a major speedup over R, and it also uses barely any RAM (a few hundred megs).

I haven't tested bedtools though, it may be faster

14.4 years ago

Hi Aaron, Michael,

FYI, I've made some improvements to the getSeq() function in BSgenome and to the write.XStringSet() function in Biostrings. With the latest devel versions of Biostrings (2.17.20) and BSgenome (1.17.5), it takes less than 5 min and 2.5GB of RAM to extract 46 million short sequences from the human genome and write them to a FASTA file.

First generate 46 million short read mapping locations:

library(BSgenome.Hsapiens.UCSC.hg18)
generateRandomReadLocations <- function(x, density, readlength)
{
      if (!is.integer(readlength))
          readlength <- as.integer(readlength)
      start <- lapply(seqnames(x),
                      function(name)
                      {
                        seqlength <- seqlengths(x)[name]
                        sample(seqlength - readlength + 1L,
                               seqlength * density,
                               replace=TRUE)
                      })
      names <- rep.int(seqnames(x), elementLengths(start))
      ranges <- IRanges(start=unlist(start), width=readlength)
      strand <- strand(sample(c("+", "-"), length(names), replace=TRUE))
      GRanges(seqnames=names, ranges=ranges, strand=strand)
}
locs <- generateRandomReadLocations(Hsapiens, 0.015, 75)

Then extract the corresponding sequences in chunks of 2 million and write them to a FASTA file (don't forget to specify as.character=FALSE when calling getSeq() or you will blow out R!):

breakInChunks <- function(totalsize, chunksize)
{
    quot <- totalsize %/% chunksize
    ans_width <- rep.int(chunksize, quot)
    rem <- totalsize %% chunksize
    if (rem > 0L)
        ans_width <- c(ans_width, rem)
    IRanges(end=cumsum(ans_width), width=ans_width)
}
extractAndWriteSequencesToFASTA <- function(x, locs, file)
{
    chunksize <- 2000000
    chunks <- breakInChunks(length(locs), chunksize)
    for (i in seq_len(length(chunks))) {
        chunk_start <- start(chunks)[i]
        chunk_end <- end(chunks)[i]
        cat("Extracting chunk ", i, " / ", length(chunks),
            " (", chunk_start, " to ", chunk_end, ") ... ", sep="")
        seqs <- getSeq(x, locs[chunk_start:chunk_end], as.character=FALSE)
        cat("OK. Writing to file ... ")
        write.XStringSet(seqs, file=file, append=TRUE)
        cat("OK.\n")
    }
}
extractAndWriteSequencesToFASTA(Hsapiens, locs, "test.fa")

Hope this helps! -- Hervé

14.6 years ago
Aaron Statham ★ 1.1k

I ended up using awk to get the job done in about 20 mins for the same 18 million line file.

First, an awk script to read a chromosome's fasta file into a single line, with no header and no line breaks - this is called readChr.awk:

NR==1{next} {
    printf("%s",$0)
} END {printf("\n")}

Now I sort the reads by chromosome and position, then process them through awk, which loads each chromosome as it is encountered and uses substr() to grab the 75 bp for each forward and reverse read:

sort --field-separator="," -k2,2 -k3,3n infile.csv | awk 'BEGIN {FS=","}
function revComp(temp) {
    for(i=length(temp);i>=1;i--) {
        tempChar = substr(temp,i,1)
        if (tempChar=="A") {printf("T")} else
        if (tempChar=="C") {printf("G")} else
        if (tempChar=="G") {printf("C")} else
        if (tempChar=="T") {printf("A")} else
        {printf("N")}
    }
} { #entry point
    if ($2!=chr) { #if hit a new chromosome, read it into chrSeq
        chr=$2
        "awk -f readChr.awk "chr".fa" | getline chrSeq
    }
    FW=toupper(substr(chrSeq,$3,75)) #retrieve forward sequence (75bp)
    RV=toupper(substr(chrSeq,$4,75)) #retrieve reverse sequence (75bp)
    printf("%s,%s,%s,%s,%s,%s,",$1,$2,$3,$4,$5,$6)
    if ($5=="+") {
        printf("%s,",FW)
        revComp(RV)
        printf("\n")
    } else {
        revComp(FW)
        printf(",%s\n",RV)
    }
}' > outfile.csv

+1 for using gawk

14.6 years ago

The tools developed by Jim Kent are in http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/utils/

For example, see nibFrag or twoBitToFa; both use the sequences indexed for blat.

For example, twoBitToFa has a -bed option:

twoBitToFa - Convert all or part of .2bit file to fasta
    (...)
    -bed=input.bed - grab sequences specified by input.bed. Will exclude introns
    (...)
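
So something like the following should work (hg18.2bit and the BED file name are just placeholders):

twoBitToFa -bed=reads.bed hg18.2bit reads.fa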

Second option: for each chromosome, put the whole sequence in memory, parse your mapping file, and print out the substring(start, end).
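
For instance, a rough sketch of that second option in plain R (no BSgenome; the file name and column layout are just placeholders, and minus-strand reads would still need reverse-complementing):

chr_seq <- paste(readLines("chr1.fa")[-1], collapse="")  # whole chromosome as one string
reads <- read.csv("chr1_reads.csv", header=FALSE)        # columns: chr, start, end, strand
seqs <- substring(chr_seq, reads$V2, reads$V3)           # slice out each read's sequence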


Good idea about the UCSC tools, sadly they're deadly slow (it looks like it would take more than a day to do the 40M)

Hopefully someone can come up with a ready-made second option, otherwise I'll have to make it myself.


twoBitToFa is too slow in its current version. This command line does one million in a minute:

python -m twobitreader hg19.2bit < temp.bed

http://pythonhosted.org/twobitreader/

14.6 years ago
Michael 55k

Hi Aaron, thanks for posting the example code. What can be optimized here has nothing to do with BSgenome; it's the way you handle the I/O - reading with readLines, converting into a data frame, and writing the sequences out with writeLines - that takes up the time. Try this:

# generate 1M random sequences
start = runif(1000000, min=1, max =length(Hsapiens$chr1)-200)
width = runif(1000000, min=30, max=200)
system.time(getSeq(Hsapiens, "chr1", start=start, width=width))
   user  system elapsed 
  8.236   0.572   8.750

On a MacBook Pro with 2 GB of RAM.

You should have enough memory to read in the whole 40 million - at least you can try. For reading very large datasets into R, you should use the scan function. It is much more efficient than read.table or readLines, and do not convert to a data.frame - that can take up 10 times as much memory.

Here is a code example of reading in your data efficiently using scan:

name.col = 1
chrom.col = 2
# ..... likewise define start.col, end.col and strand.col,
# given you know the total number of columns n.col and the column indices
what = vector("list", n.col)     # one entry per column; columns left NULL are skipped by scan()
what[[name.col]] = character(0)  # if your match has a name
what[[chrom.col]] = character(0)
what[[start.col]] = integer(0)
what[[end.col]] = integer(0)
what[[strand.col]] = character(0)
# if you want to chunk this, use the skip and nmax parameters in a loop
reads = scan(filename, what=what, sep=",", skip=0, nmax=-1)

result = getSeq(Hsapiens, names=reads[[chrom.col]], start=reads[[start.col]], end=reads[[end.col]])

In fact, this should be even faster than all other solutions presented here.


A small tip: with the latest R version, you can read compressed files directly with scan or the read.* functions. This has been shown to be faster than reading an uncompressed file. Ref: http://blog.revolution-computing.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html
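
For example, something like this should work with the scan() call above, once the csv has been gzipped (the file name is just a placeholder):

reads = scan("infile.csv.gz", what=what, sep=",")  # the file connection behind scan() decompresses gzip transparently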


Very nice improvement! If you can choose how to save the result, e.g. to keep it for further analysis in R, it might be best to save it in R's binary format using save and load.
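
For instance (using the result object from the snippet above):

save(result, file="sequences.RData")  # compact binary format, fast to write and reload
load("sequences.RData")               # later restores 'result' into the workspace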


Thanks for the detailed reply! However, I don't see any time difference between using read.csv() or scan() on my actual file (both take ~6.5 mins to read in the 18 million lines). Also, using R to grab the 18x2 million reads gobbles up 12GB of RAM and takes 12 mins, and then writing out the results takes another ~5 mins.

So time-wise, R is as good as awk; however, my awk script doesn't hit the RAM so hard (I think it peaks at ~4GB, though I haven't measured it properly yet), which means I don't have to worry about what other scripts I'm running on this box at the same time. Cheers though!

8.3 years ago

You can use the twobitreader Python module:

python -m twobitreader hg19.2bit < temp.bed

http://pythonhosted.org/twobitreader/
