A paper on alignment-free isoform quantification for RNA-seq, with accompanying software called Sailfish, was recently posted to arXiv. Discussion here, here and here. The method is "18 to 29 times faster than the next fastest method while providing expression estimates of equal accuracy" because it works from counts of k-mers instead of mapping reads.
Looks promising - any thoughts or experience with this? How quickly (if at all?) should we be switching from our current expression quantification pipelines?
Well, I tried it out! It still looks promising. I just ran it on some yeast data and compared it to manually calculated RPKMs and to RPKMs from Cufflinks. It was really fast and the RPKMs look good -- Cufflinks is the oddball here, while the Sailfish RPKMs are closer to the manually calculated ones (from TopHat counts using uniquely aligned reads). The bias correction looks a bit odd for one sample, so I'd look into that more before relying on it.
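By "manually calculated RPKMs" I just mean the standard formula applied to per-gene counts of uniquely aligned reads, roughly like the sketch below (the gene names, counts, and lengths are made-up placeholders, not my actual data):

```python
# Minimal RPKM sketch: RPKM = (1e9 * C) / (N * L)
# C = reads assigned to the gene, N = total mapped reads in the library,
# L = gene/transcript length in bp. All numbers below are placeholders.

counts = {"YAL001C": 1523, "YAL002W": 860}     # uniquely aligned read counts per gene
lengths = {"YAL001C": 3483, "YAL002W": 3825}   # gene lengths in bp
total_mapped = 12500000                        # total uniquely aligned reads

rpkm = {
    gene: (1e9 * c) / (total_mapped * lengths[gene])
    for gene, c in counts.items()
}
print(rpkm)
```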
I assume the researchers we work with might feel more comfortable using something like this once it's actually published, but it seems worth a try to me. Especially for cases where people are interested in a fairly simple known-gene expression analysis, this could offer a nice (quick) quantification.
Of course, any time you switch to a new method, it may be good to run both for a while just so you notice if anything is really different.
Edited to add -- The following was my fault, so don't worry about it. Note: I initially tried it on some human data and building the index was taking too long, so I switched to the yeast data set. Perhaps I was being too polite with our shared server and it would have sped up if I had asked for more threads, but I let it run for 5 days or so before I gave up. Of course, this step only has to be done once for a given transcriptome and k-mer size.
Hi Madelaine, I'm one of the authors of Sailfish. Thanks for trying the software! We'd be interested to hear about any experiences you had running Sailfish. As you know, it's in beta, so we're still squashing bugs and working out kinks. Would you mind sharing some info on the transcriptome for which you were building the index? Building the index is linear in the number of distinct k-mers, but we've noticed that with a very large transcriptome the memory requirements for building the perfect hash can climb pretty high, and any paging out to disk can slow things down a lot. The reason I ask is that the transcriptome we used for the synthetic tests in our paper contained about 105K transcripts, and building the index took only 5-10 minutes. The memory overhead of the hash itself, once built, is fairly small, but there is a significant constant-factor memory overhead involved in constructing a minimal perfect hash. We're currently looking into an out-of-core index construction procedure that may be better suited to large transcriptomes and could be used in such cases. Thanks again for giving our software a try!
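To make the "linear in the number of distinct k-mers" point concrete, here is a rough way to gauge how large a transcriptome's k-mer set is. This is just a naive back-of-the-envelope count, not how Sailfish builds its index, and the set itself will eat a lot of memory on a big FASTA (the file path and k are placeholders):

```python
# Naive count of distinct k-mers in a transcript FASTA -- a rough size
# estimate only, not Sailfish's actual index construction.
k = 20
kmers = set()
seq_parts = []

def add_kmers(seq):
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            kmers.add(kmer)

with open("transcripts.fa") as fh:   # placeholder path
    for line in fh:
        if line.startswith(">"):
            if seq_parts:
                add_kmers("".join(seq_parts))
                seq_parts = []
        else:
            seq_parts.append(line.strip())
    if seq_parts:
        add_kmers("".join(seq_parts))

print(f"{len(kmers)} distinct {k}-mers")
```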
Hey, thanks for responding! I was using Ensembl 72 human transcripts; there are ~194k transcripts in my FASTA file. I was working on a server with a ton of memory... After I posted this, I saw your Google group and read that you were working on large transcriptomes.
I just took a look and my transcript FASTA is 6 GB. I didn't realize it was that beefy... I think this must be a pretty inclusive transcriptome; someone else in my group actually constructed it. Here it is if you want to take a look.
Hi Madelaine. Yes, I was just going to e-mail you about this, as I was perplexed by it. It looks like the file you linked contains the full (unspliced) genomic DNA of each transcript. If you grab the Ensembl 72 human reference cDNA (the same set of transcripts), it's about 317 MB; it shouldn't take more than 10-20 minutes to build the index on that.
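A quick way to catch this kind of mix-up is to check the record count and mean sequence length in the FASTA: spliced human cDNAs average a couple of kb each, so ~194k of them should total a few hundred Mb, whereas unspliced genomic spans easily push the file into the gigabytes. A minimal check along those lines (the file name is a placeholder):

```python
# Sanity check on a transcript FASTA: record count, total bases, mean length.
n_records = 0
total_bases = 0

with open("transcripts.fa") as fh:   # placeholder path
    for line in fh:
        if line.startswith(">"):
            n_records += 1
        else:
            total_bases += len(line.strip())

print(f"{n_records} records, {total_bases / 1e6:.1f} Mb total, "
      f"{total_bases / max(n_records, 1):.0f} bp mean length")
```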
Well, I thought I'd try it. Then I started installing cmake. Then I found out I needed a newer version of g++. I'll get back to you.
Well, I gave up on trying to build from source... But I just ran the executable on the sample data.