Forum: The new era of bioinformatics: simple and fast tools, fewer and more informative data
9.5 years ago

It was with great joy that I read the announcement by Lior Pachter that they have developed kallisto, a tool that performs near-optimal RNA-seq quantification with a fraction of the computational resources that were needed before.
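For context, a sketch of the idea behind kallisto's speed: pseudoalignment matches the k-mers of a read against a transcriptome index instead of computing base-level alignments. The sequences, k-mer size, and data structures below are illustrative toys, not kallisto's actual implementation:

```python
# Toy illustration of pseudoalignment: intersect the sets of transcripts
# compatible with each k-mer of a read, rather than aligning base by base.
# Sequences and k are hypothetical, not kallisto's real index.
K = 5

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts):
    """Map each k-mer to the set of transcripts containing it."""
    index = {}
    for name, seq in transcripts.items():
        for km in kmers(seq):
            index.setdefault(km, set()).add(name)
    return index

def pseudoalign(read, index):
    """Intersect compatibility sets over the read's k-mers."""
    compatible = None
    for km in kmers(read):
        hits = index.get(km, set())
        compatible = hits if compatible is None else compatible & hits
    return compatible or set()

transcripts = {"tx1": "ACGTACGTGA", "tx2": "ACGTACGTCC"}
index = build_index(transcripts)
print(pseudoalign("ACGTACGT", index))  # both transcripts share this prefix
```

The point is that set intersections over a hash index avoid the dynamic programming that makes classical alignment expensive.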

In parallel, the MinION platform promises average read lengths of over 15 kb (up to 160 kb) in a simpler, far smaller and more easily accessible format. The era of ??X coverage will be over when we can directly measure substantial segments of the genome.

I see the writing on the wall. We should not need massive infrastructure, CPU farms, cloud computing, or system admins to do bioinformatics. We could just use a laptop, or even a phone. Biology is immensely complicated; cutting out the unnecessary complexities of processing and managing data will be beneficial to everyone.

These are the technologies that truly democratize genomics.

kallisto computational-cost

By the writing on your wall, it looks like I'll be out of a job soon if I don't evolve :)


As a scientist, you're always out of a job if you don't evolve. :)


Well, it all depends ;-) I don't think being out of a job is the threat, though - it's more that running cuffdiff is not what bioinformatics is supposed to be.


On the other hand, even small labs are doing dozens of whole genomes, requiring more infrastructure and analysts who are more than technicians.


MinION? Last time I checked it had crazy error rates, along with lots of chimeric sequences (we found sequences that were not supposed to be there - even with their control lambda DNA!). Has this thing improved since then?


It is an early platform and as such may have many problems. Users report notable improvements in quality with new chemistry releases; I will have the chance to evaluate it myself very soon.

But even today I can see that once we start dealing with 160 kb reads, everything we think we know about bioinformatics needs to be reevaluated. It is like a naked-emperor situation: what we call bioinformatics is really working around the limitations of producing billions of very short reads. What we really want is one read that corresponds to the chromosome, the transcript, the bound DNA - the actual unit of DNA under study.

Come to think of it, is a BAM file suited to represent a 160 kb alignment? Not so clear. Remember, BAM is supposed to act like an indexed database for querying an interval out of hundreds of millions of entries. But if I have only a few thousand, why bother with that?

It may sound ridiculous to say, but BAM files did not exist 10 years ago, and perhaps they will not be used 10 years from now - I think that will come true.
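To make the "why bother" concrete: with only a few thousand alignments, a plain sorted list in memory answers region queries without any on-disk index. A minimal sketch with hypothetical coordinates:

```python
import bisect

# With a few thousand alignments, an in-memory sorted list answers region
# queries; no BAI/R-tree machinery needed. Coordinates are hypothetical.
alignments = sorted([(100, 160_100), (250_000, 410_000), (900_000, 1_060_000)])
starts = [a[0] for a in alignments]

def overlapping(region_start, region_end):
    """Return alignments overlapping the half-open interval [start, end)."""
    # Any alignment starting before region_end could overlap; filter by end.
    i = bisect.bisect_left(starts, region_end)
    return [(s, e) for s, e in alignments[:i] if e > region_start]

print(overlapping(300_000, 350_000))  # -> [(250000, 410000)]
```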


SAM/BAM was designed even with whole-genome alignment in mind. Multiple segments per read, hard clipping and the R-tree index were particularly geared towards long reads; in the era of 35 bp reads we didn't need these features. SAM has issues with long reads because we lacked use cases at the time, but these issues are hard to solve anyway. If I redesigned a long-read alignment format from scratch, I am not sure I could do much better. That said, I also believe we will be doing alignment less and less in the future. SAM will die ultimately; BAM will die sooner. Personally, I was already looking forward to long reads when I designed SAM/BAM and developed bwa-sw and later bwa-mem. Long reads will rule. It is just a matter of time.
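To illustrate the long-read machinery already in the spec (flag 0x800 marking a supplementary/split alignment and 'H' CIGAR operations hard-clipping bases aligned elsewhere are standard SAM fields; the parser itself is just a sketch):

```python
import re

# Minimal reader for two SAM features aimed at long reads: the
# supplementary-alignment flag (0x800) and hard clipping ('H' in CIGAR).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def describe(flag, cigar):
    supplementary = bool(flag & 0x800)
    hard_clipped = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "H")
    aligned = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")
    return supplementary, hard_clipped, aligned

# One segment of a hypothetical split 160 kb read:
print(describe(2048, "120000H40000M"))  # -> (True, 120000, 40000)
```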


One thing that should/will change IMO is that once we measure (really) long sequences, we could end up in situations where line-oriented I/O is no longer efficient. We can't process files where the entire sequence for chr1 is placed on a single line.
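A sketch of what chunked, rather than line-oriented, reading looks like (the file path and chunk size are hypothetical, and a raw sequence file with no header is assumed for simplicity):

```python
# Process a huge single-record sequence in fixed-size blocks so memory
# stays bounded even if the whole sequence sits on one "line".
def gc_count(path, chunk_size=1 << 20):
    gc = total = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)  # never readline() a 160 kb+ line
            if not chunk:
                break
            gc += chunk.count(b"G") + chunk.count(b"C")
            total += len(chunk) - chunk.count(b"\n")
    return gc, total
```

The result is independent of the chunk size, which is the property line-oriented code loses when records stop fitting in memory.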

9.5 years ago
John

I would say this is not something unique to bioinformatics but rather the direction that software development is heading in general.

Back in the old days when processors only had limited capacity, software developers needed to write their code as efficiently as possible, often optimising it for the CPU architecture it was supposed to run on. Speed of execution was king.

As processors became exponentially more powerful, and compilers became as good as if not better than hand-tuned C, speed of execution took a back seat to speed of development. Higher-level languages which abstracted complexity away became (and still are) incredibly popular.

Now that development time for even the most complicated apps is measured in weeks or months, not years, focus has shifted from development time back to execution time. But this hasn't been easy, because it means giving up a lot of the abstraction current-day developers are used to. For example, it is often 1000x more performant to use typed arrays (Cython/NumPy) than the boxed objects Python's data is typically stored as - but this speed comes with restrictions, problems, and a certain level of technical expertise, all of which costs money to whoever is funding the project.
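A rough way to see the boxed-object overhead with nothing but the standard library (exact numbers vary by interpreter; NumPy widens the gap further with vectorised C loops):

```python
import array
import sys

# Per-element storage for a list of Python ints vs. a typed array holding
# the same values: the list keeps one full object (plus a pointer) per
# element, the array keeps one contiguous C buffer.
n = 100_000
boxed = list(range(n))
typed = array.array("q", boxed)  # 8-byte signed ints, one buffer

list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
array_bytes = sys.getsizeof(typed)
print(list_bytes // n, "bytes/element boxed vs", array_bytes // n, "typed")
```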

When internet adoption exploded, Facebook/Twitter/Google etc. didn't have time to redesign programming paradigms, and 'solved' the problem of scale by parallelization: Hadoop, map/reduce, Google Bigtable, huge data centres, server clusters, and so on. I think it's pretty well established these days among HPC experts that this was a bad trend for everyone else to follow. The structure of Hadoop often overshadows the fact that the code and messaging Google uses is itself extremely well written to begin with. But other developers picked it up because it allows them to offload their work onto operations - the people who buy and maintain the hardware the code runs on. Code not fast enough? Buy more solid-state drives! I will fix my code as a last resort.

PayPal recently rewrote their entire codebase from Java to Node.js, which caused a flurry of Java vs. JavaScript posts/threads/blogs, with all kinds of speed tests, etc.

The conclusion? The language probably made no difference and is not what increased their performance by 3x - it was the fact that they rewrote all their code and reduced it by 40% that actually made things faster.

My point is, we are heading into a new era of program development where algorithm design - which has nothing to do with the CPU hardware, the parallelization, or the programming language it runs on - is king. This is where the 10x, 100x, or even 1000x speedups can be found.
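A toy example of that kind of win - same machine, same language, only the algorithm changes:

```python
# Finding values common to two collections: switching the data structure
# changes the complexity from O(n*m) to O(n+m), with no hardware upgrade.
def common_quadratic(a, b):
    return [x for x in a if x in b]       # 'in' on a list scans it: O(n*m)

def common_linear(a, b):
    b_set = set(b)                        # hash lookup is O(1) on average
    return [x for x in a if x in b_set]   # total O(n + m)

a = list(range(0, 10_000, 2))
b = list(range(0, 10_000, 3))
assert common_quadratic(a, b) == common_linear(a, b)
```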

"Cutting out the unnecessary complexities of processing and managing data will be beneficial to everyone" - exactly right! But I wouldn't say this is going to be as simple task. To cut out the unnecessary computations in an algorithm requires a near god-like knowledge of all possibilities in input/output/computation. To put it more poetically: the daily activities of a child is extremely simple. Their inputs and outputs are, not complex... but it still takes an adult, aware of the entire picture that is a human life, to design this incredibly boring day.


In other words, methods matter -- and I couldn't agree more :)


A comment on accidental complexities: when data is too large, all of a sudden we have to deal with issues we usually would not have to. IMO merely reducing the data to fewer, more informative measurements will have a massive effect - without changing anything else.

9.5 years ago
enxxx23

Hold your horses!

I think that such a conclusion - "we should not need massive infrastructure, CPU farms, cloud computing, system admins to do bioinformatics" - has more to do with sensationalism than with reality!

The public release of the kallisto tool for near-optimal RNA-seq quantification is a welcome advancement which we have all been waiting on for a long time!

Here are some comments:

  • not all RNA-seq experiments/projects are about quantification of transcripts (e.g. finding fusion genes in RNA-seq is very important in many cases)!
  • the kallisto tool (and also the salmon tool) are still works in progress (i.e. unfinished), for the simple fact that there is not yet any official tool (i.e. mature and specially designed for kallisto/salmon output) that allows one to do differential transcript expression analysis (for example, using the limma/voom/edgeR/DESeq packages on kallisto's output goes against the express advice of those packages' authors)

Kallisto and Salmon are nice/fast/needed/very good tools, but I see no writing on the wall that this can be generalized to DNA-seq or the entire field of bioinformatics.


The bit about the infrastructure was not so much a conclusion as a hope that eventually we will not need all that.

As for statistical methods, those too will evolve - for the better. The focus will be on combining repeated observations rather than modeling large amounts of data obtained in one batch.
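One concrete example of combining repeated observations: Fisher's method pools p-values from independent replicate experiments. For an even number of degrees of freedom the chi-square tail has a closed form, so the standard library suffices (the p-values below are made up):

```python
import math

# Fisher's method: combine k independent p-values via
# X = -2 * sum(ln p_i), which is chi-square with 2k degrees of freedom.
# For df = 2k the survival function has the closed form
# exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
def fisher_combine(pvalues):
    x = -2.0 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term
        term *= (x / 2.0) / (i + 1)
    return math.exp(-x / 2.0) * tail

# Three weak replicate signals combine into stronger evidence:
print(fisher_combine([0.04, 0.06, 0.05]))  # smaller than any single p-value
```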


Completely agree on "Focus will be on combining repeated observations rather than modeling large amounts of data obtained in one batch" - important point!

9.5 years ago
travcollier

Of course, lots of questions involve looking at large populations of sequences rather than just trying to get one (or a few). Yes, very long reads will be awesomely useful, but they won't solve all (or, IMO, even most) of the computational constraints.
