Question

Forum:Genomics is not Special. Computational Biologists are reinventing the wheel for big data biology analysis

26

10.1 years ago

William ★ 5.3k

Uri Laserson from Cloudera:

Computational Biologists are reinventing the wheel for big data biology analysis, e.g. CRAM = column storage, Galaxy = workflow scheduler, GATK = scatter-gather. Genomics is not special. HPC is not a good match for biology, since HPC is about compute and biology is about DATA.

http://www.slideshare.net/urilaserson/genomics-is-not-special-towards-data-intensive-biology

Uri might be biased because of his employer and background, but I do think he has a point. It keeps amazing me that the EBIs and Sangers of this world are looking for more Perl programmers to build technology from the 2000s instead of hiring Google- or Facebook-level talent to create the technology of the 2020s, using the same technology Google and Facebook are using.

It is time the bioinformatics world caught up with the leading big data tech from the financial and Google-like companies of this world, if it doesn't want to be eaten alive by those tech companies by being forced to upload all its data to Google / Illumina / Amazon.

A nice example of an academic institute that seems to get this is the Icahn School of Medicine at Mount Sinai, where Eric Schadt and Jeff Hammerbacher (of Facebook and Cloudera fame) are making the switch to leading-edge technology for big data analysis in biology.

See links below:

Edit:

In addition to all your great answers and the Counsyl slides/talk that Ying W points to in the comments below, I also found some other presentations that give a more balanced perspective:

genomics cram gatk galaxy • 26k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 10.1 years ago by William ★ 5.3k

11

While I read the slides, I keep asking myself: if the current workflow is so bad and the Hadoop ecosystem is so good, why hasn't Hadoop been widely adopted? Hadoop is 9 years old and Spark is 5. They rose at the same time we were struggling with NGS data. That would seem to have been the best opportunity for Hadoop to revolutionize big data analyses in biology.

ADD REPLY • link 10.1 years ago by lh3 33k

1

I think you answered your question quite well in another comment. Possible reasons are that the technology is not universally available, there is a lack of expertise in biological fields, and there are not enough reproducible methods. It's also important to remember that anyone advocating so strongly for one approach likely has an agenda, and I think it's clear that this talk was given to promote the company. Of course they think their approach is far superior.

To answer OP's question, perhaps EBI and Sanger are hiring Perl programmers because that is what works (in current academic/research environments). I'm very optimistic about Hadoop like others, but I think we should be cautious and wait for practical applications using Open Source tools.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by SES 8.6k

3

I have to chuckle about this a little bit, since bioinformatics was doing Big Data before Big Data was a thing. Of course, one of the big reasons we don't have Google/Facebook-scale tools is that there just hasn't been the same sort of motivation. Many are looking to get the next publication out, and I can only imagine the super-ambitious goals of turning the field on its head are passed over in favor of trying to maintain funding for a year. I think there are a handful of companies out there that are trying to use "real" big data techniques on bioinformatics problems, and that number is only increasing. The need for everyone to publish their own thing can lead to innovation, but it also leads to a great deal of duplication of effort. Particularly in the US, we need more collaboration...

ADD REPLY • link 10.1 years ago by Adamc ▴ 680

3

Uri Laserson's slides badly confuse enterprise (i.e. production, as at Google/Facebook) with research. The role of bioinformatics and bioinformaticians is to (1) do research, and/or (2) support biological/medical research.

With this kind of thinking it is therefore very easy to dismiss or kill research methods and approaches, or the researchers themselves, in any field (for example chemistry, mechanics, the food industry, etc.).

Enterprise is about doing one thing over and over again in an efficient way. There the algorithm/method is already known and published, it is known to work very well for the given problem, and one (for example a programmer) just needs to take it from a book or article and implement it very efficiently according to production/enterprise rules.

Research is about trying new things (every time it is a new thing), finding new or different methods, algorithms and statistical approaches, and discovering new biological insights. Fast prototyping is therefore very important in research and bioinformatics, which is why one sees increased use of programming languages like Python, Perl, R, etc.

Bioinformatics is ultimately working towards doing research and solving biological and medical challenges while Google/Facebook are working towards communication (and selling more ads). The main goal of Google/Facebook is not to do research.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by enxxx23 ▴ 280

0

Fast development is all about proficiency. ADAM developers can implement a Hadoop-based prototype very quickly, no slower than us writing a Perl script. They also do research. There are interesting bits in the tech notes/papers describing ADAM/avocado. Although I don't have the environment to try, I believe their variant caller, indel realignment, markduplicate etc. should compete well with the mainstream tools in terms of capability and accuracy. Of course, achieving that level of proficiency in the Hadoop world is much harder than in the Perl/Python world.

ADD REPLY • link 10.0 years ago by lh3 33k

0

Therein lies the problem with all of this big data stuff: yes, you can do it, but what advantage does it really give? Big data tools aren't any easier to write, don't allow for any advances in the actual analysis, aren't faster and so on. The whole point of big data tools is that they scale; specifically, they're designed to cope with extremely large data sets. That is their sole reason for existence, and outside of this they just don't make sense.

If you're not dealing with petabytes and petabytes of data, you're not gaining anything. In many cases you're making it worse by forcing the analysis to fit the scheme of your big data approach. At best you're back at where you started.

I really think it is funny that Laserson accuses bioinformatics of "reinventing the wheel" when he's doing just that. He's taking a successful software ecosystem and making it all over again. Then again, this is someone who seems to think that bioinformatics is just predicting variants in human genomes.

ADD REPLY • link 10.0 years ago by pld 5.1k

2

By the way, the comments from last URL (technologyreview) also seem to share a different point of view:

"Having just experienced an elderly parent negotiating a huge healthcare facility I echo the thoughts below that people matter. Having access to all medical records the instant they're entered is a wonderful thing. Having access to somebody who can explain what they mean is even better."

ADD REPLY • link 10.1 years ago by mikhail.shugay 3.5k

2

Another talk/slides on this topic from perspective of biotech here: http://blog.counsyl.com/2013/08/01/do-you-want-to-work-in-genomics-because-its-big-data/

ADD REPLY • link 10.1 years ago by Ying W ★ 4.3k

0

very interesting ideas - mainly that genomics is not actually a big data problem today - but it could develop into that

ADD REPLY • link 10.1 years ago by Istvan Albert 102k

1

http://search.cpan.org/~drrho/Parallel-MapReduce-0.09/lib/Parallel/MapReduce.pm -- Perl programmers of the 2020's (in 2008)?? xD

ADD REPLY • link updated 22 months ago by Ram 44k • written 10.1 years ago by Michael 55k

0

## THIS IS ALL STILL EXPERIMENTAL!!
## DO NOT USE FOR PRODUCTION!!
## LOOK AT THE ROADMAP AND FEEDBACK WHAT YOU FIND IMPORTANT!!

ADD REPLY • link updated 22 months ago by Ram 44k • written 10.1 years ago by WilliamS ▴ 320

1

This is just a common courtesy to say the API may change, so any code in production isn't guaranteed to work in the future. That doesn't mean the code doesn't work. Perl itself has many experimental features, and you have to explicitly enable them (or disable the warnings) because they may change.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by SES 8.6k

0

"I think the number one need in biology today, is a Steve Jobs." - Eric Schadt

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by John 13k

Ram · Accepted Answer · 2014-11-16

38

10.1 years ago

Istvan Albert 102k

I strongly disagree. Genomics is special! Very special! Only people that have never needed to generate any novel insights themselves claim otherwise.

Big data and CPU-bound processes are the red herring of the life sciences. Moreover, any comparison to other big data systems is flawed. The data collected by these is ridiculously simplistic when compared to even the most trivial biological phenomena. What never ceases to amaze me is just how deep the rabbit hole is - ask "why" about the simplest biological question and within two or three steps we find ourselves in the dark, where we don't know why an event takes place. Customizing a methodology to the peculiarities of a biological problem via human interpretation will always be a critical component.

It is only in the world of people with a purely computational background that biology is well defined - for them genes are intervals on the genome, transcription and translation are simple algorithmically defined processes between DNA and RNA, each protein has a certain well-defined purpose and magically appears at the right time in the right place, DNA is "just" replicated while ignoring the immense complexity of the process, etc. Real biology is an incredibly complex process - approaches that prematurely standardize it do more harm than good.

Don't get me started on Facebook-like talent - we need to distinguish between the ability of a company to raise money (and hence hire talent) and its ability to create products that truly matter to society as a whole. Facebook rides high because of the belief that it can monetize the vast masses of people that use its interface (sell them stuff they would not otherwise buy). And that may be true - but it has nothing to do with the value of the system, the process, the progress of society or making a change to the world.

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 102k

6

Genomics is special in many respects, but when it comes to data management it shares a lot with other disciplines. After a year of interaction with engineers in GA4GH, I have gradually come to accept that the Hadoop ecosystem is more advanced technically. This then leads to my question in the comment to OP: "why hasn't hadoop been widely adopted?" My tentative answer is that frequently the complexity Hadoop brings outweighs its benefit. In its current form, the overwhelming majority of researchers, me included, don't have the skill set to develop or even to effectively use serious tools based on Hadoop/Spark. Nonetheless, Hadoop will still shine for a few applications developed by professional programmers, provided that researchers can guide them in the right direction.

ADD REPLY • link 10.1 years ago by lh3 33k

3

I think data representation needs to evolve as well. We can't look to BAM files as the means to represent, query and process, say, tens of thousands of RNA-seq or ChIP-seq experiments. But then what is the alternative? I don't know.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 102k

1

Data management is also about reproducibility. When SNPs are put into a database we in fact lose lots of information. Just imagine a new SNP calling method is published that provides a significant increase in accuracy. This is always a possibility, as we are dealing with high-throughput data, not manually curated data.

One will never be able to make use of it without the raw data. One couldn't even check how certain pipeline parameters affect the results. So data management in bioinformatics is actually more about HPC and re-processing of raw data.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by mikhail.shugay 3.5k

1

In the bigger world, data management is in my view more about accessibility - how to access hundreds of thousands of samples with all the privacy restrictions and still do research. Reprocessing raw data is our own problem.

ADD REPLY • link 10.1 years ago by lh3 33k

5

I completely agree. I always secretly roll my eyes when software engineers or computer scientists give me hand-wavy solutions to genuinely difficult problems in bioinformatics. Solving a problem on an artificial human-made system (computers) with known parameters and boundaries is very different from solving a problem on a completely unknown system.

It's like trying to figure out how a piece of code functions on a human-made computer vs how a piece of code works on an alien computer.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Damian Kao 16k

score 29 · Accepted Answer · 2014-11-16

The rate limiting factor of genomics and really any of the types of analysis common to bioinformatics (e.g. transcriptomics, pathway analysis) has never really been the lack of computational resources.

Yes, I know that improvements in computing power have allowed more and more data to be handled, and in doing so have allowed new methods to be developed, old ones to be improved and the scope and scale of projects to grow. However, Big Data is just a great marketing term to repackage old methods and technology, with some new tech sprinkled on top.

Facebook, Google, and so on actually deal with truly large volumes of data. They deal with enormous live data streams and dump them into enormous stockpiles of data. They need to do analysis on this type of data, for which, as OP says, typical HPC systems designed mostly for numerical simulations don't work.

Biological sciences, medical sciences and in turn their computational and informatic subsets don't have this volume of data nor do they have any of the dynamic nature seen in true "Big Data" technologies. Maybe one day I'll be able to get live streams of transcript expression levels from cells, but not for a while. Something people seem to forget is that although sequencing is getting cheaper, it is enormously expensive when you start to think experimentally. Sure $300 usd per sample is great for mammalian RNA-Seq but if I want to compare 4 experimental conditions with 3 replicates each all over a five point time series, it becomes 18k, which isn't an enormous amount of money by any means, but the point is that it gets expensive rapidly.
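
The arithmetic behind that figure, as a small sketch. The $300 per sample is the ballpark price from the paragraph above, and the variable names are only for illustration:

```python
# Back-of-the-envelope cost of a modest RNA-Seq time-course design.
cost_per_sample_usd = 300   # ballpark per-sample price quoted above
conditions = 4              # experimental conditions
replicates = 3              # replicates per condition
time_points = 5             # points in the time series

samples = conditions * replicates * time_points
total_usd = samples * cost_per_sample_usd

print(f"{samples} samples -> ${total_usd:,}")
```

Every extra factor in the design multiplies the sample count, which is why the bill grows so quickly.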

Still, not much has changed in the past 5 or even 10 years with what I can do with that data. Great, RNA-Seq or deep sequencing gives me bigger and more precise datasets, so now what? I can do GWAS and I get a list of variants with p-values. I do whatever DE analysis suits me and I get a list of genes with fold changes and p-values. Awesome, now what? Okay let's annotate it with GO/KEGG/etc and now I have a list of functions with p-values. No matter how many different ways I try and cut the cake I still get a list of crap with p-values and maybe some other metric hanging from them.

The rate limiting factor has never been the computational power, and only infrequently is it a lack of data. The problem has been and still is that no matter how much data is generated or how much cleaner/more precise/etc. the data is, I still can't do a whole lot of anything with it, because the ability to turn these piles of data into information is feeble at best. I get a list of genes that has the usual suspects, which are boring because they've been studied inside and out. Or I get a list of genes with novel items in it and I'm stuck spending a few nights on PubMed scratching my head trying to figure out what is going on.

GO and the like are more or less the same story: all I get is a coarse picture of my data with added noise. Great, "cellular component", that is really useful, or here's a detailed pathway for a process that has nothing to do with the biology I'm looking at (unless it does, but that's after a night on PubMed). What is worse is that all of these things are massively biased towards the people who annotated them to begin with. As soon as you want to look at anything but cancer in mice or humans, you're going to end up scratching your head trying to figure out how some human neuroblastoma crap has anything to do with your virus-infected primary cells.

This problem is only going to get worse. I can do MS or a kinome array along with my RNA-Seq; awesome, I can tell which transcripts are there and what the phosphorylation states are. That sounds like a great way to integrate that data to get a really detailed picture of the activation status of pathways on a global scale. Except here we are again: there's zero data on what protein X being phosphorylated means unless it's one of the few really well characterized proteins (which are almost all network hubs).

I could keep going, but I think what too many people in the computational world who love the latest tech buzzwords, and current industry people, like to do is throw rocks from their ivory towers. No matter how you process the data, whether it's a Raspberry Pi or some Facebook-inspired trendy Hadoop/Hive/etc. system doing the analysis, all that gets spit out is a big ol' list of shit with p-values. Whether it be the latest Illumina tech or the hottest MS approach, all you get is a list of p-values that you dump into your pathway enrichment tool of choice, crap out a few heatmaps and clustering diagrams and call it a day. We can tell people that pathway X was enriched, and genes A, B, C were upregulated or had SNPs or whatever, but our ability to generate biological insight ends there. It's up/down, on/off, has SNPs or doesn't, and chances are that is about as much insight as can be provided. If you have further insight, chances are you're dealing with the usual handful of genes or you're lucky enough to have a ton of groundwork already done.

The raw data, the contextual data, the completion flat interaction networks and the amount of contextual/dynamic annotations for networks and pathways are still poor, incomplete and highly biased to whoever put them there. This is where people forget that a good bioinformatician is a good biologist, or at least should be. I do about 50/50 wet and dry work, so maybe I have a warped perspective, but ivory towers are still a huge problem in this field.

Until Facebook talent can stare into the cytoplasm of my cells, tissues and compartments and hear them whisper their secrets, I'll continue to have searing eye pain every time I hear someone say that our problem is that we're failing to keep up with the cool kids and whatever trend they're on.

Also OP, don't forget that there's plenty of things that benefit from traditional HPCs. If anything I think that this will only increase with time as physical and dynamical simulations become more important tools for analysis. Computational biology isn't just GWAS and genomics.

Ram · Accepted Answer · 2014-11-17

14

10.1 years ago

Jeremy Leipzig 22k

I agree it's kind of unfortunate Uri Laserson had to start his relationship with the bioinformatics community with a "you all suck" presentation. I think all will be forgiven if ADAM can deliver on even half of what it promises.

The low hanging fruit for Big Data Genomics are variant queries (which of my 10,000 samples has variants at or near this position), since they closely resemble the type of ecommerce problem solved by search engines (small data, close to splittable). However, from this thread it seems they are not really even close to getting decent performance, so maybe a little humility is in order:

https://groups.google.com/forum/#!topic/adam-developers/U-pAEUMCmDQ

ADD COMMENT • link 10.1 years ago by Jeremy Leipzig 22k

4

It seems the idea was not just to change the nature of the data to fit the "Big Data" approach; use cases either weren't considered or were also squished into typical "Big Data" approaches. I'm not sure if this is due to a lack of experience in this field or a lack of humility.

Where I work we had a similar experience with people who were trying to start a "Big Data" project with us. They came in with SEMOSS and attempted to convince us that our use cases should change to fit what they had. They had done so little homework on our field that they kept alternating between reinventing the wheel (e.g. stuff like DAVID, IPA) and tackling novel, non-trivial or outright unsolved problems (ones that are the focus of large research efforts, e.g. text mining) as if no one had tried yet.

It is really embarrassing when someone asks one of the heads of OBO how the data in OBO allows them to relate various measurements (e.g. SNPs and disease).

I'm all for the next technology that allows increased performance levels in a reasonable way, but it is really frustrating when people get sold on buzz words or they try and sell to the less informed.

ADD REPLY • link 10.1 years ago by pld 5.1k

1

They don't have an index on chromosome and position but are planning to add an index. They currently are optimized for transforming or querying the whole file; bam to vcf or querying / statistics / machine learning on the whole vcf file.

Mostly I think they are working on germline and somatic variant calling.

See for example this: population stratification on a 1000 Genomes chromosome VCF in under a minute on 40 Amazon machines. http://www.slideshare.net/noootsab/lightning-fast-genomics-with-spark-adam-and-scala?ref=http://bdgenomics.org/

Getting a small section of a vcf file based on chromosome and position is the only query that is fast on vcf's on a single disk, because of the index. It is an important and widely used query. Any other query and you need to read the whole vcf file at least once on a single machine, which will take time for 1000 genomes size vcf data.
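
To make the difference concrete, here is a minimal sketch of why a position index matters. It is not how tabix or ADAM work internally; it only contrasts a full scan with a binary search over records sorted by chromosome and position. The record layout, values and function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy variant records (chromosome, position, alt allele), sorted by
# chromosome then position, the way a coordinate-sorted VCF is.
# The values are invented for illustration.
variants = [
    ("chr1", 1_000, "A"),
    ("chr1", 5_000, "T"),
    ("chr1", 9_000, "G"),
    ("chr2", 2_000, "C"),
    ("chr2", 7_500, "A"),
]

# The "index": just the sort keys, kept in memory.
keys = [(chrom, pos) for chrom, pos, _ in variants]


def query_scan(chrom, start, end):
    """Full scan: touches every record, like a query without an index."""
    return [v for v in variants if v[0] == chrom and start <= v[1] <= end]


def query_indexed(chrom, start, end):
    """Binary search on the sorted keys, then return the matching slice."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return variants[lo:hi]


print(query_indexed("chr1", 4_000, 9_500))
```

Both functions return the same records; the indexed one finds the boundaries in logarithmic time instead of reading everything, which is the whole advantage for region queries and no help at all for any other kind of query.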

They have plans to put an index on the Parquet file (= column database) which stores the VCF. Another option people are looking at is using Apache Cassandra (a distributed record database from Facebook) for storing the data instead. The compression advantage that a column database brings would then be lost.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by William ★ 5.3k

6

I would have thought the issue of adding a chromosome and position index would be a conversation they would have on day 1. I am also surprised they are talking about indexing like it's some kind of advanced feature. Isn't the point of using a column store that basically everything gets indexed implicitly?

ADD REPLY • link 10.1 years ago by Jeremy Leipzig 22k

Ram · Accepted Answer · 2014-11-16

Could only wonder what was said at slide #7 -- the design of IMGT was always killing me :)

As for hiring Google/Facebook talent, we also risk ending up with nice-looking software with a good architecture that doesn't actually produce much biological insight and isn't flexible enough to be adapted by small labs.

Just consider the Google moonshot project. Is mapping disease biomarkers in 174 subjects a "moonshot", or is it the 1990s? Are those guys experienced enough in the field of genomics to really provide something useful to biologists? Bioinformaticians mostly adapt the technology to their needs, and one does not need cloud technology to compare 4 RNA-seq samples; it can be left for larger-scale projects.

Another example is Illumina BaseSpace, which forces you to play nasty tricks with cookies just to download the data via wget. One of the primary goals of Illumina/Amazon/Google is in fact to lock their clients inside their services :)

And we must remember that our primary goal is to help provide insights in the life sciences. We should make as much use of state-of-the-art IT technology and infrastructure as possible, yet it will never do custom bioinformatics analysis for you (remember, there is no free lunch).

score 10 · Accepted Answer · 2014-11-18

There actually is a lot of reinventing the wheel in genomics. There are a number of reasons for this.

First, consider the economic and structural organization of the research enterprise. Much of it is driven by small teams that have to get grant funding for specific research projects. There is little central planning of biological computing infrastructure. A commercial organization the size of Facebook, Google, Amazon has the incentive and resources to invest in internal utility programs that provide services to the rest of the enterprise. Few academic research organizations have either. There are also restrictions on funds that make it relatively easy to spend hundreds of thousands of dollars on a computer cluster but make it difficult to spend the same money on cloud services.

Second, many of the people making architectural decisions have had very bad experiences with the latest and greatest "enterprisey" technology of a few years ago. Think of the fear and loathing inspired by poor use of XML, the WSDL/SOAP web services stack, and poorly-specified binary formats implemented by different people in incompatible ways. This is the reason why so many bioinformaticians love text-based formats. This is beginning to change, but slowly.

Third, scaling in biological research works in a different way. People in a commercial enterprise may plan from the start for scaling up to very large computing resources. In biological research it is advantageous to instead plan to scale down to run on someone's laptop without administrative rights. Reimplementing a feature you need will drive adoption more than requiring a morass of brittle dependencies that rot over time. At whiz-bang technology companies performance is the primary concern. In academic research, reproducibility and data interchange are more important.

These reasons won't magically go away when you tell people that other technologies are superior. In particular, let's note that talk is cheap. Saying that CRAM is just columnar storage does not get you a superior replacement for CRAM and its ecosystem. Saying that Galaxy is just a workflow manager does not get you a superior replacement for Galaxy and its ecosystem. In all software engineering, the devil is in the details and thinking up an idea for better technology architecture does not solve the problems of real biologists. Architectural switches can often be solutions in search of a problem instead.

Ram · Accepted Answer · 2014-11-18

Outstanding discussion!

My contribution:

The algorithms and methods we use in bioinformatics to discover truths about nature are fundamentally different from algorithms used in other realms of data science.

We are trying to tap into a reality that is external to human thought and governed by forces beyond human understanding or control.

What are Facebook engineers doing?

Whatever it is, it's probably a lot easier than what we are doing.

Ram · Accepted Answer · 2014-11-18

5

10.1 years ago

me ▴ 760

ADAM and the Big Data story fall into the same conceptual trap that all data warehousing solutions in bioinformatics fall into. They are too concerned with how data is stored instead of what the data means. Flat files and Avro suffer from the same problem in bioinformatics: a rapid evolution of mostly compatible variants; see e.g. GFF for a historical example. Avro does not solve the issue of multiple people trying to extend the same base format to add their own new types of data. In other words, the problem is not the Volume but the Variety!
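
A small sketch of what that variety looks like in practice: the ninth column of GFF3 and of GTF carries the same kind of information in two mostly compatible syntaxes, so every consumer ends up writing tolerant parsing code along these lines. The function name and the example strings are illustrative only, and the parser is deliberately naive.

```python
def parse_attributes(column9):
    """Parse a GFF3-style (key=value) or GTF-style (key "value") attribute column.

    Deliberately naive: values that themselves contain ';' or '=' would
    break it, which is part of the point about format variants.
    """
    attributes = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        if "=" in field:                      # GFF3 style: ID=gene1
            key, value = field.split("=", 1)
        else:                                 # GTF style: gene_id "gene1"
            key, _, value = field.partition(" ")
        attributes[key.strip()] = value.strip().strip('"')
    return attributes


gff3 = "ID=gene1;Name=abc"
gtf = 'gene_id "gene1"; gene_name "abc";'

print(parse_attributes(gff3))
print(parse_attributes(gtf))
```

Both lines parse, but the keys differ (`ID` versus `gene_id`), so the code still has no idea that they mean the same thing. That is the storage-versus-meaning gap.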

Sure, better formats help, but Avro, like XML before it, is not going to change the bioinformatics field, because neither deals with the reality of experimental setups.

In my opinion the solution for bioinformatics lies in semantics (RDF and semantic web) because these talk about what we stored and don't care how we store it. For example we can store it actively in Hadoop/Sempala etc..., DMS/Oracle/Virtuoso etc.., BED/SPARQL-BED and many more solutions. Or we can store it at rest in RDF/XML, JSON-LD, Thrift, Binary encoding, HDT compressed, original format plus conversion scripts etc... We can ask queries requiring computation in the same way as we query raw data using SADI. Semantics allow us to connect the large variety of data needed to actually progress biological knowledge.

Unlike Spark, SPARQL actually talks about connecting data sources between institutions/groups/laptops and servers using federated queries.

I know that the EBI has had a Hadoop cluster for a long time, and the same goes for Vital-IT, but Hadoop does not meet their infrastructure needs the way LSF/Slurm does. Not because they did not look at it, but because after using it they noticed that files are really neat when managing lots of different data groups, and that ingesting the data from files into HDFS and removing it after use kind of removes the benefit of HDFS in the first place.

tl;dr: ADAM is just a new data warehouse and as such is not going to change our world dramatically. Semantic tech will.

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by me ▴ 760

1

"In my opinion the solution for bioinformatics lies in semantics (RDF and semantic web) because these talk about what we stored and don't care how we store it."

I second that. Integration of heterogeneous data sets is a huge problem in bioinformatics, IMO more so than "big data". It would, e.g., be wonderful to make a semantic query for a protein of interest and collect the associated data at a keystroke (functional annotation, is it differentially expressed in my tissue of interest, is it mutated in a particular type of cancer, what's the evidence for it, what's its ortholog in another species, etc). And don't get me started on the ID conversion problem... I could go on for pages with use cases like this. Backed by Google-like server powerhouses delivering results in seconds, this would be a "killer app". Of course NCBI/EBI as well as some companies work towards these goals, but the amount of unstructured data sets lying idle on some lab web sites or in paper supplementaries far outpaces the information available in a structured format.
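
As a toy illustration of what such a semantic query amounts to, here is a sketch of subject-predicate-object triples with wildcard pattern matching, in plain Python rather than a real RDF store or SPARQL engine. All identifiers and "facts" below are invented placeholders, not real annotations.

```python
# Toy triple store: (subject, predicate, object). Every identifier and
# "fact" here is a made-up placeholder, not real biological data.
triples = [
    ("protein:P1", "encoded_by", "gene:G1"),
    ("protein:P1", "has_function", "kinase activity"),
    ("gene:G1", "differentially_expressed_in", "tissue:liver"),
    ("gene:G1", "ortholog_of", "gene:mouse_G1"),
    ("protein:P2", "has_function", "transport"),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def describe(protein):
    """Collect facts about a protein and about the gene(s) encoding it."""
    facts = match(subject=protein)
    for _, _, gene in match(subject=protein, predicate="encoded_by"):
        facts += match(subject=gene)
    return facts


for fact in describe("protein:P1"):
    print(fact)
```

The hard part is not this lookup but getting everyone's data into shared identifiers and predicates in the first place.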

RDF or similar techniques would make this possible and have been around for long enough. So why is it not taking off in bioinformatics? I see two main problems. First, incentives. Bioinformatics research is not like a company where everyone works towards the same goal. Different groups have different interests and want to publish their own "cool thing", then move on to the next. There is no incentive to dump everything in a standardized format into some data warehouse to make it accessible to the rest of the world. Second, skills. A LOT of people in bioinformatics are not computer/data scientists, but biologists-turned-scripters who care more about the biology than the technology. They often have no clue what RDF/JSON/SPARQL/Hadoop is or how to integrate data well (nor do they care, see first point). They lean towards quick-and-dirty solutions that computer experts would consider "not state of the art". That is all unfortunate, but I can't see any quick fix for it.

But maybe bioinformatics is not so special after all. The above might well apply to many other domains and as a whole is the reason why the dream of the "semantic web" has not materialized (yet).

ADD REPLY • link 10.1 years ago by Christian ★ 3.1k

0

Because until SPARQL 1.1 (standardized last year, in 2013), RDF had the promise but not the capability. Since then uptake has vastly improved. This year we can actually do cross-database queries using SPARQL 1.1 across multiple sites, e.g. beta.sparql.uniprot.org can cross-query with the EBI RDF platform as well as more data sources. However, bugs in the SPARQL implementations are still holding us back.

Identifier mapping (URI mapping) is possible using the identifiers.org SPARQL endpoint. Only in the last 12 months have people started bringing existing infrastructure to the semantic web without redoing a lot of work.

So while RDF has been around since 1999 and SPARQL since 2008, actually usable stuff has only recently appeared. This is why uptake is low: it has only seriously helped our end users for a short while.

It will take time (a lot of time) before it becomes the go-to way of doing things, because infrastructure just does not change that rapidly!

I only claim that semantics will have a bigger impact than ADAM because ADAM is just another data warehouse, while semantics allow communication between data warehouses! One is a single instance, the other is a community. Communities achieve more than individuals, even if individuals can go faster to start with.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by me ▴ 760

0

Avro is a serialization format. Avro IDL is the language describing semantics. Avro/IDL is conceptually equivalent to XML/RDF but is cleaner and more succinct in my view.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by lh3 33k

0

They are not at all conceptually similar. Avro/IDL is like XML Schema, and suffers from similar evolution issues. RDF/XML is a serialisation of the RDF conceptual model and can be translated into Turtle, JSON-LD, etc. without loss of information, because in RDF the data embeds the schema, while Avro only ships with a schema. It's a subtle difference but with large consequences.

Secondly, one can always add more RDF to an existing file and no tools break, while Avro, like XML, does not guarantee that adding information will not break existing clients. Avro is better than XML here but not perfect: adding a new enum case breaks clients coded to the previous versions.

Another thing Avro solutions lack is a way to say that this thing in this file is the same as that thing in that file. E.g. in the Avro world one can't explicitly say that a feature in some UniProt record is the same feature as in some NextProt record. In the RDF world one can, and that is very useful in our field! Also, two Avro records about the same thing can't be auto-merged like they can in RDF.
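To make the "same thing in two files" point concrete, here is a minimal Python sketch. It is not how any real triple store works internally; it only models a graph as a set of (subject, predicate, object) tuples. The identifiers `up:feature1` and `np:featureA` and the `ex:` predicates are made up for illustration, and using `owl:sameAs` for the link is my assumption about how such an equivalence would typically be stated.

```python
# A graph is modelled as a set of (subject, predicate, object) tuples.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical statements from two independent sources.
uniprot = {("up:feature1", "ex:type", "active site")}
nextprot = {("np:featureA", "ex:evidence", "curated")}

# A third party (or either source) states that the two names denote one thing.
links = {("up:feature1", SAME_AS, "np:featureA")}


def merge_same_as(graph):
    """Collapse every group of sameAs-linked names onto one canonical name."""
    canon = {}

    def find(name):
        # Follow the chain of replacements until a canonical name is reached.
        # Literals never appear in `canon`, so they are returned unchanged.
        while canon.get(name, name) != name:
            name = canon[name]
        return name

    for s, p, o in graph:
        if p == SAME_AS:
            a, b = find(s), find(o)
            if a != b:
                # Arbitrary but deterministic choice of canonical name.
                canon[max(a, b)] = min(a, b)

    # Rewrite all remaining statements using the canonical names.
    return {(find(s), p, find(o)) for s, p, o in graph if p != SAME_AS}


merged = merge_same_as(uniprot | nextprot | links)
```

After the merge, both statements hang off a single subject, which is the "auto merge" described above. Nothing in either source file had to change for this to work; only the extra link statement was added.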

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by me ▴ 760

0

This is interesting. I would like to know how it works. Suppose we want to represent a 3-column BED. IDL is something like record BED { string chr; long start, end; }. How could this be represented in RDF? Say two groups, unaware of each other, both extend BED with a score field of different meanings. How could this be handled in the RDF world? Suppose we also have the RDF for GFF where we define chr, start and end fields but with "start" being 1-based (BED is 0-based). How can we know whether chr/start in BED is the same as or different from chr/start in GFF? Thanks.

ADD REPLY • link 10.1 years ago by lh3 33k

3

First, let's talk about the two groups extending a 'BED' in RDF.

prefix : <http://example.org/my_experiment_23/>
prefix faldo: <http://biohackathon.org/resource/faldo#>
:Pos1 faldo:location [faldo:begin [faldo:position 127471196];
                       faldo:end [faldo:position 127472363 ]] .

Then Group A adds a score 'field'

prefix : <http://example.org/my_experiment_23/> 
prefix a: <http://universtityA.edu/group/lala/score> 
:Pos1 a:score 20 .

Then group B adds another score 'field'

prefix : <http://example.org/my_experiment_23/> 
prefix a: <http://consortiumB.ac.uk/stuff/score> 
:Pos1 a:score "Good" .

Then merge these two files

prefix : <http://example.org/my_experiment_23/> 
prefix us: <http://consortiumB.ac.uk/stuff/score>
prefix them: <http://universtityA.edu/group/lala/score>
:Pos1 us:score "Good" .
:Pos1 them:score 20 .

This works because both groups give their 'score' field a complete name (URI), so the two fields never collide.
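The same merge can be sketched in a few lines of Python, treating each file as a set of (subject, predicate, object) tuples. This is only an illustration of the idea, not a real RDF library; the URIs are copied from the snippets above and the helper `values` is a made-up name.

```python
# Each "file" is a set of (subject, predicate, object) tuples.
EX = "http://example.org/my_experiment_23/"
GROUP_A = "http://universtityA.edu/group/lala/score"  # prefix as written above
GROUP_B = "http://consortiumB.ac.uk/stuff/score"

graph_a = {(EX + "Pos1", GROUP_A + "score", 20)}
graph_b = {(EX + "Pos1", GROUP_B + "score", "Good")}

# Merging two graphs is a plain set union: nothing is overwritten.
merged = graph_a | graph_b


def values(graph, subject, predicate):
    """Return all objects stated for this subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


# For contrast: two records that both use the short field name "score".
record_a = {"score": 20}
record_b = {"score": "Good"}
naive = {**record_a, **record_b}  # the second value silently replaces the first
```

With full URIs as field names, `values(merged, EX + "Pos1", GROUP_A + "score")` still returns group A's number after the merge, while the naive dictionary merge keeps only one of the two scores.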

Now for the second case, about a position in a GFF file being the same thing as a position in some BED file.

Faldo in RDF takes care of this by adding a reference to any position.

So instead of saying position 1, we say position 1 on human assembly 37 as maintained by the human genome consortium. Because we are explicit, over and over again, that positions are on a reference sequence, it becomes possible to merge information from disparate sources. E.g. new developments in the RDF representation of PDB give positions on both a UniProt sequence and the primary sequence as stored at PDB.
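A small Python sketch of why the explicit reference matters. It normalizes BED (0-based start, exclusive end) and GFF (1-based, inclusive) coordinates into one representation that always carries its reference. The `Region` type, the function names, the assembly labels and the chromosome name are all made up for illustration; only the two position values come from the snippet above.

```python
from typing import NamedTuple


class Region(NamedTuple):
    reference: str  # which sequence the numbers refer to, stated explicitly
    begin: int      # 1-based, inclusive
    end: int        # 1-based, inclusive


def from_bed(assembly: str, chrom: str, start: int, end: int) -> Region:
    # BED uses a 0-based start and an exclusive end.
    return Region(f"{assembly}:{chrom}", start + 1, end)


def from_gff(assembly: str, chrom: str, start: int, end: int) -> Region:
    # GFF uses 1-based, inclusive coordinates.
    return Region(f"{assembly}:{chrom}", start, end)


bed_region = from_bed("GRCh37", "chr8", 127471195, 127472363)
gff_region = from_gff("GRCh37", "chr8", 127471196, 127472363)
other_assembly = from_bed("GRCh38", "chr8", 127471195, 127472363)
```

Here `bed_region == gff_region` holds even though the raw start columns differ, and `other_assembly` does not compare equal because the same numbers on a different reference are a different place.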

Now converting all existing files to put them into RDF is insane; that is why lazy converters are a good idea. The SPARQL-BED I linked in my first reply is some experimental code to show how one can put the information on the semantic web without dropping existing infrastructure and redoing everything, as ADAM proposes.
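As a rough idea of what "lazy" means here, the Python sketch below yields triples from BED lines only when something asks for them, so the BED file stays the storage format. This is not the SPARQL-BED code and does not reflect its design; the vocabulary URI, the subject naming scheme and the function names are assumptions made up for the example.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, object]
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary, not FALDO


def bed_triples(lines: Iterable[str], base: str) -> Iterator[Triple]:
    """Yield triples for each BED feature, one line at a time, on demand."""
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip comments, track/browser headers and lines without 3 columns.
        if (len(fields) < 3 or line.startswith("#")
                or fields[0] in ("track", "browser")):
            continue
        chrom, start, end = fields[:3]
        subject = f"{base}line{n}"
        yield (subject, VOCAB + "reference", chrom)
        yield (subject, VOCAB + "begin", int(start) + 1)  # 0-based to 1-based
        yield (subject, VOCAB + "end", int(end))


def match(triples: Iterable[Triple], predicate: str) -> Iterator[Triple]:
    """Lazily filter triples by predicate, like a one-pattern query."""
    return (t for t in triples if t[1] == predicate)


bed_lines = [
    "track name=demo",
    "chr8\t127471195\t127472363",
    "chr8\t100\t200",
]
source = bed_triples(bed_lines, "http://example.org/demo.bed#")
first_begin = next(match(source, VOCAB + "begin"))
```

Because both functions are generators, asking for the first match reads only as many lines as needed; `bed_lines` could just as well be an open file handle.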

ADD REPLY • link 10.1 years ago by me ▴ 760

0

It's essentially a triplestore?

ADD REPLY • link 10.1 years ago by Damian Kao 16k

0

Yes, where the on-disk DB format happens to be BED. And it is read-only at this time. It works the same for VCF. Progress is limited by the fact that I actually work on UniProt and rarely have time to play with genome-level infrastructure.

ADD REPLY • link 10.1 years ago by me ▴ 760

0

Thanks for the example. There are good things to learn about the RDF model. It is indeed more extensible, though I am not quite convinced that the RDF world is universally better.

ADD REPLY • link 10.1 years ago by lh3 33k

0

Universally better is too much to ask for :) RDF has its pain points. But in my opinion it is much better than all other options at making it cheaper to integrate and do science with all the data that we have generated in the last 30 years.

ADD REPLY • link 10.1 years ago by me ▴ 760

0

Although not a mandatory requirement, it helps when various groups re(use) the same ontologies to describe the entities and relationships...

ADD REPLY • link 9.9 years ago by Steve • 0

0

I agree the semantic approach is great, but the performance of triple stores has generally been an issue when doing large-scale analyses. Some multi-model approaches like https://www.arangodb.com/ look interesting for using the right technology depending on the type of question you want to ask of the data... From a pragmatic point of view RDF can be a little challenging, and efforts like JSON-LD, which seems to be getting some traction, look interesting as well...

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Steve • 0