Question

Self Taught, Where To Start With Bioinformatics Data Management?

10

14.5 years ago

Blunders ★ 1.1k

BACKGROUND: It's possible that I will be hired to do basic data management for a bioinformatics company in the next week. I've got a fair amount of experience in data management, but have never worked in the bioinformatics industry.

GOAL: I'm looking to start rounding out my understanding of major topics related to data management within bioinformatics.

Examples: Tools, Industry Standards, Data Quality Methods, Public Data Sources, MetaData Standards, Existing Open Source Code related to the data quality/management, annotation systems, BioHDF/HDF5 , etc.

Comments/Feedback: Not sure if the question makes sense, or if it's even a good question. But it's my first here on BioStar, so please go easy on me and feel free to comment if I can provide additional information -- or update/delete the question... :-) ...one thing I might add is cheap/free is better, but that does not mean it's always the best option in terms of total cost of ownership. Again, thanks -- and feel free to post any and all information related to bioinformatics data management!

UPDATES:

If anyone wants to edit the tags and add "data management" please feel free to do so.
Related BioStar searches: data-management
My long-term focus is to grow the body of resources on BioStar related to data management, since there do not appear to be many questions/answers on the topic.

• 8.1k views

ADD COMMENT • link updated 14.3 years ago by Neilfws 49k • written 14.5 years ago by Blunders ★ 1.1k

Ram · Answer 1 · 2010-11-04

If you have experience in data management, you're ahead of many people in life sciences already. The article A Quick Guide to Organizing Computational Biology Projects is a collection of such obvious, common-sense tips as "have a sensible file/directory hierarchy". I'm not sure which is more depressing: that this was deemed worthy of an academic article, or that it needed to be written in the first place.

In terms of biological data, specifically, I think these points are worth bearing in mind:

There's a lot of it

An obvious point, but there is a huge amount of biological data available. 10 years ago, it was probably feasible to download most of it via FTP to local storage. Today, we have much more storage but not the bandwidth - you'd be waiting for months. So we rely more on remote data stores. Which raises the question: how do you move the computational analysis to the data? That's why people are talking up "the cloud".

It's growing exponentially

Search this site for "growth of biological databases", or similar. 10 years ago there was one human genome sequence (based on several individuals) - 3 Gb. In 10 years' time there may be thousands. And that's just one of what, 10 million species?

It's extremely diverse

Biological data come from all fields of biology and in many formats. The largest amount of data is that generated by high-throughput methods: sequencing (nucleic acid and protein) and microarray technology. Structural (X-ray crystallography, NMR) and metabolic/pathway data are also major and growing components. Not to mention the literature databases and data from other areas of biology, such as ecology - geospatial/mapping, population studies.

Every field has its own formats (frequently reinvented many times) and tools. You'll need to gain a broad overview of what's relevant for you.

Versioning is an issue

Primary data are always being updated. So are the results of our analyses, as we develop or discover new computational tools. Data versioning is not feasible. What is feasible: maintain read-only "master copies" of primary data, version your code, then you can say "this input + this code version will reproducibly generate this output."
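To make the "this input + this code version" idea concrete, here is a minimal sketch in Python. It is only an illustration, not a method described in this thread: the function names, the manifest layout and the `code_version` string are all assumptions. It records a checksum for each read-only master copy alongside the version of the code that was run.

```python
import hashlib
import json
import os
import stat


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path):
    """Clear the write bits so a master copy is not edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def build_manifest(input_paths, code_version):
    """Tie a set of inputs (by checksum) to one code version.

    code_version is whatever identifies the code, e.g. a commit ID
    supplied by the caller (hypothetical; nothing here queries a VCS).
    Inputs are keyed by file name, so names are assumed to be unique.
    """
    return {
        "code_version": code_version,
        "inputs": {os.path.basename(p): sha256_of(p) for p in sorted(input_paths)},
    }


def write_manifest(manifest, path):
    """Store the manifest next to the results as sorted, indented JSON."""
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

Saving such a manifest with every set of results lets you check later whether the inputs have changed since the analysis was run, without versioning the data files themselves.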

There are data standards but they are frequently ignored or abused

Biologists don't care about data standards: say "XML", you will get a blank stare. However, many attempts have been made: some are more successful than others but you'll find that they are simply not enforced. Classic example: the NCBI GEO microarray database, which purports to use standards but relaxes them (otherwise nobody would submit records), to the degree that keys/values are practically optional and arbitrary. This makes large-scale analysis across the entire dataset challenging, to say the least.

The (possibly-indexed) flat file is still king in bioinformatics.
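For readers new to the idea, an "indexed flat file" can be sketched in a few lines of Python. This is a simplified illustration with made-up function names, assuming a FASTA file with ASCII record IDs; real tools such as `samtools faidx` store a richer index on disk. The index maps each record ID to the byte offset of its header line, so a single record can be read without scanning the whole file.

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:  # ignore headers with no ID
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(path, index, record_id):
    """Seek straight to one record and return its sequence as a string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip().decode("ascii"))
    return "".join(parts)
```

The index is small compared with the data, so it can be kept in memory or saved next to the file and rebuilt whenever the flat file changes.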

There are many public resources but only a few key resources

Every year, the journal Nucleic Acids Research publishes special issues that describe hundreds of online databases and web applications. As we noted in another question, these resources are frequently not persistent. For someone coming into the field, I'd recommend that you focus on the major public resources: NCBI, EBI, Ensembl, PDB, KEGG. If your field is more specific, identify the major resources in that area (e.g. IMG for microbial genomics).

Don't expect to find great APIs for public data. They're improving, but were not even considered until quite recently. Remember, many of these resources have been around for 20 years or more, so there's a lot of legacy technology behind them.

In terms of technologies that can help you: (1) Databases, of course. There's growing interest in so-called "NoSQL" solutions, but it's worth looking at older projects such as BioSQL. (2) Open-source bioinformatics libraries, particularly the so-called 'Bio*' projects: BioPerl, BioRuby, Biopython, BioJava. (3) Lots of basic Linux/UNIX command-line skills.

Ram · Answer 2 · 2010-11-04

2

14.5 years ago

Andrea_Bio ★ 2.9k

This paper lists all of the bioinformatics buzzwords that you might want to know, although I don't remember much mention of ontologies, which are a big thing in interoperability.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by Andrea_Bio ★ 2.9k

2

Ontologies are beloved of ontology developers, but not widely deployed by anyone else (e.g. biologists), in my experience.

ADD REPLY • link 14.5 years ago by Neilfws 49k

1

I think people talk about ontologies a lot and you have to appreciate that they are important otherwise you offend the people who painstakingly spend years building them, but neilfws may well be right that they aren't widely used in practice.

ADD REPLY • link 14.5 years ago by Andrea_Bio ★ 2.9k

1

I just thought that you might also want to be aware of the notion of community curation. That is becoming more popular but naturally has a huge impact on data quality.

ADD REPLY • link 14.5 years ago by Andrea_Bio ★ 2.9k

0

@andrea_bio: +1 Thanks, related statement: "Data quality, in turn, is a function of consistent analysis methodology, standard ontology, vocabularies, and dictionaries, and vetting/approval of annotations, not to mention the all-important pruning of bad content." Without knowing the bioinformatics workflow, the statement appears to be on point, and related to your point on ontological consistency being vital to interoperability.

ADD REPLY • link 14.5 years ago by Blunders ★ 1.1k

0

@andrea_bio: Yes, I agree - but also based on my experience the overhead for end users in the short-term for developing ontologies and using them is often too much; which is not to say they're not important.

ADD REPLY • link 14.5 years ago by Blunders ★ 1.1k

score 1 · Answer 3 · 2010-11-04

1

14.5 years ago

Mndoci ★ 1.2k

One piece of advice. As you think about data architectures, etc, make sure you collaborate with a good bioinformatician. Making sure your data structures make biological sense is critical.

ADD COMMENT • link 14.5 years ago by Mndoci ★ 1.2k

0

@mndoci: +1 Thanks for posting, and yes, I'm just the "tech wiz" -- all implementations will be driven by end user requirements/needs... :-)

ADD REPLY • link 14.5 years ago by Blunders ★ 1.1k