Question

Forum: Thoughts on Bioinformatics and programming

9

8.8 years ago

John 13k

In the past, I've been stung by some pretty bad software that was doing silly things that no one in their right mind would have thought would be OK. Unfortunately, the "result" of this tool looked valid (or rather, there was nothing to compare it to), and so until I read the source code, I didn't actually know anything was wrong.

Since that day, I've been a pretty stubborn Bioinformatician. Except for the very complicated and public tools like the mappers or samtools, I don't run software unless I can read the code. Unfortunately, the only bioinformatics-relevant programming language I know is Python, so that somewhat limits how much code I can realistically use. The one big exception is Picard, which is all Java, simply because I trust its authors. So yes, I am also a hypocrite. I would also consider being able to run awk and other unix tools in a pipe to be programming in a wider sense - however, it's a bit of a grey area, since you don't really know what sort or uniq are doing (e.g. uniq on OS X only looking at the first 8 kB of data per line).
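One way to cross-check a `sort | uniq` step whose internals you can't inspect is to count distinct lines yourself. The snippet below is a minimal sketch, not a replacement for uniq: `count_unique_lines` is a hypothetical helper name, the test lines are made up, and the 8 kB limit mentioned above is taken as reported rather than verified here.

```python
from collections import Counter
from typing import Iterable


def count_unique_lines(lines: Iterable[str]) -> Counter:
    """Count exact occurrences of each line, comparing whole lines."""
    return Counter(line.rstrip("\n") for line in lines)


# Hypothetical long lines that only differ after the first 10,000 characters.
prefix = "A" * 10_000
lines = [prefix + "C\n", prefix + "G\n", prefix + "C\n"]

counts = count_unique_lines(lines)
print("distinct lines:", len(counts))
```

If the number of distinct lines reported here disagrees with the shell pipeline on the same input, that is a hint that one of the two is not comparing whole lines.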

So that's how I behave in practice. However, when people ask me "do you need to be a programmer to do Bioinformatics?" I say "no way - writing tools is not something you should have to do!".

Recently however, I've been doing a lot of mapping - so I've had some time on my hands to really think about this. The result is a few rhetorical questions, and I would be interested to hear people's thoughts :)

1. If you publish findings using a tool that gave incorrect results, is that your fault? How much of the blame would you accept?
2. If you ask (or have been asked) to use a new tool for a certain analysis, will you (or your boss) set aside time in the plan for reading the code?
3. Do you ever use closed-source tools in any step of your workflow? (that have an impact, not, say Oracle Database or a text editor)
4. Do you think you need to be a programmer to be a Bioinformatician?

The last question is not "to get results" but to truly be a Bioinformatician. I am frequently told by my boss that I'm not a bioinformatician, I'm a PhD student, so here when I use the term Bioinformatician I really do mean someone who knows what they're doing. Not someone who can simply get some sort of result, but someone who gets the best result, and knows why it's the best.

Thank you so much for your time :)

programming • 2.5k views

updated 20 months ago by Ram 44k • written 8.8 years ago by John 13k

2

@John: I will add this as a comment since I don't think it qualifies as a full answer.
First of all, bioinformatics tools generate hypotheses. In most cases a researcher should (needs to) independently verify those hypotheses, preferably by experimental means. So even if one were to get incorrect results from an informatics tool, the hope is that the error would be caught during independent verification (if Murphy's law applies then all bets are off, but then you may not be the first person to have had that happen).
There needs to be an "applied" bioinformatician classification that the majority probably fall into. We are not programmers and can't write standalone packages, but at the same time we have enough knowledge to tweak output as needed and to make sense of how to use the code that gets put out. Ultimately, the domain experts you are working with need to make sense of, and check the validity of, any results you (or the software being used) produce.

8.8 years ago by GenoMax 148k

0

You make a very good point about secondary validation nullifying any systematic errors in the bioinformatic steps. In practice I wish this was done more often - but perhaps it's not as easy as it sounds. If there is a totally different biological assay (which requires totally different tools to analyse) to get the same information, then yes. But without that secondary assay, your options would be to use different software that aims to do the same thing - which may not be compatible by design. For example, CuffDiff for some samples, DESeq for others. Now they cluster by analysis software, not biology - so which one is right? Without knowledge of how the program works, it's difficult to weigh up their pros and cons. But at least you now know there are differences! :)

8.8 years ago by John 13k

score 4 · Answer 1 · 2016-03-21

I guess this boils down to your definition of what a bioinformatician is, and there's still no real consensus on that. I'd say a bioinformatician is someone who analyses biological data and spends the majority of her time in front of the computer; you'd be surprised how much you can get done with a point-and-click interface.

If you publish findings using a tool that gave incorrect results, is that your fault? How much of the blame would you accept?

If I didn't properly check the results, I accept the fault and retract - did other tools agree with the result? Did the result make biological sense? Could we run PCR or anything to replicate the results? Does the result fit with what's known? This thread is a recent example from my own work. I sat forever on the results from this paper because initially, the number of crossovers made absolutely no sense until we realized that the increased number is probably due to misassemblies similar to what http://elifesciences.org/content/2/e01426v1 did.
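The "did other tools agree?" check can be made concrete by comparing the calls from two tools. What follows is only a sketch under assumptions: the gene names and the two result sets are invented for illustration, and a plain Jaccard overlap is just one simple way to measure agreement, not the method used in the work described above.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by all items seen; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical significant-gene calls from two different tools.
tool_a = {"geneA", "geneB", "geneC", "geneD"}
tool_b = {"geneB", "geneC", "geneD", "geneE"}

overlap = jaccard(tool_a, tool_b)
only_a = sorted(tool_a - tool_b)  # calls unique to the first tool
only_b = sorted(tool_b - tool_a)  # calls unique to the second tool

print(f"Jaccard overlap: {overlap:.2f}")
print("only in tool A:", only_a)
print("only in tool B:", only_b)
```

The calls unique to one tool are usually the interesting part: they are the ones worth checking against biology or a follow-up experiment before publishing.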

If you ask (or have been asked) to use a new tool for a certain analysis, will you (or your boss) set aside time in the plan for reading the code?

Most of the time I won't do that, for reasons of time. I could go through the code-base, try to make sense of the author's comments, and grapple with the author's coding style. I sometimes do this when I run into a weird error, like a segmentation fault. If you want to have time to publish, you can't read the code of every tool you use; your PhD is only 3 years long.

Do you ever use closed-source tools in any step of your workflow? (that have an impact, not, say Oracle Database or a text editor)

I do use the closed-source SOAPaligner, which mostly agrees with what I've seen with BWA/Bowtie; I just like the -r settings there. I can't think of any other closed-source tool I use at the moment.

Do you think you need to be a programmer to be a Bioinformatician?

This ties into the above question of definition - I don't think you have to be a programmer, but you will be a much more efficient worker if you can throw your analyses in a for loop, or write your own script to convert data formats.
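As an illustration of the kind of small script meant here, the sketch below converts FASTQ to FASTA and runs the conversion in a for loop over several samples. It assumes simple four-line FASTQ records; the function names and the in-memory sample data are hypothetical, and a real script would loop over file paths instead.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, assuming exactly 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored for FASTA output
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        yield header[1:].rstrip("\n"), seq


def fastq_to_fasta(src: TextIO, dst: TextIO) -> int:
    """Write each record as FASTA and return the number of records written."""
    n = 0
    for name, seq in read_fastq(src):
        dst.write(f">{name}\n{seq}\n")
        n += 1
    return n


# Hypothetical in-memory samples standing in for files on disk.
samples = {
    "sample1": "@read1\nACGT\n+\nIIII\n",
    "sample2": "@read2\nGGCC\n+\nIIII\n@read3\nTTAA\n+\nIIII\n",
}

converted = {}
for sample_name, text in samples.items():
    out = io.StringIO()
    fastq_to_fasta(io.StringIO(text), out)
    converted[sample_name] = out.getvalue()

print(converted["sample2"])
```

For real data, swapping the `io.StringIO` objects for `open(path)` handles is enough; multi-line FASTQ records would need a proper parser.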

score 4 · Answer 2 · 2016-03-21

4

8.8 years ago

Devon Ryan 105k

1. Relatively little if the results looked reasonable, passed some reasonable validation, etc. This is no different from wet-lab experiments where people are bitten all the time by crap antibodies or ELISAs.
2. No, relatively few people will ever read the code, just as relatively few people will ever validate their commercial kits.
3. No
4. I guess it depends on where the line between bioinformatician and data analyst is, if such a distinction exists. In general you don't really need to be able to code for either, though it's more likely to be needed in practice for the former rather than the latter [1].

[1] I don't consider doing a bit of analysis in R or Python as "coding" in this context. I'm thinking more, "make a package for PyPI or Bioconductor" for that.

8.8 years ago by Devon Ryan 105k

0

Great point about validating the source being the same as validating a commercial kit or antibody. I hadn't thought of it like that. I suppose one difference is that when you buy a Qiagen kit, you are protected by the fact that money changed hands and the manufacturers have a responsibility that the product should do what it says it does -- more so than open source software distributed as-is with no liability. However, in practice no one ever holds Qiagen's feet to the fire when their kits don't work, so my point is kind of irrelevant. A week ago I opened a box containing a filter column for making the Bioanalyzer gels, and the filter column had cracked in half. Solution? Order another one. So practically, I totally agree - if Science has unanimously agreed to forgive and forget in order to move forwards faster, then that applies to the code too.

8.8 years ago by John 13k

1

If anyone thinks paying a company for a product is any protection then they haven't been in science long enough :P

The same goes for commercial software/analysis. My path to bioinformatics started when a company completely screwed up an analysis (using their proprietary commercial pipeline) when I was a post-doc, thereby wasting a nice-sized wad of cash. Were it not for that, I'd probably still be doing neuroscience :P

8.8 years ago by Devon Ryan 105k

score 3 · Answer 3 · 2016-03-21

I'll probably be among the least experienced people here, but I'd like to put in my two cents for whatever it's worth. I'm also interested in the varying views between those who can read and write only minimal amounts of code due to lack of experience, such as myself, and those of you who are much more experienced and versed in programming.

I'll be the first to admit that much of my job ... come to think of it, 95% of my job relies on open-source software for data analysis. My wet lab workers feed me different sequencing data and it then becomes my job to read and read ... and read papers until I am relatively confident in my results. Only recently have I begun having a hand in generating the sequencing data myself. My coding ability is limited to a bit of Python and a bit of R.

1. I personally feel that publishing findings with incorrect results is in large part your fault. There may be SOME blame on the author of the tool, but unless it is very specialized software you can usually find something to compare it to: a paper, another tool, etc. I'd accept the majority of the blame.
2. My PI has little to no experience in reading code, so this probably wouldn't happen. I also admit to not specifically taking time to read the majority of the code I use. This is in large part because I understand little of it.
3. I've never used any closed-source software in any of my data analysis.
4. When people ask me what I do for a living, I always refer to myself as a computational biologist. I think being able to consider yourself a Bioinformatician requires a much larger degree of expertise in programming and all the areas it encompasses.