Forum: What is the reason for most software errors in Bioinformatics, according to you?
8 • 8.8 years ago

Hello all,

I am researching which areas lead to errors in software use and development in Bioinformatics.

What is the most important learning requirement for Bioinformatics software development and use?

What issues would you like to discuss in a hands-on workshop on Bioinformatics software development and use?

Thanks,
Priyanka

software-error blast sequencing SNP R

0

Dear harne.priyanka, welcome to Biostars. Discussion is welcome in the Forum section, but please don't post invitations to take the discussion 'off-line'. Also, as a new user, you can't use this site as a booster for your LinkedIn profile. Thank you.

0

Dear Michael, thanks for letting me know. Noted!

4

See also: What Are The Most Common Stupid Mistakes In Bioinformatics? Otherwise, it would be interesting if you could tell us a bit more about your research project. I also think that the sources of errors in software development and coding are quite different from the sources of errors in software application, but that may or may not be important.

0

Yes, that makes sense: errors in development and in application would be different, and I would like to know more about both.

More on my project: I am a content developer for conferences/workshops, and I am currently researching which areas of the software side of bioinformatics need updating. I am planning a hands-on workshop and conference in October this year. The reason for posting is to make sure the 'real' issues are addressed in the conference/workshop.

Please share your perspective.

0

You should have written this as an answer :)

0

Not a direct answer, but I'm sure this could help in understanding the areas where mistakes/errors can occur: A review of bioinformatic pipeline frameworks.

0

Again, can you please disclose your affiliations?

What meeting are you organizing? Where is the location? Can you add your academic/commercial affiliations to your profile?

18 • 8.8 years ago

In my opinion, the source of most problems is very simple:

Software development practices are not rewarded proportionally to how difficult and time-consuming they are to implement properly.

The reasons boil down to the fact that the value of a typical piece of bioinformatics software is measured via the value of the new scientific discovery made with that tool, which is almost always unrelated to the quality of the software itself. There are few mechanisms in place to reward effort spent on improving an already published tool or technique, improving its documentation, adding new features to it, and so on.

15 • 8.8 years ago • brentp 24k

The simple reason that software has bugs (that result in errors) is that writing software is hard.

Let's take a simple example and say you want to write some software to determine if a genetic variant fits an autosomal dominant inheritance pattern. This is easy, right!? Just find sites where the affected parent is het, the unaffected parent is homozygous reference, and the proband is heterozygous. Done. Too easy.

But wait, what if there's another kid, who is unaffected? Not too bad. What if that kid has an unknown genotype (low sequencing depth) at a site where the other 3 samples indicate a putative candidate?

What if both parents are listed as unaffected? Do you warn and just report all sites where the kid is het?

Let's go back to the first case, with just the proband and 2 parents. What if the proband has an allele not transmitted from either parent? Sure, it's de novo, but is it autosomal dominant?

What if the user tries to run an autosomal dominant model on a sample with no parents? On parents with no kids?

What about sites where the reported REF allele is actually very rare in the population? In that case, you need to look for unaffecteds == homozygous alternate and affecteds == het.

What if the ped file indicating affection status has extra spaces around the sample names? Does your ped file (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped) use 1 as unaffected and 2 as affected, or 0 as unaffected and 1 as affected? (Both are "valid".)

Does your software correctly handle multiallelics? With REF=A and ALT=T,C?

What if the user did single-sample calling on each sample independently and then merged the VCFs post hoc? (Hint: you get no reference calls. This happens waaaaay more than it should.)

What if the grandparents were sequenced? And both grandparents are unaffected? And the user still wants to use an autosomal dominant model?

What if we add another sample, but they have unknown affection status?

Now let's say we want to add the ability to filter on depth. Well, VCF (before v4.3) doesn't have a standard for reporting depth; does your software correctly pull depth from any variant caller?

Oh, and what if, by the count of heterozygote calls on the X chromosome, it appears that grandpa is female? And, by his (lack of) transmission of alleles, he appears to be unrelated to his purported daughter? Does your software account for that?

This is why there are errors in software. It's friggin hard, man.

Addendum

By the time you've at least attempted to address all those edge cases, your code is a maze of if statements. Do your tests exercise every possible path through the maze of elses? Who knows?
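
To make the point concrete, here is a minimal sketch of just the "too easy" first pass with a few of the guards above bolted on. This is not brentp's actual code; the genotype encoding (0=hom-ref, 1=het, 2=hom-alt, None=unknown), the function names, and the PED handling are illustrative assumptions:

    def parse_ped_status(field):
        # PED affection coding: some files use 1=unaffected/2=affected,
        # others 0=unaffected/1=affected -- both are "valid".
        field = field.strip()                  # extra spaces around fields happen
        if field == "2":
            return True                        # affected in either coding
        if field == "0":
            return False                       # unaffected in either coding
        return None                            # "1" is ambiguous without knowing the coding

    def is_autosomal_dominant(genotypes, alts, proband, affected, unaffected):
        for sample in (proband, affected, unaffected):
            if genotypes.get(sample) is None:  # unknown genotype, e.g. low depth
                return False                   # ...or should this still be a candidate?
        if len(alts) > 1:                      # multiallelic site (REF=A, ALT=T,C)
            return False                       # punting here is itself a design decision
        return (genotypes[affected] == 1 and   # affected parent het
                genotypes[unaffected] == 0 and # unaffected parent hom-ref
                genotypes[proband] == 1)       # affected kid het

    # The naive happy path:
    print(is_autosomal_dominant({"kid": 1, "mom": 1, "dad": 0}, alts=["T"],
                                proband="kid", affected="mom", unaffected="dad"))

Every early return above is a judgment call a real tool has to expose, document, and test, and this sketch still ignores flipped REF alleles, grandparents, extra siblings, de novo alleles, and sample mix-ups.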

2

Here there are multiple issues conspiring: the incomplete specification of the problem, having to use a "prematurely optimized" data format (VCF), having to deal with invalid data of surprising complexity, the ethical implications of gleaning information that is private, and so on. All the while doing it for users who may not understand even a fraction of why this is difficult to do.

5

Both responses, from Brent and Istvan, really hit the nail on the head. If you start with really, really hard and complicated problems (this applies to everything in biology once you take a close look) and then fail to reward robust software development, you end up with the current mess of unsupported, poorly documented, error-prone software that litters the bioinformatics field. I personally think that institutions need to invest more in this area and that we need top-down leadership on it from funding organizations. I don't think any grants should be awarded that don't fund analysis and software development/support on par with data production. Currently, most grants fund ONLY data production, and asking for software infrastructure support is still the kiss of death for too many applications.

5 • 8.8 years ago • Michael 55k

https://www.biostars.org/t/software%20error/ shows what users of Biostars tag as 'software error'. Based on these case reports, it is not always easy to locate the PEBKAC precisely on either the developer's or the user's side.

I am going to expand this answer slightly, step by step, as I think 'software error' only scratches the surface of the problem of correctness and validity in computer science. These aspects are not specific to bioinformatics; they apply to computer science in general. First, consider the following quote (unfortunately I am unable to find the source, if any):

An error is the outcome of a creative process that becomes an error because it is not accepted.

Does this sound acceptable? Now imagine this were the attitude of the engineers who constructed the bridge you have to cross every morning: does it still sound acceptable? Yet this is pretty much how a lot of bioinformatics software is written, and the error message is just the tip of the iceberg, as if the bridge crumbled because you drove over it in an orange car when it had only ever been tested with blue and yellow cars crossing. A lot of bioinformatics software is written without formal specification, validation, or testing, and without adhering to established design principles; most bioinformatics software is a hack. The obvious software error message at least tells you immediately: "sorry, I have never seen this input data or parameter combination and definitely cannot do anything relevant with it". Also, a core dump is rarely a biologically sensible prediction.

Therefore, I would like to rephrase the question. Instead of looking at software errors only, I would like to ask:

What can be done to improve correctness of bioinformatics software?

There are at least two aspects of the quality of an implementation that we should consider separately, and that have been mixed together in the discussion earlier in this thread:

  • validity (are we doing the right thing?)
  • correctness (are we doing it right? this includes bugs, and the correctness of the algorithm)

The Application Domain of Bioinformatics is very complex

Bioinformatics deals with complex tasks because it is a scientific field, sometimes with tasks that are known or believed to be intractable, such as prediction. Look at software like I-TASSER: it takes on the nearly impossible challenge of predicting a protein's 3D structure from its sequence, and it works well for many proteins. Of course, it uses existing 3D structures (which have errors as well) and integrates a multitude of different algorithms and heuristics into its workflow. Such complex models are hard to build and hard to evaluate, and there is little independent data for testing. This again results in ill-specified or underspecified problems. The example in brentp's post nicely illustrates the specification issue even for a rather 'simple' problem.

Proving the correctness of an algorithm or implementation is hard

While studying computer science, one might get an introduction to algorithms and data structures. Such a course might include a short introduction to proving the correctness of an implementation, but the overall impression is that proving correctness is too complicated for anything except the simplest of problems. So even for well-specified problems, the solutions will involve a lot of trial and error. What is more, a computer program cannot decide whether another program will even terminate, so no automatic way of checking correctness is in sight. Automated testing is possibly underused.
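
To illustrate (a hedged sketch: the reverse_complement function and its tests are generic examples, not tied to any particular tool), even a few table-driven unit tests encode properties that are easy to state and easy to break:

    import unittest

    def reverse_complement(seq):
        # Illustrative example function: reverse-complement a DNA string.
        complement = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
        return "".join(complement[base] for base in reversed(seq.upper()))

    class TestReverseComplement(unittest.TestCase):
        def test_round_trip(self):
            # Property: applying the function twice must return the input.
            for seq in ["ACGT", "AAAA", "NNN", ""]:
                self.assertEqual(reverse_complement(reverse_complement(seq)), seq)

        def test_lowercase_and_ambiguous_bases(self):
            # Edge cases: lowercase input and ambiguous N bases.
            self.assertEqual(reverse_complement("acgt"), "ACGT")
            self.assertEqual(reverse_complement("ANT"), "ANT")

    if __name__ == "__main__":
        unittest.main()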

Writing bioinformatics code is difficult because developers need to understand biology and programming

People who develop bioinformatics software have to understand both worlds, the application domain and programming, but they often have a focus on one of them. For example, they might be biologists with limited training in computer science, or vice versa; the teams that develop software might not represent both worlds equally either, and smaller programs are often developed by a single person. There is always a trade-off in how good one can be in multiple domains. Developers might or might not be trained in modern concepts of software development and computer programming, such as test-driven development, pair programming, design patterns, and revision control.

Sustainability, documentation, and maintenance are not rewarded for scientific software

This refers to the aspect Istvan has already mentioned. If software is developed by a single PhD student or post-doc, there is not always much capacity, funding, or incentive to maintain the software or fix errors after that person moves on. Also, fixing bugs or improving existing software doesn't yield publications, so there may be more incentive to develop new software with other, but not necessarily fewer, flaws.

1

Just added 'PEBKAC' to my vocabulary!

1 • 8.8 years ago

I think that there are two sorts of errors:

  • Bugs in software code: typos, improper handling of artifacts (try BLASTing the sequence NNNNNNNNNNNNNN) and of complex inputs, etc. These are typically easy to catch (though some parts, such as parallel processing, can be a real headache) by opening your code to the public, using an issue tracker, and writing unit tests; see the sketch after this list.

  • Logical errors in models, algorithms, and pipelines. The hard part here is that the real data that goes into the software is highly diverse and almost always different from the synthetic and random datasets used in testing. As an example, imagine sequencing data from a 100-exon panel destined for variant calling. One could try to speed up the pipeline by generating a reference containing only those exons and running a favourite aligner against it. This will work fine; however, if a retrogene highly similar to one of those exons exists elsewhere in the genome, its reads will be forced onto the exon and produce a consistent set of false positives indistinguishable from real variants. Spotting such errors requires both a deep understanding of the subject and input from collaborators.
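
As a sketch of the first category (the gc_content function here is purely illustrative), this is the kind of artifact-handling test that catches the 'BLAST the NNNNNNNNNNNNNN sequence' class of bug:

    def gc_content(seq):
        # Fraction of G/C among *called* bases. Naive versions divide by
        # len(seq), silently returning a bogus 0.0 for an all-N artifact
        # sequence, or crash outright on empty input.
        called = [b for b in seq.upper() if b in "ACGT"]
        if not called:
            return None    # explicit "no information", not a fake 0.0
        return sum(b in "GC" for b in called) / float(len(called))

    # Edge cases that artifact-laden real data will hit sooner or later:
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("NNNNNNNNNNNNNN") is None   # the all-N artifact
    assert gc_content("") is None                 # empty input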

1 • 8.8 years ago

Dear,

I think software errors are inevitable. However, if you are going to rank evils, then I would say the greatest evil in software is the inclusion of unnecessary higher-order functions. A piece of software can be viewed as a system of cogwheels where each cogwheel should be stationary. If one or more of your cogwheels/gears keeps vibrating or wandering about, you increase the fragility of the system.

Your functions should do one thing, and only one thing, at a time. They should be specialized and orthogonal to each other. You should try to cut the number of higher-order functions down to a bare minimum.
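
As a minimal sketch of this single-purpose style (the FASTA example and all names here are illustrative assumptions, and the split is deliberately taken to the simple extreme):

    def read_fasta_sequences(path):
        # Job 1: parse a FASTA file into a list of sequence strings. No stats.
        sequences, current = [], []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if current:
                        sequences.append("".join(current))
                    current = []
                elif line:
                    current.append(line)
        if current:
            sequences.append("".join(current))
        return sequences

    def mean_length(sequences):
        # Job 2: compute the mean length. No I/O, no parsing.
        if not sequences:
            return 0.0
        return sum(len(s) for s in sequences) / float(len(sequences))

    # The tempting alternative is one configurable, higher-order
    # do_everything(path, parser=..., stat=..., formatter=...) whose
    # behaviour depends on which "gears" the caller plugs in.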

Of course, in reality this is seldom possible. If you apply these points literally in your software, you spend all your time on code design, introduce too much rigidity into your code, and make little progress in terms of output. If you want to progress fast, your code incorporates too much chaos.

For me, good software is software that keeps the balance. Errors are what you get when your design process tilts too far one way or the other.

Regards,

0 • 8.8 years ago

IMO, the biggest source of usage error (particularly for newbies) is a lack of version control/validation for file formats and reference genomes.

0

Thanks for sharing this, harold.smith.tarheel. Can I also get your opinion on one more thing: if you were to attend a workshop on software for Bioinformatics, which recent updates would you find interesting to hear about?

0 • 8.8 years ago • Anima Mundi ★ 2.9k

At least for me, a relevant source of software errors is semantic errors made while scripting. Some time ago I spent almost a month on a script that did not behave exactly as I intended it to; in the end I just gave up!

0 • 8.8 years ago • Naren ▴ 1000

Most of my errors are related to dependencies that are updated automatically and are not compatible with old code. Some of my tools worked in Python 2, but I updated to Python 3 out of excitement and lost compatibility.
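
A cheap, hedged guard against exactly this (the version floor is an arbitrary example): fail fast with a clear message instead of crashing mid-analysis under the wrong interpreter:

    import sys

    # Fail fast when run under an interpreter the script was not written for,
    # instead of dying halfway through an analysis. The 2.7 floor here is an
    # arbitrary example.
    if sys.version_info[0] != 2 or sys.version_info < (2, 7):
        sys.stderr.write("This script requires Python 2.7; you are running %s\n"
                         % sys.version.split()[0])
        sys.exit(1)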

0

I've probably found the right person to ask this question and will hopefully receive an answer. As far as I know, new versions of a programming language are usually compatible with the old ones. Why are Python 2 and Python 3 an exception? There are something like 2,500 programming languages now; isn't backward compatibility the rule in general? I worry about myself: I've learnt Python 2 and now I am at a loss. Should I change to Python 3 and learn everything again? Right now the difference is not very large, I know, but I am afraid it will become wider and deeper in the future. Many thanks for your answer!

1

Wiki.python.org goes into depth on the differences between Python 2.7 and 3.3, saying that there are benefits to each. It really depends on what you are trying to achieve. But, in summary: "Python 2.x is legacy, Python 3.x is the present and future of the language."

There are subtle differences between the two, but the most visible is probably the way the "print" statement works. It's different enough that the same script won't run on both versions, but pick one and you'll be fine.
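
A minimal illustration (note that the whole snippet is valid Python 2, while the first print line is a SyntaxError under Python 3):

    # Python 2: print is a statement (a keyword), not a function.
    print "hello", "world"      # SyntaxError under Python 3

    # Python 3: print is an ordinary function.
    print("hello", "world")     # Python 3 prints: hello world
                                # Python 2 prints the tuple ('hello', 'world'),
                                # silently doing the wrong thing.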

1

It does not matter which computer language you learn; what is important is that you learn how to think in terms that can be translated into a computer language. Building up programming skills takes years and tens of thousands of lines of code, but once you know how to program well in any language, it takes little more than weeks, or perhaps a month or two, to switch to another language and be productive with it.

0

When Python was designed, they got many things right, but some things were very wrong. For example, "True" and "False" are variables in Python 2, not, as you might expect, immutable values of truthiness and falsiness.
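
A minimal demonstration of this, along the lines of what the original screenshot presumably showed (Python 2 only; the first statement is a SyntaxError in Python 3):

    # Python 2 only: True and False are ordinary, rebindable names.
    True = False
    print True            # prints: False
    print True == False   # prints: True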

This can become more than a joke/annoyance as you start to write more performant code. In truth, Python can be as fast as any other VM-compiled code (such as Java), but things like this really don't help, because if you write something like "if x: y = True; else: y = False", Python has to go and find out what the values of True and False are before it knows what to set y to. This makes that line of code about 40% slower. The reason given for this by Guido was "Special cases aren't special enough"; however, in Python 3 they decided True/False maybe are special enough, and now they're keywords, as they probably always should have been.

Conversely, in Python 2 "print" is a keyword rather than a function. For any other function you would write somefunction(input_data); print should be no different. So in Python 3, print is a function rather than a special keyword. Another difference is how they treat Unicode: Python 2 doesn't really distinguish Unicode text from raw bytes, while Python 3 does. Does it matter? No, probably not. Also worth mentioning: Python 2 used to be a lot faster than Python 3, but that's no longer the case; they're essentially the same now.

So a lot of these little things had to be fixed up, and that may break backwards compatibility. For this reason, Python 3 was never meant to be compatible with Python 2. However, this has been going on for years now, and these days all the core differences between Python 2 and Python 3 are essentially set in stone. Beyond that, more and more features are being added to Python 3 that weren't in Python 2: type hints, iteration optimizations, etc. None of this will be a disaster for a Python 2 programmer; rather, it's an added bonus for a Python 3 programmer. Having said all that, I want to end with this: I couldn't agree more with what Istvan said. It really doesn't matter. 90% of what you learn as a newcomer has nothing to do with the language; it is all about knowing which data structure is most appropriate for the job, how to order your thoughts logically, etc. That stuff is true of all programming languages, and 5 years from now we'll all be using Julia anyway ;)

1

IMHO, the only substantial and unfixable problem with Python 2 is the unexpected scoping of variables. Most beginners and even advanced programmers don't need it, and of course, since it is broken, they will never even find out what it is about, and so they lose the ability to think about programming at a higher level.

Upon parsing the code, Python decides which variables are global and which are local, and that makes a certain type of coding impossible: closures, that is, functions that return functions closed over a variable. All the other issues with Python 2 are, IMO, mostly cosmetic and subjective opinions that could swing either way.

Example: This works (a is in global scope):

a = 10
def foo():
    print a
foo()

The code below does not work; it raises an UnboundLocalError (a is treated as local and is uninitialized at the assignment step):

a = 10
def foo():
    a = a + 1
    print a
foo()

Python 3 introduced the nonlocal keyword to fix that, but the point is that one should not need to declare a variable nonlocal; the default scoping should have been nonlocal to begin with.
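
For completeness, a minimal sketch of the closure case working under Python 3 (for the module-level foo example above, the corresponding keyword would be global):

    # Python 3: a counter built as a closure over the enclosing variable n.
    def make_counter():
        n = 0
        def increment():
            nonlocal n        # without this, "n = n + 1" raises UnboundLocalError
            n = n + 1
            return n
        return increment

    counter = make_counter()
    print(counter())    # 1
    print(counter())    # 2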

0

Heheh, there are a bunch of good ones there :) But some are kind of silly. The indentation is probably the best thing about Python, rather than a problem. And this little gem:

the syntax for conditional-as-expression is awkward in Python (x if cond else y). Compare to C-like languages: (cond ? x : y)

shows that the person writing this list is just making it up as they go along :P
