Question

addressing software versions

6.8 years ago

prasundutta87 ▴ 670

Hi,

I started using GATK v4.0.0.0 on some of my WGS samples a few days ago. A few samples failed with the error "HaplotypeCaller exception: contig must be non-null and not equal to *, and start must be >= 1". I checked online and found that this bug was fixed in a newer sub-version, so I downloaded GATK v4.0.1.2 and started running it on the samples that had failed.

New major versions (as opposed to sub-versions) are mostly released when there is a major change in the code or algorithm of a tool. It is common in the bioinformatics community for a new version of a tool to be released, for bugs to be reported, and for new sub-versions to follow within a few days or months, leaving users in a fix.

In my case, the underlying algorithm did not change; only some bugs were fixed. Should I rerun the updated sub-version on the samples that succeeded, so that all samples are processed with the same version, even though I expect the final output not to change? Or should I just report that GATK 4.0 was used and not mention the sub-version at all? What is the best practice in this case?

best practices gatk software error • 1.2k views

score 0 · Answer 1 · 2018-02-11

6.8 years ago

GenoMax 147k

If you are comparing multiple samples and reporting the findings in a single publication, then preferably all samples should be analyzed using identical version(s) of the software packages to prevent any unseen bias. Being able to reproduce the results you report (as long as an identical version of the software is used) is important for reproducible research. To facilitate that, no information should be considered insignificant. The best policy is to report accurate metadata for all data and informatics software/pipelines.
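One way to capture the software metadata mentioned above is to record each tool's reported version alongside the results. The following is only a minimal sketch: the function names and the tool list are hypothetical, and it assumes each tool prints its version when called with some version command (check the documentation of each tool for the right one).

```python
import json
import subprocess
import sys


def capture_version(command):
    """Run a version command and return its first non-empty output line."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # Some tools print their version to stderr, so look at both streams.
    text = result.stdout + result.stderr
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[0] if lines else ""


def build_manifest(tools):
    """Map each tool name to the version string it reports."""
    return {name: capture_version(command) for name, command in tools.items()}


if __name__ == "__main__":
    # Hypothetical tool list; replace with the commands used in your pipeline.
    tools = {"python": [sys.executable, "--version"]}
    print(json.dumps(build_manifest(tools), indent=2))
```

Saving the resulting JSON next to each sample's output makes it easy to state the exact versions, sub-versions included, when writing up the methods.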

I agree with that. From a publication point of view, it definitely makes sense. I am just concerned about the time being wasted.

Reply by prasundutta87 ▴ 670, 6.8 years ago

score 0 · Answer 2 · 2018-02-11

6.8 years ago

Devon Ryan 104k

If you look at the release notes for version 4.0.0.0 (it annoys me that they use an extra digit in their versioning), you'll see that aside from this bugfix, they also fixed a bug relating to -mbq being ignored before. If you used that, then I would suggest rerunning the variant calling on all of the samples. If you didn't use that, then presumably the results would be identical (sans the problematic sample). If you want to be sure, run one of the non-problematic samples and compare the results.

For what it's worth, the best practice would be to rerun all of the samples...but the best practice isn't always the most sensible one.
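The comparison suggested above (rerun one non-problematic sample and check whether the output changed) could be sketched roughly as follows. This assumes the two outputs are VCF/GVCF text files, optionally gzip-compressed, and that header lines (those starting with "#") should be ignored, since they can differ between runs even when the records do not. The function names and file paths are hypothetical.

```python
import gzip
from itertools import zip_longest


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def records(handle):
    """Yield the non-header lines of a VCF/GVCF file."""
    for line in handle:
        if not line.startswith("#"):
            yield line.rstrip("\n")


def first_difference(path_a, path_b):
    """Return (index, record_a, record_b) for the first mismatch, or None."""
    with open_text(path_a) as a, open_text(path_b) as b:
        pairs = zip_longest(records(a), records(b))
        for index, (rec_a, rec_b) in enumerate(pairs):
            # A missing record (None) means one file is longer than the other.
            if rec_a != rec_b:
                return index, rec_a, rec_b
    return None
```

For example, `first_difference("sample1.v4.0.0.0.g.vcf.gz", "sample1.v4.0.1.2.g.vcf.gz")` (hypothetical file names) returns `None` when every record matches. The files are streamed rather than loaded whole, because whole-genome GVCFs can be large.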

Luckily, I ran with default parameters and had not set -mbq, and I am now running the GVCF generation per sample. Theoretically, the results should not change at all. But again, as GenoMax suggested, from a publication point of view this change of version will be difficult to sell. The reviewers will also question me if I mention that the sub-versions differed.

Reply by prasundutta87 ▴ 670, 6.8 years ago