Question

Error Correction Software for Variant Calling

0

8.2 years ago

greenstick ▴ 10

I've found that many error correction algorithms for sequencing data are designed for the de novo assembly use case, but for MSA-based analyses (specifically variant calling) the options seem limited. I'm hoping someone here can weigh in: is there any error correction software that you use for variant calling?

Of course, I'm sure most people will say no, but I'm interested in those who are using such software; one of my areas of interest has to do with the approaches such packages may be using to distinguish between true error and true signal.

Thanks

genome sequence alignment snp • 1.9k views

ADD COMMENT • link updated 8.2 years ago by Brian Bushnell 20k • written 8.2 years ago by greenstick ▴ 10

score 1 · Answer 1 · 2017-02-19

1

8.2 years ago

Brian Bushnell 20k

I don't use error-correction with variant-calling, but then I mainly do assembly, and I just haven't had an opportunity to test it. I think it's probably fine to use conservative error-correction with variant-calling as long as you are not using amplified data or looking for low-frequency variants.

Also, if you have overlapping paired reads, you can use BBMerge's error-correction with variant-calling without risk, like this:

bbmerge.sh in1=r1.fq in2=r2.fq out=corrected.fq ecco mix

That does correction purely by overlap, so there is no risk of rare variants getting corrected into the majority allele.
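To illustrate why overlap-only correction cannot pull a rare variant toward the majority allele, here is a minimal Python sketch. It is not BBMerge's actual implementation; the function name, the assumption that the overlap offset is already known, and the "higher quality wins, ties are left alone" rule are all simplifications chosen for the example. The point is that the only evidence used is the two reads of a single pair, never counts from other reads.

```python
def correct_by_overlap(r1, q1, r2rc, q2rc, offset):
    """Correct mismatches inside the overlap of one read pair (illustrative only).

    r1, q1     : forward read bases and per-base quality scores
    r2rc, q2rc : reverse read, already reverse-complemented, with qualities
                 reversed to match
    offset     : index in r1 where r2rc begins (assumed known, 0 <= offset < len(r1))
    """
    a, b = list(r1), list(r2rc)
    overlap = min(len(r1) - offset, len(r2rc))
    for j in range(overlap):
        i = offset + j
        if a[i] == b[j]:
            continue  # the two reads agree; nothing to do
        if q1[i] > q2rc[j]:
            b[j] = a[i]  # trust the higher-quality call from read 1
        elif q2rc[j] > q1[i]:
            a[i] = b[j]  # trust the higher-quality call from read 2
        # equal quality: no basis for choosing, so both bases are left as-is
    return "".join(a), "".join(b)


if __name__ == "__main__":
    fixed1, fixed2 = correct_by_overlap(
        "ACGTACGT", [30] * 8,
        "ACGAAA", [30, 30, 30, 10, 30, 30],
        offset=4,
    )
    print(fixed1, fixed2)
```

Bases outside the overlap are never touched, which is also why this approach only helps when the insert size is short enough for the pair to overlap.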

Incidentally, I wrote another tool, consect.sh (CONservative Error Correction Tool), which accepts multiple versions of the same error-corrected reads and prints a consensus: the original reads, with only the corrections that all versions agree on. In other words, you correct with multiple different error-correction tools (or, perhaps, the same tool with different kmer lengths), and if any one of the outputs disagrees about a given correction, that base is reverted to the original.
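The consensus rule described above can be sketched for a single read as follows. This is a hypothetical Python illustration, not consect.sh's code; the function name is made up, and it assumes substitution-only corrections, so every corrected version has the same length as the original read.

```python
def conservative_consensus(original, corrected_versions):
    """Keep a correction only if every corrected version made the same call.

    original           : the uncorrected read sequence
    corrected_versions : the same read as emitted by each error corrector
    """
    for version in corrected_versions:
        if len(version) != len(original):
            raise ValueError("this sketch assumes substitution-only corrections")

    consensus = []
    for i, base in enumerate(original):
        calls = {version[i] for version in corrected_versions}
        if len(calls) == 1:
            # All versions agree (either on a correction or on the original base).
            consensus.append(calls.pop())
        else:
            # Any disagreement (or no versions at all): revert to the original base.
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    original = "ACGTACGT"
    tool_a = "ACGAACGT"  # changes position 3 (T -> A)
    tool_b = "ACGAACCT"  # changes position 3 (T -> A) and position 6 (G -> C)
    print(conservative_consensus(original, [tool_a, tool_b]))
```

In the example, the change at position 3 survives because both versions make it, while the change at position 6 is reverted because only one version proposes it.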

ADD COMMENT • link 8.2 years ago by Brian Bushnell 20k

0

I may give the first error correction tool a try – it seems worth investigating for my test case, which involves low-frequency variants.

I've used some error correction software designed for the de novo approach and found that it tends to eliminate much of the variation I'm after (fortunately I have a data set with an accompanying truth set), hence my search for algorithms that target the non-de novo use case.

On a side note, I'm using BBMap's shuffle.sh (recommended to me in a recent post). I've noticed it uses a lot of memory and occasionally fails with an out-of-memory error. Is there any way to reduce the tool's memory footprint through parameters? I've given the Java program the full 96GB I'm allotted on a 256GB compute node via the -Xmx argument, but that doesn't appear to be enough. The data for this analysis is human WGS with about 300 million reads.

Many thanks!

ADD REPLY • link 8.2 years ago by greenstick ▴ 10

0

Shuffle currently stores all reads in memory, so it does not work with huge datasets; its implementation is very simple. However, sortbyname.sh is more complex: when memory is exceeded it writes temp files, so it can handle arbitrarily large files. Right now, it only does sorting, not shuffling, but shuffling is essentially a sort-like operation, so it's fairly trivial to add; maybe I'll do that next week. In the meantime, you might try seqkit. It does sequence shuffling, though I don't know whether it needs to store all the reads in memory.
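As a rough illustration of how shuffling can be done with temp files instead of holding every read in memory, here is a Python sketch. It is not how shuffle.sh, sortbyname.sh, or seqkit work internally; the function name, the bucket count, and the assumption of strict 4-line FASTQ records in a file ending with a newline are all choices made for the example. Each record is scattered to a randomly chosen temp file (the coarse part of a random sort key), then each temp file is shuffled in memory on its own and appended to the output, so peak memory is roughly one bucket rather than the whole dataset.

```python
import os
import random
import tempfile


def shuffle_records_low_memory(in_path, out_path, num_buckets=16,
                               lines_per_record=4, seed=None):
    """Shuffle fixed-size text records using on-disk buckets (illustrative only)."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_buckets)]
        buckets = [open(p, "w") for p in paths]
        try:
            # Pass 1: scatter each record to a uniformly random bucket.
            with open(in_path) as src:
                while True:
                    record = [src.readline() for _ in range(lines_per_record)]
                    if not record[0]:
                        break  # end of input
                    rng.choice(buckets).writelines(record)
        finally:
            for handle in buckets:
                handle.close()

        # Pass 2: shuffle one bucket at a time and append it to the output.
        with open(out_path, "w") as dst:
            for path in paths:
                with open(path) as handle:
                    lines = handle.readlines()
                records = [lines[i:i + lines_per_record]
                           for i in range(0, len(lines), lines_per_record)]
                rng.shuffle(records)
                for record in records:
                    dst.writelines(record)
```

For interleaved paired-end FASTQ, setting lines_per_record=8 would keep mates together. More buckets mean less memory per bucket, at the cost of more open temp files.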

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

Thanks so much for your quick response, I'll give seqkit a go. If I were a Java programmer I'd send a pull request; maybe I'll give it a go nonetheless in April.

ADD REPLY • link 8.2 years ago by greenstick ▴ 10