I've found that many error correction algorithms for sequencing data are designed for the de novo assembly use case, but for MSA-based analyses (specifically variant calling) the options seem limited. I'm hoping someone here can weigh in: is there any error correction software that you use for variant calling?
Of course, I'm sure most people will say no, but I'd like to hear from those who do use such software; I'm particularly interested in the approaches these packages use to distinguish true errors from true signal.
Thanks
I may give the first error correction tool a try; it seems worth investigating for my test case, which involves low-frequency variants.
I've used some error correction software designed for de novo assembly and found that it tends to eliminate much of the variation I'm after (fortunately I have a data set with an accompanying truth set), hence my search for algorithms that target the non-de novo use case.
On a side note, I'm using BBMap's shuffle.sh (recommended to me in a recent post). I've noticed that it uses a lot of memory and occasionally fails with an out-of-memory error. Is there any way to reduce the tool's memory footprint through parameters? The data for this analysis is human WGS with about 300 million reads. I've given the Java program the full 96GB allotted to me on a 256GB compute node via the -Xmx argument, but it doesn't appear to be enough.
Many thanks!
Shuffle is very simple: it currently stores all reads in memory, so it does not work with huge datasets. sortbyname.sh, however, is more complex; when memory is exceeded it writes temp files, so it can handle arbitrarily large files. Right now it only does sorting, not shuffling, but shuffling is essentially a sort-like operation, so it's fairly trivial to add; maybe I'll do that next week. In the meantime, you might try seqkit. It does sequence shuffling, though I don't know whether it needs to store all the reads in memory.
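To illustrate why shuffling is a sort-like operation, here is a rough Python sketch of a disk-backed shuffle: tag each record with a random key, sort fixed-size chunks by that key, spill each chunk to a temp file, then merge the chunks. This is not how shuffle.sh or sortbyname.sh is implemented; the function names and the chunk_size parameter are made up for the example, and a "record" is assumed to be one complete read (for example all four lines of a FASTQ entry) passed in as a single string.

```python
import heapq
import pickle
import random
import tempfile


def _write_run(run):
    """Sort one in-memory chunk by its random key and spill it to a temp file."""
    run.sort()
    handle = tempfile.TemporaryFile()
    for item in run:
        pickle.dump(item, handle)
    handle.seek(0)
    return handle


def _read_run(handle):
    """Stream (key, index, record) tuples back from a spilled chunk."""
    while True:
        try:
            yield pickle.load(handle)
        except EOFError:
            handle.close()
            return


def external_shuffle(records, chunk_size=100_000, seed=None):
    """Yield records in random order, holding at most chunk_size of them in memory
    while reading the input (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)
    runs, run = [], []
    for index, record in enumerate(records):
        # The random key decides the final order; the index breaks ties
        # so the records themselves are never compared.
        run.append((rng.random(), index, record))
        if len(run) >= chunk_size:
            runs.append(_write_run(run))
            run = []
    if run:
        runs.append(_write_run(run))
    # Merging the sorted chunks is an ordinary external merge sort on random keys.
    for _key, _index, record in heapq.merge(*(_read_run(h) for h in runs)):
        yield record
```

With this layout, peak memory is governed by chunk_size rather than by the total number of reads, at the cost of writing every record to disk once. For paired-end data, each pair would need to be treated as one record so that mates stay together.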
Thanks so much for your quick response; I'll give seqkit a go. If I were a Java programmer I'd send a pull request; maybe I'll attempt one nonetheless in April.