Hello,
I have a deep-sequencing experiment to analyze, and I am unsure which variant-calling algorithm/program to use, as I have some doubts about scalability.
For a small experiment (not deep-seq), we usually follow the GATK recommendations and best-practices guides, combining a couple of tasks such as deduplication, which keeps the analysis time more or less acceptable. For a deep-sequencing experiment, there is no rationale for removing duplicates, which makes the variant-calling step very long and not very scalable.
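For reference, the dedup step in our usual pipeline is roughly the following (Picard MarkDuplicates; the file names are just placeholders):

    # mark PCR duplicates in a coordinate-sorted BAM (placeholder paths)
    java -jar picard.jar MarkDuplicates \
        I=sample.sorted.bam \
        O=sample.dedup.bam \
        M=sample.dup_metrics.txt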
I haven't tried the GATK variant caller on deep-seq data yet, but I expect it will take a lot of time. What is the best option for calling variants on deep-seq data in terms of scalability? If anyone has tried this in the past, I would be grateful for any hints on the best way to do it.
Thanks
Do you have a BAM that's already sorted and aligned? Besides the added I/O burden, I wouldn't expect variant calling on deep-sequencing data to take that much longer; I would expect the time needed to scale more with the size of the target region. Maybe I'm missing something?
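If you want to double-check, the sort order is recorded in the BAM header, so something like this should tell you (sample.bam is a placeholder):

    # a coordinate-sorted BAM reports SO:coordinate in its @HD line
    samtools view -H sample.bam | head -1
    # indexing also fails fast if the BAM isn't coordinate-sorted
    samtools index sample.bam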
Yes, I have BAMs sorted and indexed. I'm at the stage where I need to call variants on them, but I still haven't decided which variant caller can handle such sequencing depth. It's a MiSeq run, so even running it on a cluster would be a long shot, I guess. Now the question: which variant caller is the best bet for such coverage? I can't find any comparison along those lines.
Two questions. First, what kind of coverage do you have? (At 10-90x coverage you should just stick to the GATK best practices and be patient, or parallelize; above ~200x coverage the smart callers will be too slow.) Second, what kind of information do you want to end up with? A mammalian diploid sequence can be seen with high probability by downsampling to 30x; do it twice if you're not certain. Metagenomes or heterogeneous tumor sequencing need precision on the alternate-allele percentage and can't be downsampled that far.
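If downsampling fits your question, a minimal sketch with samtools (the fraction is an example, roughly a third to get ~30x from ~90x; the integer part of -s is the random seed):

    # keep ~33% of read pairs, seed 42
    samtools view -b -s 42.333 sample.bam > sample.30x.bam
    samtools index sample.30x.bam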
Thanks Karl, yes, my coverage is in the 10-90x range. To be precise about 'slow' for people reading this thread: I'm talking about weeks for a variant call on a single run, on an SGE cluster :) I'm not doing that, but it is exactly what I want to avoid, which is why I asked the question. Besides, I don't want a program to crash because it doesn't scale to high coverage, so I want to rule those out before planning my analysis pipeline.
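For what it's worth, my tentative plan to keep the wall time down is to scatter the calling per chromosome on the cluster and merge afterwards, something along these lines (untested sketch with GATK 3-style flags; paths and names are placeholders):

    # one SGE job per chromosome; merge the per-chromosome VCFs afterwards
    # (e.g. with CatVariants or vcf-concat)
    for CHR in $(seq 1 22) X Y; do
        qsub -N hc_$CHR -cwd -b y \
            java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R ref.fasta -I sample.dedup.bam -L $CHR -o sample.$CHR.vcf
    done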