Question

Variant Callers for deep sequencing

5

10.9 years ago

Rad ▴ 810

Hello,

I have a deep sequencing experiment to analyze, and I am unsure which variant caller to use because I have doubts about scalability.

For a small experiment that is not deep sequencing, we usually follow the GATK recommendations and best-practices guides, combining a few steps such as deduplication, which keeps the analysis time more or less acceptable. For a deep sequencing experiment, there is no rationale for removing duplicates, which makes the variant-calling step very long and not very scalable.

I haven't tried the GATK variant caller on deep-seq data yet, but I guess it will take a lot of time. What is the best option for calling variants on deep-seq data in terms of scalability? If anyone has tried this in the past, I would be grateful for any hints on the best way of doing it.

Thanks

variant deep-sequencing caller snps • 3.9k views

updated 3.5 years ago by Ram 45k • written 10.9 years ago by Rad ▴ 810

1

Do you have a BAM that's already sorted and aligned? Besides the added I/O burden, I wouldn't expect variant calling on deep sequencing data to take that much longer. I would expect the time needed would scale more with the size of target region. Maybe I'm missing something?

updated 5.3 years ago by Ram 45k • written 10.9 years ago by Katie D'Aco ★ 1.1k

0

Yes, I have BAMs that are sorted and indexed. I am at the stage where I need to call variants on them, but I have not yet decided which variant caller can handle such a sequencing depth. It is a MiSeq, so even running that on a cluster would be a long shot, I guess. Now the question: which variant caller is the best bet for such coverage? I can't find any comparison along those lines.

written 10.9 years ago by Rad ▴ 810

1

Two questions: What kind of coverage do you have? (With 10-90x coverage you should just stick to gatk-bp and be patient or parallelize. At >200x coverage the smart callers will be too slow.) What kind of information do you want to end up with? A mammalian diploid sequence can be recovered with high probability by downsampling to 30x. Do it twice if you're not certain. Metagenomes or heterogeneous tumor sequencing need alternate-allele percentage precision and can't be downsampled so far.
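As a rough illustration of the downsampling idea, here is a minimal Python sketch that computes what fraction of reads to keep to bring a given mean depth down to about 30x. The function name and the example depths are assumptions made for illustration; the resulting fraction would be handed to whatever read-subsampling tool you use, with a different random seed for the second pass if you want to "do it twice".

```python
def downsample_fraction(current_depth: float, target_depth: float = 30.0) -> float:
    """Fraction of reads to keep so mean depth drops to roughly target_depth.

    Hypothetical helper for illustration; depths are mean coverage values.
    """
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    # Never "upsample": data already at or below the target are kept whole.
    return min(1.0, target_depth / current_depth)


if __name__ == "__main__":
    # Example mean depths (illustrative values only).
    for depth in (20, 90, 200, 1000):
        print(f"{depth}x -> keep fraction {downsample_fraction(depth):.3f}")
```

This only makes sense for the diploid case described above; for metagenomes or heterogeneous tumors, where allele fractions matter, the full depth has to be kept.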

updated 5.3 years ago by Ram 45k • written 10.9 years ago by karl.stamm 4.1k

0

Thanks Karl, yes, my coverage is about 10-90x. To define 'slow' for people reading this thread: I am talking about weeks of variant calling on a single run, on an SGE cluster :) I am not doing that, but it is what I want to avoid, and it is why I asked the question. Besides, I don't want any program to crash because it cannot scale to high coverage, so I want to rule those out before planning my analysis pipeline.

written 10.9 years ago by Rad ▴ 810

Ram · Accepted Answer · 2014-06-04

4

10.9 years ago

Sean Davis 27k

For 90x coverage, GATK works fine for us. You can parallelize GATK by running it per chromosome. If you want to go even further, freebayes can be safely parallelized across non-overlapping regions of chromosomes. In any case, for most callers, 90x shouldn't be too bad. Do make sure that any high-depth filters that are inherent in the defaults are either turned off or set to more sane numbers for your data.
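To make the "non-overlapping regions" idea concrete, here is a minimal Python sketch that cuts each chromosome into fixed-size, non-overlapping chunks. The chromosome lengths, the chunk size, and the 0-based end-exclusive coordinate convention are all assumptions for illustration; check the region syntax your caller expects before turning these tuples into command-line arguments.

```python
from typing import Dict, Iterator, Tuple

# (chromosome, start, end): 0-based, end-exclusive. This convention is an
# assumption; convert to whatever your variant caller expects.
Region = Tuple[str, int, int]


def split_regions(chrom_lengths: Dict[str, int], chunk_size: int) -> Iterator[Region]:
    """Yield non-overlapping regions that together cover every chromosome."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, chunk_size):
            # The last chunk of a chromosome is truncated at its end.
            yield chrom, start, min(start + chunk_size, length)


if __name__ == "__main__":
    toy_lengths = {"chr1": 2500, "chr2": 1000}  # toy values, not real sizes
    for region in split_regions(toy_lengths, 1000):
        print(region)
```

Per-chromosome parallelization is the special case where the chunk size is at least as large as the longest chromosome.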

updated 5.3 years ago by Ram 45k • written 10.9 years ago by Sean Davis 27k

0

Thank you Sean. And what about, let's say, 1000x (the datasets I receive vary in coverage)? Is there any recommendation for such runs?

written 10.9 years ago by Rad ▴ 810

0

Exomes can definitely achieve that level of coverage with modern sequencers, so I'd give it a try. Variant-calling is almost embarrassingly parallel, so for many callers, you can simply run on a per-chromosome or per-region analysis and combine results.
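The run-per-region-and-combine pattern could be sketched in Python roughly as follows. Everything here is a placeholder: `fake_caller` stands in for whatever actually launches your variant caller on one region (for example via a subprocess), and the variant tuples are a toy representation, not a real VCF record.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Region = Tuple[str, int, int]        # (chromosome, start, end)
Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def scatter_gather(
    regions: Sequence[Region],
    call_region: Callable[[Region], List[Variant]],
    max_workers: int = 4,
) -> List[Variant]:
    """Call variants on each region in parallel and concatenate the results.

    Threads are enough if call_region mostly waits on an external process.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input regions,
        # so sorted, non-overlapping regions give a sorted merged list.
        per_region = list(pool.map(call_region, regions))
    merged: List[Variant] = []
    for variants in per_region:
        merged.extend(variants)
    return merged


def fake_caller(region: Region) -> List[Variant]:
    """Hypothetical stand-in for a real caller: one made-up variant per region."""
    chrom, start, _end = region
    return [(chrom, start + 1, "A", "G")]


if __name__ == "__main__":
    toy_regions = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 50)]
    for variant in scatter_gather(toy_regions, fake_caller):
        print(variant)
```

On a cluster, the same scatter step would typically be one job per region, with the gather step run once all jobs finish.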

updated 5.3 years ago by Ram 45k • written 10.9 years ago by Sean Davis 27k

0

Cool thanks Sean, much appreciated

written 10.9 years ago by Rad ▴ 810