Variant calling in large sample populations
5.8 years ago
robjohn70000 ▴ 160

Hi,

I would like to carry out variant calling from FASTQ files (whole genomes from a large number of samples, several hundred) for genetic association studies. I have come across some pipelines, but I am not sure which one is best for what I want to do.

Can anyone with experience in batch variant calling suggest the fastest and most reliable pipelines for this kind of work? Another question: since I just want to generate genotypes against the human reference genome for association studies, if I use GATK, for instance, should I use HaplotypeCaller or MuTect for variant calling?

Any advice for batch runs for variant calling will also be welcome. Thanks

sequencing genome sequence gatk • 2.7k views

Are you aware that hundreds of WGS samples will consume several tens of terabytes for raw data alone? Do you have the computational resources to handle these amounts of data and the respective CPU/memory to align and process them?


Thanks for raising these two potential problems, @ATpoint. We have a machine with 250 GB of RAM and a 5 TB hard drive. However, I wonder whether the work is still feasible with these resources.
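A rough back-of-the-envelope check of the storage question can be sketched as follows. The per-sample size and the sample count are assumptions chosen only to be in line with the "several tens of terabytes" estimate above; they are not measurements, and real sizes depend on coverage, read length, and compression.

```python
# Rough storage estimate for raw WGS data (all numbers below are assumptions).
N_SAMPLES = 300        # hypothetical: "several hundred" samples
GB_PER_SAMPLE = 100    # assumed size of compressed FASTQ per genome
DISK_TB = 5            # disk size mentioned in the thread


def raw_storage_tb(n_samples, gb_per_sample):
    """Total raw-data footprint in terabytes (1 TB = 1000 GB)."""
    return n_samples * gb_per_sample / 1000.0


def max_samples_on_disk(disk_tb, gb_per_sample):
    """How many samples' raw data fit on the disk, ignoring BAM/VCF outputs."""
    return int(disk_tb * 1000 // gb_per_sample)


print(f"Raw data for {N_SAMPLES} samples: "
      f"{raw_storage_tb(N_SAMPLES, GB_PER_SAMPLE):.1f} TB")
print(f"Samples that fit on {DISK_TB} TB: "
      f"{max_samples_on_disk(DISK_TB, GB_PER_SAMPLE)}")
```

Under these assumptions the raw data alone would exceed the available disk several times over, before counting aligned BAM files and intermediate outputs, so samples would probably have to be processed in batches with intermediate files removed along the way.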

5.8 years ago

Use the GATK GVCF approach: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf and https://software.broadinstitute.org/gatk/documentation/article.php?id=3893. You can create, in parallel, a *.g.vcf file for each sample and each chromosome with HaplotypeCaller in GVCF mode, then run GenotypeGVCFs for each chromosome, and at the end merge the per-chromosome results into the final VCF.
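The per-sample, per-chromosome scheme described above can be sketched as a small command planner. This is a minimal sketch assuming GATK4 command-line syntax, where the per-sample GVCFs are combined per chromosome (here with CombineGVCFs; GenomicsDBImport is an alternative) before GenotypeGVCFs. The reference path, sample names, BAM paths, and output names are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex

REFERENCE = "reference.fasta"                 # hypothetical reference path
SAMPLES = {"sampleA": "sampleA.bam",          # hypothetical sample -> BAM map
           "sampleB": "sampleB.bam"}
CHROMOSOMES = ["chr1", "chr2"]                # extend to all contigs in order


def haplotypecaller_cmd(sample, bam, chrom):
    """One GVCF per sample and chromosome; these jobs can run in parallel."""
    return ["gatk", "HaplotypeCaller", "-R", REFERENCE, "-I", bam,
            "-L", chrom, "-ERC", "GVCF", "-O", f"{sample}.{chrom}.g.vcf.gz"]


def combine_cmd(chrom, samples):
    """Combine all per-sample GVCFs of one chromosome into a cohort GVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", REFERENCE, "-L", chrom]
    for sample in samples:
        cmd += ["-V", f"{sample}.{chrom}.g.vcf.gz"]
    return cmd + ["-O", f"cohort.{chrom}.g.vcf.gz"]


def genotype_cmd(chrom):
    """Joint genotyping for one chromosome."""
    return ["gatk", "GenotypeGVCFs", "-R", REFERENCE,
            "-V", f"cohort.{chrom}.g.vcf.gz", "-L", chrom,
            "-O", f"cohort.{chrom}.vcf.gz"]


def gather_cmd(chroms):
    """Merge per-chromosome VCFs; inputs must be listed in genomic order."""
    cmd = ["gatk", "GatherVcfs"]
    for chrom in chroms:
        cmd += ["-I", f"cohort.{chrom}.vcf.gz"]
    return cmd + ["-O", "cohort.vcf.gz"]


def build_plan():
    """Return every command of the workflow in execution order."""
    plan = [haplotypecaller_cmd(s, bam, c)
            for s, bam in SAMPLES.items() for c in CHROMOSOMES]
    plan += [combine_cmd(c, list(SAMPLES)) for c in CHROMOSOMES]
    plan += [genotype_cmd(c) for c in CHROMOSOMES]
    plan.append(gather_cmd(CHROMOSOMES))
    return plan


for command in build_plan():
    print(shlex.join(command))
```

The HaplotypeCaller commands are independent of each other, so they are the natural unit to distribute across cores or cluster jobs; the later steps only depend on all GVCFs of a given chromosome being finished.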


I really appreciate the info, @Pierre Lindenbaum. Do I need to chunk the files by chromosome at some stage? I am not really sure about this. Thanks.


Thanks @Pierre Lindenbaum


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.



@Pierre Lindenbaum, thanks for reminding me. I totally forgot.

5.8 years ago
agata88 ▴ 870

You can try to follow this workflow:

  1. Quality trimming - with Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)

  2. Mapping reads to the human genome - with BWA (http://bio-bwa.sourceforge.net/)

  3. Variant calling - with SAMtools mpileup (http://samtools.sourceforge.net/) or VarScan (http://varscan.sourceforge.net/)

  4. Annotation of detected variants - with SnpEff (http://snpeff.sourceforge.net/)

Try to optimize the program parameters on one or two samples, and then run the workflow on the rest of the samples.

Best,

Agata
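The first two steps of this workflow, together with the "tune on a couple of samples, then run the rest" advice, can be sketched as below. This is only an illustration: the jar path, reference name, FASTQ naming scheme, and the trimming settings (SLIDINGWINDOW:4:20, MINLEN:36) are assumptions to be tuned on the pilot samples, and the script just prints the commands.

```python
import shlex

REFERENCE = "GRCh38.fa"              # hypothetical; must be indexed with `bwa index`
TRIMMOMATIC_JAR = "trimmomatic.jar"  # hypothetical path to the Trimmomatic jar


def trim_cmd(sample, threads=4):
    """Paired-end quality trimming with Trimmomatic (step 1)."""
    return ["java", "-jar", TRIMMOMATIC_JAR, "PE", "-threads", str(threads),
            f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz",
            f"{sample}_R1.paired.fastq.gz", f"{sample}_R1.unpaired.fastq.gz",
            f"{sample}_R2.paired.fastq.gz", f"{sample}_R2.unpaired.fastq.gz",
            "SLIDINGWINDOW:4:20", "MINLEN:36"]


def align_cmd(sample, threads=4):
    """Map trimmed reads with BWA-MEM and sort the output (step 2)."""
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return (f"bwa mem -t {threads} -R {shlex.quote(read_group)} {REFERENCE} "
            f"{sample}_R1.paired.fastq.gz {sample}_R2.paired.fastq.gz "
            f"| samtools sort -@ {threads} -o {sample}.sorted.bam -")


def split_pilot(samples, n_pilot=2):
    """Separate a few pilot samples (for parameter tuning) from the rest."""
    return samples[:n_pilot], samples[n_pilot:]


samples = ["s1", "s2", "s3", "s4"]   # hypothetical sample names
pilot, rest = split_pilot(samples)

for label, batch in (("pilot", pilot), ("batch", rest)):
    for sample in batch:
        print(f"# {label}: {sample}")
        print(shlex.join(trim_cmd(sample)))
        print(align_cmd(sample))
```

Keeping the parameters in one place makes it straightforward to rerun the pilot samples with different settings before committing the whole cohort to the same configuration.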


Please note that variant calling with samtools mpileup is now deprecated; that functionality has been moved to bcftools (bcftools mpileup).
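For reference, the bcftools-based equivalent of the calling step might look like the sketch below, which pipes `bcftools mpileup` into `bcftools call`. The file names are hypothetical, the options shown (`-m` multiallelic caller, `-v` variants only, `-Oz` compressed VCF output) are one common choice rather than a recommendation from this thread, and the function only builds the command string.

```python
import shlex


def bcftools_call_pipeline(reference, bam, out_vcf):
    """Build a `bcftools mpileup | bcftools call` shell pipeline as a string."""
    return (f"bcftools mpileup -f {shlex.quote(reference)} {shlex.quote(bam)} "
            f"| bcftools call -mv -Oz -o {shlex.quote(out_vcf)}")


# Hypothetical file names, for illustration only.
print(bcftools_call_pipeline("ref.fa", "s1.sorted.bam", "s1.vcf.gz"))
```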


Thanks for the workflow advice, @agata88. I will take @ATpoint's point about samtools mpileup into account as well.
