Best method to call SNPs/genotype many bam files from only a defined list of variants/SNPs?
2
0
Entering edit mode
8.6 years ago
mmats010 ▴ 80

Let's say I generated a list of 25,000 SNPs across the genome of my particular organism (which is a diploid non-model with no publically available variant resources). So, given this list of SNPs (in an appropriate format, VCF?) is there a standard method available to call genotypes from input bam files for ONLY the 25,000 variants on this list? My ideal output would be a file with 25,000 genotype calls for each individual input bam file, with the actual genotype calls being along the lines of "AG" or "GG", for example.

Essentially, is there a way to do the same thing you would do for a genotyping microarray, but applying it to illumina reads?

I'm vaguely aware that samtools mpileup can do something like this, but I'm looking for a simpler method to actually call the genotypes that doesn't involve manually calculating read ratios per site (although that data could still be useful). Could you force Haplotypecaller to do something like this?

As for my actual application: I am illumina sequencing ~70 members of a mapping population with the intent of building a linkage map to validate and scaffold a reference genome assembly.

Thanks, Mike

samtools genotyping bam sequencing genome • 4.8k views
ADD COMMENT
0
Entering edit mode

I think that GATK Haplotypecaller using flag --dbsnp your_25000_SNPs.vcf should answer your question.

ADD REPLY
0
Entering edit mode
8.6 years ago
igor 13k

GATK HaplotypeCaller has an --alleles option:

The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES When the UnifiedGenotyper is put into GENOTYPE_GIVEN_ALLELES mode it will genotype the samples using only the alleles provide in this rod binding

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

ADD COMMENT
0
Entering edit mode
8.3 years ago
mmats010 ▴ 80

Hi @igor and @Nicola Casiraghi, Unfortunately, both of your approaches seem to have failed for me.

After carefully filtering a file containing off of my SNPs (with a PASS during filtering) I am interested in, I came up with a my_25000_SNPs.vcf file. However, if i pass this VCF file into the --dbSNP argument, while in "--genotyping_mode GENOTYPE_GIVEN_ALLELES" mode, the output file gives me exactly zero results. Annoyingly, the process still takes as long as one might expect HaplotypeCaller to process ~25000 sites but doesn't give me an error message. The output file is my VCF file but with zero variant rows.

I will add that I have tried variations of both of your solutions, as well as adding -L my_25000_SNPs.vcf while in both "-DISCOVERY" mode and "GENOTYPE_GIVEN_ALLELES " mode for both a given --dbsnp file.vcf or -L file.vcf

should I give a file.table (two columns only, output of VariantsToTable) in the -L option?

Seems more complicated than it should be, Mike

ADD COMMENT

Login before adding your answer.

Traffic: 1877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6