Tutorial:GATK method for filtering vcf lines using GT values at all or multisample level.
1
2
Entering edit mode
7.6 years ago
kirannbishwa01 ★ 1.6k

Below is a short tutorial for filtering vcf lines using GT values at all or multisample level. Filtration of vcf using genotype values at single sample level is mostly explain at GATK, but filtration of vcf sites at population level (multisample genotype annotation) is developing. SnpSift and GATK have methods to do it, but less examples are explained.

I just had to do some filtering of my own data and tested several methods to see what the outcome would be. I am sharing those here for others and myself, if it is ever needed again.

genome vcf genotype variant-filtration • 4.5k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
4
Entering edit mode
7.6 years ago
kirannbishwa01 ★ 1.6k

I just had to do some filtering of my own data and tested several methods (to see what the outcome would be). I am sharing those here for others and myself, if it is ever needed again.

Tutorials .... !!!

- method to call the sites that are all homozygous Variants (1/1, 2/2 genotype)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6'
# outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed

- in the above script if we want the site where at least 4 samples are homozygous ref or variants

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4'
# outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.)

- at least 1 sample is homVar (1/1, 2/2 or 3/3)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0'

- all the samples are homRef (0/0 genotypes)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6'

Note: 6 chr = 3 samples (when samples are diploid)

- sites that are all HomVar or nocalls (./.)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2'
# outcome: no 0/0, 0/1
# Note: samples with GT = ./. are not counted as called chr

Useful after samples are sub-sampled from multisample vcfs

- sites where no genotypes were called for all the samples in the selected vcf - or say, select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0'
# outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected

- samples are either homRef (0/0) or no call (./.)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2'
# outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0

- site where 6 chromosomes (exact) or 3 samples were called for diploid genome

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6'
# Note of Caution: 6 chr = 3 samples (when samples are diploid)

- site where at least 3 samples (diploid) are called

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6'

- if I have 6-sample vcf

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12'
# 12 haploid chr = 6 samples (diploid)
# outcome: sites that were called in all samples, there will be no GT = ./.

- select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0'

- select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1'

- site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.)

java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'

- where a site is only homRef for a particular sample and is only SNP

java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf

- where a site is only homVar for a particular sample and is only SNP

java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomVar()' -O test_variants.ms01e.Chr_2_3.vcf

- where a site is not homRef for a particular sample and is only SNP. This will also select empty sites

java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select '! vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf

NOTE: following genotype call methods won't work:

# to get only heterozygous sites
-select 'vc.getGenotype("ms01e").isHetVar() 

# to get any variant sites
-select 'vc.getGenotype("ms01e").isVar() 

# won't work if getGenotype method is combined
-select 'vc.getGenotype("ms01e").isHomVar() || vc.getGenotype("ms01e".isHomRef()'  

So, I figured out the following methods where count helps with selecting particular site

- where a site is a heterozygous variant site for a particular sample

java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select 'vc.getHetCount() == 1'  -O test_variants.ms01e.vcf

NOTE: 'vc.getGenotype("ms01e").isHetVar()' wont work so generated the above expression.

- where a site is a variant site (either homVar or hetVar) for a particular sample and is only SNP

java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getHetCount() == 1 || vc.getHomVarCount() == 1'  -O test_variants.ms01e.vcf

NOTE: 'vc.getGenotype("NA12878").isHomRef()' || vc.getGenotype("NA12878").isHomVar()' wont work together, so generated this above expression.

Other updates to follow. Thanks,

ADD COMMENT

Login before adding your answer.

Traffic: 1590 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6