For a VCF file containing information for multiple samples, like the one below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0/0:48:1:51,51 1/0:48:8:51,51 1/1:43:5:.
In genetic analyses involving a familial pedigree, we usually want to compare genotypes among samples (parent vs. child). For example, I now want to select the SNPs that appear in all samples, which means the genotype (GT) field for all three should be 0/1 or 1/1.
I know it can be done with some bash commands (and this is what I'm doing right now); I'm just curious whether VCFtools has any built-in function for such a comparison.
thx
I don't think VCFtools (http://vcftools.sourceforge.net/docs.html) has the functionality we need for this (if it does, I can't find it...), which is why most people write their own Perl or Python scripts to filter their data for pedigree analysis at this stage. You could do it in bash as well, I guess, and search for each genotype flag as a regex.
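As a rough illustration of the "write your own script" route, here is a minimal Python sketch that keeps only the records in which every sample carries at least one non-reference allele. The function names (carries_alt, shared_variants) are made up for this example, and it assumes GT is the first FORMAT subfield and that columns are whitespace-separated, as in the record quoted in the question.

```python
import re


def carries_alt(sample_field):
    """Return True if the sample's GT has at least one non-reference allele."""
    # Assumption: GT is the first colon-separated subfield of the sample column.
    gt = sample_field.split(":")[0]
    # Alleles are separated by "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", gt)
    # "0" is the reference allele and "." is a missing call.
    return any(a not in ("0", ".") for a in alleles)


def shared_variants(vcf_lines):
    """Yield data lines where every sample carries an ALT allele."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split()
        samples = fields[9:]  # sample columns follow the FORMAT column
        if samples and all(carries_alt(s) for s in samples):
            yield line.rstrip("\n")


if __name__ == "__main__":
    example = [
        "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003",
        "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ "
        "0/0:48:1:51,51 1/0:48:8:51,51 1/1:43:5:.",
    ]
    # NA00001 is 0/0 in this record, so nothing is printed for it.
    for record in shared_variants(example):
        print(record)
```

For a real file you would pass an open file handle instead of the example list. Note that this treats 0/1, 1/0 and phased calls such as 1|1 the same way, which a plain regex for "0/1" or "1/1" would not.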
What is the end goal? Perhaps you want to do more advanced analyses? Do you want to phase the data? Are you looking for a locus that might be disease-causing?
Yeah, Zev, I need to find the disease-causing SNP, i.e. to find which SNP segregates with the disease according to the pedigree.
If you have the variants called, which it looks like you do, why not let VAAST do the work for you? Our lab developed VAAST and our mailing list is very friendly.
Zev, I agree VAAST looks like an interesting tool. I can see how it may be helpful in identifying pathogenic variants in multigenic disease models, but how does it improve gene finding in autosomal recessive or dominant models where there is one causative gene? If @gerrybio2010 is looking for one gene, pulling out variants shared/not shared by proband and parents with a script will do the trick. I haven't used VAAST, but am certainly willing to give it a try.
Yes, binary filtering can do the trick. But what if you are missing data? A binary filter can remove the causal variant if the parents don't have coverage. VAAST takes a probabilistic approach, using knowledge of the trio and the frequencies of the alleles in a background file such as 1000 Genomes. Secondly, VAAST scores how deleterious a mutation is by using BLOSUM tables and OMIM data.
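To make the missing-data point concrete, here is a small Python sketch of a filter that distinguishes a confident reference call from a missing one instead of treating both as "not carrying the variant". This is only an illustration of the problem, not what VAAST does; the function names and the max_missing parameter are hypothetical, and GT is assumed to be the first FORMAT subfield.

```python
import re


def classify_genotype(sample_field):
    """Classify a sample column as 'alt', 'ref' or 'missing' based on GT."""
    gt = sample_field.split(":")[0]  # assumption: GT is the first subfield
    alleles = re.split(r"[/|]", gt)
    if all(a == "." for a in alleles):
        return "missing"  # no call at all, e.g. "./." from lack of coverage
    if any(a not in ("0", ".") for a in alleles):
        return "alt"  # at least one non-reference allele
    return "ref"


def keep_variant(sample_fields, max_missing=1):
    """Keep a site unless some sample is confidently homozygous reference.

    Missing calls are tolerated up to max_missing (a hypothetical threshold),
    and at least one sample must actually carry the ALT allele.
    """
    calls = [classify_genotype(s) for s in sample_fields]
    return (
        "ref" not in calls
        and "alt" in calls
        and calls.count("missing") <= max_missing
    )


if __name__ == "__main__":
    # A strict 0/1-or-1/1 filter would drop this site because of the "./." call.
    print(keep_variant(["0/1:48:8", "./.:.:.", "1/1:43:5"]))
```

The threshold is still a hard cutoff, so it only softens the problem; weighing the evidence probabilistically is the part a tool like VAAST is meant to handle.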