I have extracted a region from 1KG webserver like
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N -e 'select concat("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.",K.chrom,".phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"),concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)+1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="PAX7" ' | xargs ./tabix-0.2.5/tabix -h
I extract a set of individuals,
perl /vcftools_0.1.10/perl/vcf-subset -c HG00254 /RawFiles/PAX7_raw.vcf > PAX7_sub.vcf
I run the Variant_effect_predictor (sift and polyphen). I get X amount of phathogenic variants - meaning the polyphen output is either 'possibly_damaging', 'probably_damaging', 'tolerated' or the SIFT outpout is 'tolerated'.
perl /variant_effect_predictor/variant_effect_predictor.pl -database -i PAX7_sub.vcf -sift b -polyphen p --force_overwrite -o PAX7_vep.txt
I look into the pathogenic variants
grep -E "stop_gained|missense_variant|frameshift" + maindir+geneSymbol+"_vep.txt > " + PAX7_missense_variants.txt
cat PAX7_missense_variants.txt | wc = `47 656 7110`
I wonder why the final number of pathogenic variants does not depend on the number of individuals. Here I only did the analysis on "HG00254" if I do it on another individuals the result would be the same ! isn't weird ? what I am doing wrong ?!
UPDATE For this specific genes, PAX7, in this vcf file, which corresponds to only 1 individual HG00254 there are 47 protein changing variants. however, if we do the same analysis on the original set, including all 1093 individuals, again, there are 47 protein changing variants.
for example I am doing the same analysisn on the original PAX7_raw.vcf
perl /variant_effect_predictor/variant_effect_predictor.pl -database -i /RawFiles/PAX7_raw.vcf -sift b -polyphen p --force_overwrite -o PAX7_raw_vep.txt
grep -E "stop_gained|missense_variant|frameshift" + maindir+geneSymbol+"_vep.txt > " + PAX7_raw_missense_variants.txt
and I count number of row - still 47 !
cat PAX7_raw_missense_variants.txt | wc = `47 656 7110`
Thanks for explaining about the protein-affecting variants. I have added some more code to explain why do I mean