I am analysing 1000 genomes variant data to identify patterns of selection. To do this I have downloaded 1000 genomes phase 3 vcf files (directly from here: http://www.internationalgenome.org/data), and annotated them using snpEff with default parameters. Below are just a couple of the extracted variants, including the annotation field:
1 10505 rs548419688 A T 100.0 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;ANN=T|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1369A>T|||||1369|,T|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*3857T>A|||||3857|,T|intergenic_region|MODIFIER|CHR_START-DDX11L1|CHR_START-DDX11L1|intergenic_region|CHR_START-DDX11L1|||n.10505A>T||||||
1 10506 rs568405545 C G 100.0 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;ANN=G|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1368C>G|||||1368|,G|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*3856G>C|||||3856|,G|intergenic_region|MODIFIER|CHR_START-DDX11L1|CHR_START-DDX11L1|intergenic_region|CHR_START-DDX11L1|||n.10506C>G||||||
1 10511 rs534229142 G A 100.0 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9869;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;ANN=A|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1363G>A|||||1363|,A|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*3851C>T|||||3851|,A|intergenic_region|MODIFIER|CHR_START-DDX11L1|CHR_START-DDX11L1|intergenic_region|CHR_START-DDX11L1|||n.10511G>A||||||
As you can see each of these seems to have multiple annotations. Eg. the first variant (rs548419688) has been annotated both as an upstream_gene_variant and also a downstream_gene_variant consequence. Which would be the correct consequence to use for pop gen analysis that requires allele frequency and consequence data (I am looking to identify synonymous and non-synonymous changes from the vcf files) and how does one go about choosing which to use?
Hi Sergey, thanks for such a detailed response. I will try working through biomaRt as you suggest. Once I have done this I imagine I will still have to choose between the different consequences (as I mentioned above).
Do you have any advice on how to choose the most appropriate consequence? And am I right in assuming these different annotations are a result of different transcripts?
Yes, those are different transcripts (isoforms). See DMD gene, for example.
You can limit your output to only canonical ones (should be 1 canonical isoform for a gene) with -canon option in SNPeff. In that case you can just trust SNPeff, what it will put there.
The other option is to use VEP. Pull first Gencode basic isoform IDs and coordinates from biomaRt for protein coding genes (having CCDS, for example). Then select the longest gencode basic isoform for each gene. Having this list of ensembl transcript id's, you can select the corresponding unique impact of a variant.
Best, SN
Ah that makes sense. I was outputting all consequences.
One final question: If I take the entire set of variants (not just coding) and then extract simply missense and synonymous variants, wouldn't that subset only be coding variants anyway, as missense and synonymous variants are part of coding regions?
Of course, you can just use information about substitutions (variants) from the annotation. The annotation program should know about coding and non-coding regions. If you want just to calculate pn/ps ratio as exercise and forget about it, it is ok to parse annotation.
But I would not recommend this approach in a long run. If you are going to study molecular evolution, you might want to know, in which genes those substitutions occur, what is the divergence (substitutions between Human and Mouse, for example) in those sites/genes, what are the polymorphisms (variants) in mouse population, you might want to count amino acid substitutions which correspond to 4-fold degenerate nucleotide sites, and so on. To study those things, one day you will find yourself navigating along the genome with your custom script, rather than parsing annotations.
Good luck!
Sergey