1000genomes: choosing the best annotation from multiple transcripts
1
0
Entering edit mode
7.4 years ago
spiral01 ▴ 110

I am analysing 1000 genomes variant data to identify patterns of selection. To do this I have downloaded 1000 genomes phase 3 vcf files (directly from here: http://www.internationalgenome.org/data), and annotated them using snpEff with default parameters. Below are just a couple of the extracted variants, including the annotation field:

1   10505   rs548419688 A   T   100.0   PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;ANN=T|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1369A>T|||||1369|,T|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*3857T>A|||||3857|,T|intergenic_region|MODIFIER|CHR_START-DDX11L1|CHR_START-DDX11L1|intergenic_region|CHR_START-DDX11L1|||n.10505A>T||||||


1   10506   rs568405545 C   G   100.0   PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;ANN=G|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1368C>G|||||1368|,G|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*3856G>C|||||3856|,G|intergenic_region|MODIFIER|CHR_START-DDX11L1|CHR_START-DDX11L1|intergenic_region|CHR_START-DDX11L1|||n.10506C>G||||||


1   10511   rs534229142 G   A   100.0   PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9869;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP;ANN=A|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1363G>A|||||1363|,A|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*3851C>T|||||3851|,A|intergenic_region|MODIFIER|CHR_START-DDX11L1|CHR_START-DDX11L1|intergenic_region|CHR_START-DDX11L1|||n.10511G>A||||||

As you can see each of these seems to have multiple annotations. Eg. the first variant (rs548419688) has been annotated both as an upstream_gene_variant and also a downstream_gene_variant consequence. Which would be the correct consequence to use for pop gen analysis that requires allele frequency and consequence data (I am looking to identify synonymous and non-synonymous changes from the vcf files) and how does one go about choosing which to use?

genome SNP • 2.6k views
ADD COMMENT
2
Entering edit mode
7.4 years ago

Hi spiral01!

What a brave scientist you are! (I mean, so many people were working to find selection in humans before).

As to the point, you'd better start from the other end: to identify regions of the genome you are interested to study.

It looks like you are interested in the protein coding genes only, so you may use biomaRt to extract coordinates of all coding exons. You will have exons.bed file of regions.

Then extract only those variants from the vcf, which belong to the intervals in the bed file.

Then do annotation again, and ask the annotation program to use canonical isoform only (possible for VEP, not sure for SNPeff), or write a script to calculate pn/ps ratio without doing annotation - you already have all necessary coordinates.

BTW, if you are studying protein coding genes, why are you using 1000G not GNOMAD, which has 150k WES genotypes, while 1000G has ~3k? 1000G data make sense if you study something along the whole genome (including noncoding regions).

Good luck!

Sergey

ADD COMMENT
0
Entering edit mode

Hi Sergey, thanks for such a detailed response. I will try working through biomaRt as you suggest. Once I have done this I imagine I will still have to choose between the different consequences (as I mentioned above).

Do you have any advice on how to choose the most appropriate consequence? And am I right in assuming these different annotations are a result of different transcripts?

ADD REPLY
1
Entering edit mode

Yes, those are different transcripts (isoforms). See DMD gene, for example.

You can limit your output to only canonical ones (should be 1 canonical isoform for a gene) with -canon option in SNPeff. In that case you can just trust SNPeff, what it will put there.

The other option is to use VEP. Pull first Gencode basic isoform IDs and coordinates from biomaRt for protein coding genes (having CCDS, for example). Then select the longest gencode basic isoform for each gene. Having this list of ensembl transcript id's, you can select the corresponding unique impact of a variant.

Best, SN

ADD REPLY
0
Entering edit mode

Ah that makes sense. I was outputting all consequences.

One final question: If I take the entire set of variants (not just coding) and then extract simply missense and synonymous variants, wouldn't that subset only be coding variants anyway, as missense and synonymous variants are part of coding regions?

ADD REPLY
1
Entering edit mode

Of course, you can just use information about substitutions (variants) from the annotation. The annotation program should know about coding and non-coding regions. If you want just to calculate pn/ps ratio as exercise and forget about it, it is ok to parse annotation.

But I would not recommend this approach in a long run. If you are going to study molecular evolution, you might want to know, in which genes those substitutions occur, what is the divergence (substitutions between Human and Mouse, for example) in those sites/genes, what are the polymorphisms (variants) in mouse population, you might want to count amino acid substitutions which correspond to 4-fold degenerate nucleotide sites, and so on. To study those things, one day you will find yourself navigating along the genome with your custom script, rather than parsing annotations.

Good luck!

Sergey

ADD REPLY

Login before adding your answer.

Traffic: 2945 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6