Question

Parsing Variants From 1000 Genomes Data

4

12.6 years ago

Halit ▴ 90

Hey guys,

I need to extract polymorphism data from 1000 genomes data for about 85 coding genes.

What I need in particular for each gene is (1) silent polymorphisms, (2) amino acid changing polymorphisms, (3) stop-inducing polymorphisms, and (4) global allele frequencies (I guess, for n = 1029).

I know I can get this information through the official 1000 Genomes browser or the Ensembl website. But is there any way that I can automate this process and do it in one go?

I thought of the following strategy, but perhaps you can suggest a cleaner and faster way.

1. Get chromosomal position data for each gene (i.e., exon start/end, +/- strand), perhaps from UCSC.
2. Download the genotype files (VCF) for each chromosome from 1000 Genomes (phase 1, release v3, March 2011 calls, or should I just stick to the high-coverage data?).
3. Take the VCF for chr 1 and check whether each SNP falls within an exon; if it does, note it down.
4. For the noted SNPs, parse the variant allele and the allele frequency (AF).
5. Determine the amino acid position corresponding to each SNP.
6. Check the resulting amino acid when the variant is introduced (being careful about +/- strand).
7. Classify the polymorphisms and report the corresponding AF.
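The last three steps of this strategy (find the codon, apply the variant, classify the change) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a replacement for a dedicated annotation tool: the function name `classify_snp` and its arguments are hypothetical, the coding sequence is assumed to be already spliced and given in transcript orientation, and the caller is assumed to have converted the genomic position into a 0-based offset within that sequence. The only strand handling done here is complementing the VCF allele for genes on the minus strand.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_snp(cds, cds_index, alt, strand="+"):
    """Classify a single-base substitution in a coding sequence.

    cds       -- spliced coding sequence, 5'->3' in transcript orientation
                 (assumption: length is a multiple of 3)
    cds_index -- 0-based offset of the SNP within cds (assumption: the
                 caller already mapped the genomic position to this offset)
    alt       -- alternate allele as written in the VCF (forward strand)
    strand    -- "+" or "-"; for "-" the allele is complemented

    Returns (label, ref_aa, alt_aa, aa_position) with a 1-based aa_position.
    """
    if strand == "-":
        alt = alt.translate(COMPLEMENT)

    codon_start = cds_index - cds_index % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    alt_codon = list(ref_codon)
    alt_codon[cds_index % 3] = alt.upper()
    alt_codon = "".join(alt_codon)

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        label = "silent"
    elif alt_aa == "*":
        label = "stop_gained"
    elif ref_aa == "*":
        label = "stop_lost"
    else:
        label = "missense"
    return label, ref_aa, alt_aa, cds_index // 3 + 1
```

For a toy sequence such as `ATGGAATGGTAA`, calling `classify_snp(cds, 8, "A")` changes the codon TGG to TGA and is reported as a gained stop; the same change in a minus-strand gene would appear in the VCF as `T` and be passed with `strand="-"`.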

1000genomes snp • 5.8k views

updated 12.3 years ago by Laura ★ 1.8k • written 12.6 years ago by Halit ▴ 90

1

Can you update us on how you eventually performed this analysis?

11.1 years ago by User 1933 ▴ 360

Ram · Answer 1 · 2012-04-22

5

12.6 years ago

Martin Morgan ★ 1.6k

The locateVariants and predictCoding functions in VariantAnnotation do these operations in R / Bioconductor; there are some issues with strand handling that are likely to be addressed in the next day or so (e.g., by April 25, 2012). Data come from (user) VCF files, with things like the genome and UCSC known genes provided by annotation packages, e.g., BSgenome.Hsapiens.UCSC.hg19 and TxDb.Hsapiens.UCSC.hg19.knownGene. Both genome and transcript databases can be customized for non-model organisms. See the VariantAnnotation vignette (pdf) for details.

updated 5.0 years ago by Ram 44k • written 12.6 years ago by Martin Morgan ★ 1.6k

0

Thanks Martin. This looks pretty useful. I wanted to try "finding all coding SNPs in chr22", but I failed at the first step. I (1) downloaded the VCF file for chr22 from the 1KG FTP site, (2) loaded the relevant package by calling library(VariantAnnotation), (3) pointed to the file with inputFile <- system.file("extdata", "Chromosome22.vcf.gz", package="VariantAnnotation"), and then (4) read the file content with vcf <- readVcf(inputFile, "hg19"). At this step, I get the following error:

Error: scanVcf: record 28059 INFO '0|0:0.000:-0.05,-0.96,-5.00' not found
path: C:\Users\PC101517\Documents\R\win-library\2.15\VariantAnnotation\extdata\Chromosome22.vcf.gz

Any thoughts would be very much appreciated.

12.6 years ago by Halit ▴ 90

0

VariantAnnotation is complaining about a VCF record (28059) that it cannot parse; it looks like a genotype field is being parsed as an INFO field. I'd suggest posting to the Bioconductor mailing list, where you can provide the sessionInfo() output and perhaps the portion of the file that is causing problems.
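One quick way to locate that portion of the file is to look for data lines whose number of tab-separated columns differs from the `#CHROM` header line, since a dropped or split column would shift a genotype string into the INFO position. The Python sketch below does only that; the function name `find_malformed_records` is hypothetical, and its record numbering (data lines counted from 1 after the header) is an assumption that may not match the numbering VariantAnnotation reports.

```python
import gzip


def find_malformed_records(lines):
    """Return (record_number, n_fields, n_expected) for each data line whose
    tab-separated field count differs from the #CHROM header line.

    Record numbers count data lines from 1 (assumption: this may differ
    from the record index reported by other tools).
    """
    expected = None
    record_no = 0
    bad = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##") or not line:
            continue  # meta-information or empty line
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        record_no += 1
        n_fields = len(line.split("\t"))
        if expected is not None and n_fields != expected:
            bad.append((record_no, n_fields, expected))
    return bad


# Usage with a compressed file (the path is a placeholder):
# with gzip.open("Chromosome22.vcf.gz", "rt") as handle:
#     for record_no, n_fields, n_expected in find_malformed_records(handle):
#         print(record_no, n_fields, n_expected)
```

If this reports nothing, the column layout is consistent and the problem lies elsewhere, for example in how the header declares the INFO or FORMAT fields.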

12.6 years ago by Martin Morgan ★ 1.6k

score 4 · Answer 2 · 2012-04-23

4

12.6 years ago

Laura ★ 1.8k

You could find your gene in the 1000 genomes browser

http://browser.1000genomes.org

Get a vcf file for it using our most recent release

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/

and the data slicer

http://browser.1000genomes.org/Homo_sapiens/UserData/SelectSlice

Then, if the file is small enough (<750 variants), you can use the web interface to the Variant Effect Predictor

http://www.ensembl.org/Homo_sapiens/UserData/UploadVariations

Alternatively you can use the script

http://www.ensembl.org/info/docs/variation/vep/vep_script.html

Please use the most recent VCF files; they will be much more accurate than any older data set.

0

Thanks for the reply, Laura. While this seems the most straightforward approach to take, the number of genes I want to analyze has been increasing, so I would like to find an automated way to efficiently process hundreds of genes across all chromosomes.
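For many genes at once, the slicing step can be scripted rather than done through the browser. The Python sketch below streams VCF lines once, keeps the records that fall inside any exon of interest, and pulls the AF value out of the INFO column. The function names and the `(gene, chrom, start, end)` row layout are hypothetical, and exon coordinates are assumed to be 1-based and inclusive with the same chromosome naming as the VCF. UCSC tables use 0-based starts and `chr`-prefixed names, so they would need converting first. The inner loop is a plain linear scan over exons, which is fine for a few hundred genes but is not an indexed lookup.

```python
from collections import defaultdict


def load_exons(rows):
    """Group (gene, chrom, start, end) rows by chromosome.

    Assumption: coordinates are 1-based inclusive and chromosome names
    match the VCF (e.g. "22", not "chr22").
    """
    exons = defaultdict(list)
    for gene, chrom, start, end in rows:
        exons[chrom].append((start, end, gene))
    return exons


def parse_af(info):
    """Return the AF values from a VCF INFO string as floats, or None."""
    for item in info.split(";"):
        if item.startswith("AF="):
            return [float(value) for value in item[3:].split(",")]
    return None


def exonic_variants(vcf_lines, exons):
    """Yield (gene, chrom, pos, ref, alt, af) for records inside an exon."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header and meta-information lines
        fields = line.rstrip("\r\n").split("\t")
        chrom, pos, ref, alt, info = (
            fields[0], int(fields[1]), fields[3], fields[4], fields[7]
        )
        for start, end, gene in exons.get(chrom, ()):
            if start <= pos <= end:
                yield gene, chrom, pos, ref, alt, parse_af(info)
```

The records this yields still need consequence annotation, so the output would typically be passed on to one of the annotation tools mentioned in the other answers.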

12.6 years ago by Halit ▴ 90

score 3 · Answer 3 · 2012-04-21

3

12.6 years ago

Sean Davis 27k

You could download the VCF from 1000G and then run ANNOVAR, the Ensembl Variant Effect Predictor, or snpEff.

0

Thanks, Sean. I downloaded ANNOVAR and will check it out in an hour or so - let's see if it's any good for my problem.

12.6 years ago by Halit ▴ 90