comparison of exome data to the 1000Genomes WGS data
7.0 years ago
gabili • 0

I’m trying to incorporate SNP data from 1000Genomes into my exome data. Since there are no exome VCFs available, I downloaded the 1000Genomes whole-genome sequence data and filtered it to the genomic positions of my variants (obtained from the PLINK bim file). My data is referenced to hg19, so I used the GRCh37 version of 1000Genomes found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ ("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", etc.). However, when I compared the two datasets (using PLINK 1.9 to open and filter the VCFs), I was surprised to find only ~25% of my exome variants in the big 1000Genomes WGS set (for example, I have 300,000 SNPs on chromosome 1, but only 80,000 of them were found in the 1000Genomes WGS chromosome 1 file). When I used the "exome pull down targets" data (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/) to focus my search, I got very similar results. I expected some differences, but 75% "missingness" does not seem right. Any suggestions?
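For reference, the position-based comparison described above can be sketched in a few lines of Python. This is only an illustration on toy data, not the PLINK workflow that was actually run: the function names `read_bim_positions` and `overlap_fraction` and the example records are hypothetical, and the only thing assumed about the input is the standard six-column .bim layout (chromosome, variant ID, cM position, bp coordinate, allele 1, allele 2).

```python
def read_bim_positions(lines):
    """Parse PLINK .bim lines into a set of (chromosome, bp) tuples."""
    positions = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, bp = fields[0], int(fields[3])
        # Normalise "chr1" vs "1" so both datasets use the same naming.
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        positions.add((chrom, bp))
    return positions


def overlap_fraction(query, reference):
    """Fraction of query positions that are also present in reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


# Toy example (made-up records, not real data).
bim_lines = [
    "1\trs1\t0\t1000\tA\tG",
    "1\trs2\t0\t2000\tC\tT",
    "chr1\trs3\t0\t3000\tG\tA",
    "1\trs4\t0\t4000\tT\tC",
]
reference_positions = {("1", 1000), ("1", 3000), ("2", 1000)}

exome_positions = read_bim_positions(bim_lines)
shared = overlap_fraction(exome_positions, reference_positions)
print(f"{shared:.0%} of exome positions found in the reference set")
```

Matching on chromosome and position alone ignores alleles, so a check like this can only overestimate the true overlap; it would not explain a low match rate by itself.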

1000Genomes PLINK SNP EXOME • 2.7k views

What kind of data do you have: disease or healthy donors? If disease, are there matched normals?

My data contains 6500 people with type-2 diabetes and 6500 people without. It was used in “The genetic architecture of type 2 diabetes”, Nature 536, 2016 (https://www.nature.com/articles/nature18642).

300,000 SNPs on chr1 for an exome capture data set? We usually have <100,000 confident exonic SNPs across the whole genome for whole-exome sequencing. What is the on-target rate and the fraction of targets at >=30x?

Hi, unfortunately I didn’t take part in the creation of the dataset; I’m just a simple “end user”. The data is part of the international type-2 diabetes consortium, and was used in “The genetic architecture of type 2 diabetes”, Nature 536, 2016 (https://www.nature.com/articles/nature18642). These two paragraphs are from the methods section of the paper, and maybe they can help:

“...Exome sequencing. Genomic DNA was sheared, end repaired, ligated with barcoded Illumina sequencing adapters, amplified, size selected, and subjected to in-solution hybrid capture using the Agilent SureSelect Human All Exon 44Mb v2.0 (DGI, FUSION, UK2T2D) and v3.0 (KORA) bait set (Agilent Technologies, USA). Resulting Illumina exome sequencing libraries were qPCR quantified, pooled, and sequenced with 76-bp paired-end reads using Illumina GAII or HiSeq 2000 sequencers to ~82-fold mean coverage...”

“...Coverage and QC of aligned sequence reads. We excluded 151 exome samples with average coverage ≤20× in >20% of the target bases and 68 genome samples with average coverage ≤5×….”

Thanks, so this is a population VCF with >10,000 samples? Then the callset is dominated by rare alleles such as singletons and doubletons (allele counts of 1 and 2). If you subset your VCF to common variants (MAF>1%) you will have a large intersection with 1000 Genomes, and for the rare ones it is no surprise that many are absent from 1000 Genomes.
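One way to check this is to split the variants at the 1% MAF cut-off and compute the overlap with 1000 Genomes separately for each group. The following is a minimal sketch on toy data, assuming you already have a per-variant MAF (for example from the MAF column of a PLINK 1.9 `--freq` report) and a set of positions present in 1000 Genomes; the function name `overlap_by_frequency` and all values are made up for illustration.

```python
def overlap_by_frequency(mafs, reference_keys, threshold=0.01):
    """Report, for common and rare variants separately, how many variants
    there are and what fraction of them appear in reference_keys.

    mafs: dict mapping a variant key, e.g. (chromosome, bp), to its MAF.
    reference_keys: set of variant keys present in the reference panel.
    """
    groups = {"common": [], "rare": []}
    for key, maf in mafs.items():
        groups["common" if maf > threshold else "rare"].append(key)

    summary = {}
    for name, keys in groups.items():
        found = sum(1 for key in keys if key in reference_keys)
        fraction = found / len(keys) if keys else None
        summary[name] = (len(keys), fraction)
    return summary


# Toy example (made-up frequencies, not real data).
exome_mafs = {
    ("1", 1000): 0.25,
    ("1", 2000): 0.0001,
    ("1", 3000): 0.05,
    ("1", 4000): 0.00004,
    ("1", 5000): 0.02,
}
kg_positions = {("1", 1000), ("1", 2000), ("1", 3000), ("1", 5000)}

summary = overlap_by_frequency(exome_mafs, kg_positions)
for name, (count, fraction) in summary.items():
    print(name, count, fraction)
```

If the explanation above holds for your data, the "common" group should show a high overlap and the "rare" group should account for most of the missing variants.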
