Merging WGS and SNP array data
6.2 years ago
sankar2004 ▴ 60

Hi

I combined SNP array and WGS data and plotted a PCA. Individuals from the same population did not cluster together, because the SNVs were obtained from the two different genotyping methods mentioned above. How do you remove this bias/discrepancy from these datasets?

Thanks for your help


Thanks a lot, Kevin. I will try that.

6.2 years ago

With your array data, you will have to filter out variants not called on the coding (+ / plus) strand, and then also filter these out of the NGS WGS data. Take a look at what I have written for Step 6, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

You can download the library files from the array manufacturer for the purposes of determining on which strand each probe genotypes.
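As a rough illustration of the filtering described above, here is a minimal Python sketch. It assumes the manufacturer's library file has been exported to CSV with a probe ID column and a strand column. The column names `Name` and `RefStrand`, the helper names, and the toy data are all assumptions for illustration; check the actual manifest for your array before relying on them.

```python
import csv
import io

def plus_strand_ids(manifest_text, id_col="Name", strand_col="RefStrand"):
    """Return the set of probe IDs annotated on the plus (+) strand.

    id_col and strand_col are hypothetical column names; adjust them to
    match the manifest you downloaded from the array manufacturer.
    """
    reader = csv.DictReader(io.StringIO(manifest_text))
    return {row[id_col] for row in reader if row[strand_col].strip() == "+"}

def shared_plus_strand_variants(manifest_text, wgs_ids):
    """Keep only variants that are on the plus strand AND present in the WGS data."""
    keep = plus_strand_ids(manifest_text)
    return sorted(keep & set(wgs_ids))

# Toy manifest and WGS variant list (made-up IDs, for illustration only)
manifest = """Name,RefStrand
rs1,+
rs2,-
rs3,+
"""
wgs = ["rs1", "rs2", "rs4"]

print(shared_plus_strand_variants(manifest, wgs))
```

The resulting ID list could then be used as a keep list when subsetting both datasets before the merge.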

Kevin


Instead of removing all SNPs that reference opposite strands between array and WGS before the merge, couldn't you just try to harmonize strand with a tool like [Genotype Harmonizer] or [conform-gt]?

Regardless, I am also wondering whether it is good practice to merge WGS and SNP array data at all if it is avoidable. For example, this study merged some older studies, for which only SNP array calls were available, with array-based calls for 1000 Genomes, even though the 1000 Genomes sequence data could have been used instead.
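To make the strand harmonization idea above concrete, here is a small Python sketch of the basic allele-level logic. This is not how any particular tool is implemented; the function name `harmonize` and its return labels are made up for illustration. It compares the two alleles reported by the array with those in the reference (e.g. WGS) data, flips by complementing when that makes them match, and drops strand-ambiguous A/T and C/G SNPs, which cannot be resolved from the alleles alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def harmonize(array_alleles, ref_alleles):
    """Classify one biallelic SNP as 'same', 'flip', or 'drop'.

    array_alleles / ref_alleles: pairs of single-base alleles, e.g. ("A", "G").
    Hypothetical helper for illustration only.
    """
    a = frozenset(array_alleles)
    r = frozenset(ref_alleles)
    flipped = frozenset(COMPLEMENT[x] for x in a)
    if a == flipped:
        # A/T or C/G SNP: both strands give the same allele pair,
        # so the strand cannot be inferred from alleles alone
        return "drop"
    if a == r:
        return "same"
    if flipped == r:
        return "flip"
    # Alleles disagree even after complementing
    return "drop"

print(harmonize(("A", "G"), ("A", "G")))
print(harmonize(("T", "C"), ("A", "G")))
print(harmonize(("A", "T"), ("A", "T")))
```

Variants classified as "flip" would have their alleles complemented in the array data before merging; those classified as "drop" would be excluded from both datasets.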


Yes, one can certainly try that. It's just a fair bit of extra effort.


Appreciate it. Do people generally avoid merging sequence and array data if it can be avoided? For example, in cases where one study has both array and sequence data and the other has only array data, would most people just go for the array-only merge?


I don't know, to be honest, but I imagine that most people are not in the habit of merging both datatypes, unless it's to do something like in my tutorial (comparing old array genotyping data against the more modern NGS-derived 1000 Genomes dataset to infer ethnicity), or, of course, to perform imputation of old array datasets.


I would think that's the case. I was just going to compare allele frequencies (AF) between array/sequencing and array/array, in addition to PCA. I would expect the differences between samples of the same ancestry to be larger when they were genotyped on different technologies than on the same one, and the comparison is easy enough to do.
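A minimal Python sketch of the AF comparison mentioned above, assuming genotypes are already coded as alternate-allele dosages (0, 1, 2, or `None` for a missing call) and keyed by variant ID. The function names and the toy genotypes are made up for illustration.

```python
def alt_allele_freq(dosages):
    """Alternate allele frequency from 0/1/2 dosages; None marks a missing call."""
    called = [d for d in dosages if d is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))

def af_differences(set_a, set_b):
    """Absolute AF difference per variant shared between two datasets."""
    diffs = {}
    for vid in set_a.keys() & set_b.keys():
        fa = alt_allele_freq(set_a[vid])
        fb = alt_allele_freq(set_b[vid])
        if fa is not None and fb is not None:
            diffs[vid] = abs(fa - fb)
    return diffs

# Toy dosages for samples of the same ancestry on two platforms (illustrative)
array_geno = {"rs1": [0, 1, 2, None], "rs2": [0, 0, 1, 1]}
wgs_geno = {"rs1": [0, 1, 1, 1], "rs2": [2, 2, 1, 1]}

for vid, diff in sorted(af_differences(array_geno, wgs_geno).items()):
    print(vid, diff)
```

A variant whose frequency in one dataset is close to one minus its frequency in the other (like `rs2` in the toy data) may point to a strand or allele-coding mismatch rather than a real difference, so such variants are worth checking before interpreting the PCA.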
