Question

plink : Batch effect issues after merge of two datasets

3.7 years ago

Nicolas Rosewick 11k

Hi,

I merged two PLINK datasets using:

# take only SNP present in both datasets
plink --keep-allele-order --bfile dataset_A --extract snp_in_common.txt --make-bed --out dataset_A_common
plink --keep-allele-order --bfile dataset_B --extract snp_in_common.txt --make-bed --out dataset_B_common

echo dataset_A_common > merge.txt
echo dataset_B_common >> merge.txt

# merge datasets
plink --merge-list merge.txt --make-bed --out dataset_merge

# filter out SNPs with low frequency, low genotyping rate, or strong deviation from HWE
plink --maf 0.01 --geno 0.05 --hwe 0.00001 --bfile dataset_merge --make-bed --out dataset_merge
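
The post does not show how snp_in_common.txt was built. A minimal sketch of one way to do it, assuming both datasets use the same variant IDs in column 2 of their .bim files (the helper name `common_snps` is hypothetical):

```shell
# Hypothetical helper: print variant IDs present in both .bim files.
# Column 2 of a PLINK .bim file holds the variant ID.
common_snps() {
    # First file (NR == FNR): remember every ID.
    # Second file: print the IDs that were already seen in the first.
    awk 'NR == FNR { seen[$2] = 1; next } ($2 in seen) { print $2 }' "$1" "$2"
}

# Usage with the file names from the post:
# common_snps dataset_A.bim dataset_B.bim > snp_in_common.txt
```

Matching IDs alone do not guarantee that both datasets report the same alleles on the same strand, so this only covers the ID intersection.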

I performed a PCA (after pruning the merged dataset):

# pruning
plink --bfile dataset_merge --exclude high-ld-regions.txt --range --indep-pairwise 50 5 0.2 --out dataset_merge
plink --bfile dataset_merge --extract dataset_merge.prune.in --make-bed --out dataset_merge_pruned

# pca
plink --pca --bfile dataset_merge_pruned --out dataset_merge_pruned

When I plot the PCA, it clearly shows a strong batch effect between the two datasets:

[Image: PCA plot of the merged dataset]

I continued the analysis by running a logistic regression:

plink --bfile dataset_merge --covar pca_file.txt --covar-name PC1,PC2 --logistic --out dataset_merge
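
How pca_file.txt was produced is not shown. `--covar-name` needs a header line in the covariate file, while PLINK 1.9 writes .eigenvec without one by default. A minimal sketch, assuming pca_file.txt is derived from dataset_merge_pruned.eigenvec (the helper name `add_pc_header` is hypothetical):

```shell
# Hypothetical helper: prepend "FID IID PC1 PC2 ..." to a headerless .eigenvec file.
# Columns 1-2 are family and individual IDs; the remaining columns are the PCs.
add_pc_header() {
    awk 'NR == 1 {
             printf "FID IID"
             for (i = 3; i <= NF; i++) printf " PC%d", i - 2
             printf "\n"
         }
         { print }' "$1"
}

# Usage with the file names from the post:
# add_pc_header dataset_merge_pruned.eigenvec > pca_file.txt
```

PLINK 1.9 can also write the header itself via the `header` modifier of `--pca`, which would make this step unnecessary.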

Looking at the Manhattan plot and the p-value histogram, something is clearly wrong: most p-values are close to 1.

[Image: Manhattan plot and p-value histogram]

Any idea how to solve this?

Thank you

P.S.: I also posted this on the PLINK Google group. Sorry for the cross-post. I can remove this post if needed.

plink merge • 2.5k views

ADD COMMENT • link updated 2.9 years ago by mkasan • 0 • written 3.7 years ago by Nicolas Rosewick 11k

I am not an expert at all, but it looks like the PCs explain your dataset perfectly, so the mutations do not matter anymore: with the PCs as covariates, mutations are no longer needed for discrimination. Maybe PC1 does not actually separate the two datasets and correcting for PC2 alone is enough? Sorry if I said something not so smart. How does a PCA of the 0/1s look? If it separates well along PC1, then correcting for PC1 kills all the meaning in the mutations...

ADD REPLY • link 3.7 years ago by German.M.Demidov ★ 2.9k

Here, adjusting for PCs is meant to account for population stratification. I would not expect such strong separation between the two datasets, since both are germline data (already pre-filtered for Caucasian ancestry).

ADD REPLY • link 3.7 years ago by Nicolas Rosewick 11k

From my experience, I would say such a large difference can be explained by different enrichment kits used to generate the two datasets. A PCA of a EUR population usually looks like an angle |_ , so the picture you have is not typical of population separation; it looks more like a technical batch effect. But the most important thing is how your cases and controls are distributed across this merged dataset. I'd plot them in different colors. Since you have such large p-values, I bet cases and controls are split across batches along the PC1 axis; thus PC1 already explains the case/control separation and there is no variance left to be explained by mutations.
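
The case/control distribution across the two batches can also be checked without a plot. A minimal sketch that counts phenotype codes per .fam file, assuming the standard coding in column 6 (1 = control, 2 = case, 0 or -9 = missing); the helper name `pheno_counts` is hypothetical:

```shell
# Hypothetical helper: count samples per phenotype code in a PLINK .fam file.
# Column 6 holds the phenotype (assumed coding: 1 = control, 2 = case).
pheno_counts() {
    awk '{ n[$6]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}

# Usage with the file names from the post:
# pheno_counts dataset_A_common.fam
# pheno_counts dataset_B_common.fam
```

If one file contains only code 2 and the other only code 1, batch and phenotype are fully confounded.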

ADD REPLY • link 3.7 years ago by German.M.Demidov ★ 2.9k

Thanks German.M.Demidov. One important piece of information I left out of my original post is that dataset_A contains the cases and dataset_B the controls in my case/control logistic analysis. I'm still struggling to understand why the p-value distribution is so skewed towards 1 (I thought p-values should be uniformly distributed under the null). Thanks
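
The skew can be quantified directly from the association output rather than read off a plot. A minimal sketch that bins the SNP p-values into ten equal-width bins, assuming the PLINK 1.9 .assoc.logistic layout with TEST and P columns and one ADD row per SNP (the helper name `pval_hist` is hypothetical):

```shell
# Hypothetical helper: 10-bin histogram of SNP p-values from a .assoc.logistic file.
# Only TEST == "ADD" rows are counted, so covariate rows (PC1, PC2, ...) are skipped.
pval_hist() {
    awk '
        NR == 1 {
            # Locate the TEST and P columns by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "P") pc = i
                if ($i == "TEST") tc = i
            }
            next
        }
        $tc == "ADD" && $pc != "NA" {
            b = int($pc * 10)
            if (b > 9) b = 9        # P == 1 goes into the last bin
            n[b]++
        }
        END {
            for (b = 0; b < 10; b++)
                printf "%.1f-%.1f %d\n", b / 10, (b + 1) / 10, n[b]
        }
    ' "$1"
}

# Usage with the file names from the post:
# pval_hist dataset_merge.assoc.logistic
```

Under the null, the ten counts should be roughly equal; a pile-up in the 0.9-1.0 bin matches the histogram described in the question.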

ADD REPLY • link 3.7 years ago by Nicolas Rosewick 11k

Oh, then it will be problematic. The logistic regression tests whether a frequency difference in mutation X can discriminate between cases and controls. But when it is given PC2 as a covariate, it does not need mutation X at all to discriminate cases from controls; it says "everything with PC2 > 0 is a case, everything below is a control". That is already enough for the logistic regression. Thus, p-values are shifted towards 1 because PC2 already separates the two sets and no mutation is needed!

I am afraid this is a situation without a good solution. If cases come from one population and controls from another, there is no way to distinguish real case/control differences from population differences...

ADD REPLY • link 3.7 years ago by German.M.Demidov ★ 2.9k

Hello, I'm facing a similar issue with my SNP data. I have 2 batches from different time points, and a control sample has been included in both batches. I have already merged them using PLINK. When I performed a PCA, I found that the two copies of my control sample are positioned far from each other, indicating a batch effect. The PCA uses eigenvectors created by PLINK. The eigenvector values of the same sample from the two batches look like this:

batch  cell_name  V1      V2
2      Hela_1     -0,341  0,005
1      Hela_1     -0,209  0,046

How can I position them on top of each other? Do you know any R packages or PLINK commands to solve this? I'd appreciate the help.

Cheers,

Merve

ADD REPLY • link 2.9 years ago by mkasan • 0