Question

Mask a SNP before imputation

1

3.4 years ago

raalsuwaidi ▴ 100

hello all,

I am still experimenting with imputation, and at the moment I am using IMPUTE2.

My question is this:

How do I mask a SNP?

I have seen some research that says to use X or N. Is that done in the VCF or in the FASTA file? Will it still work with IMPUTE2 after I generate the GEN file to be imputed?

Wouldn't it be easier to use BCFtools to filter out some SNPs? Would that be considered masking?

I am very sorry for all the questions, but I couldn't find many answers so far.

genotype bcftools impute2 mask VCF • 1.8k views

ADD COMMENT • link 3.2 years ago by raalsuwaidi ▴ 100

1

First off - don't use IMPUTE2 - it's nearly 10 years old. Please use a newer method such as IMPUTE5 or QUILT. There's a reason people spend a lot of time making new methods :) They are much faster, more memory efficient, and generally more convenient to use. Second - why do you want to 'mask' SNPs? Do you want to leave some positions unaffected by the imputation?

ADD REPLY • link 3.4 years ago by 4galaxy77 2.9k

1

Thank you so much for the recommendation.

As for why I want to mask SNPs, it's to compare the imputed values to the ground truth, as a way to show whether the imputation is effective.

ADD REPLY • link 3.4 years ago by raalsuwaidi ▴ 100

1

Ah right - so you want to set some SNPs as missing, impute them, and then compare the imputed genotypes to the truth? If so, I will write an answer underneath.

ADD REPLY • link 3.4 years ago by 4galaxy77 2.9k

0

This is exactly what I need :)

I am also working on getting IMPUTE5, as you advised. Too bad it isn't available through Anaconda.

ADD REPLY • link 3.4 years ago by raalsuwaidi ▴ 100

0

I got access to IMPUTE5. Waiting for your answer.

ADD REPLY • link 3.4 years ago by raalsuwaidi ▴ 100

1

Masking is the correct word for it. This is the how-to:

Step 1: Start from the original genotypes (from sequence-level data, perhaps). Use VCFtools/BCFtools to subset the SNPs you will impute from (this is the set that remains once the mask set has been taken out).

Step 2: In parallel, repeat step 1, but this time keep the rest (the mask set) as a separate file. This is your TRUTH set so to speak.

Step 3: Impute the genotypes from Step 1 against your reference panel (with IMPUTE5, for example). Then use VCFtools/BCFtools again on the result to extract the mask set. This is your IMPUTE set, so to speak.

Step 4: Use a comparison program (SnpSift or GATK GenotypeConcordance, for example) to compare the IMPUTE and TRUTH sets. You can then calculate concordance, correlation, imputation error rate, etc.
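To make the bookkeeping behind these steps concrete, here is a minimal Python sketch. It only illustrates the logic, assuming you have already pulled site IDs and GT strings out of your VCFs; it is not a replacement for BCFtools, IMPUTE5, or GATK GenotypeConcordance. The function names (`split_sites`, `allele_count`, `concordance`) and the toy data are hypothetical, made up for this example.

```python
import random


def split_sites(site_ids, mask_fraction=0.1, seed=42):
    """Randomly split a list of site IDs into (keep, mask) lists.

    'keep' is the set you impute from (Step 1); 'mask' is the held-out
    TRUTH set (Step 2). Input order is preserved in both lists.
    """
    rng = random.Random(seed)
    n_mask = round(len(site_ids) * mask_fraction)
    masked = set(rng.sample(site_ids, n_mask))
    keep = [s for s in site_ids if s not in masked]
    mask = [s for s in site_ids if s in masked]
    return keep, mask


def allele_count(gt):
    """Turn a VCF GT string ('0/1', '1|0', ...) into an ALT allele count.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(a != "0" for a in alleles)


def concordance(truth, imputed):
    """Fraction of genotypes where the imputed call matches the truth (Step 4).

    Both arguments map (site_id, sample_id) -> GT string. Entries that are
    missing in either set are skipped. Phase is ignored.
    """
    matches = total = 0
    for key, true_gt in truth.items():
        t = allele_count(true_gt)
        i = allele_count(imputed.get(key, "./."))
        if t is None or i is None:
            continue
        total += 1
        matches += (t == i)
    return matches / total if total else float("nan")


if __name__ == "__main__":
    sites = [f"rs{i}" for i in range(20)]  # toy site IDs
    keep, mask = split_sites(sites, mask_fraction=0.25, seed=1)
    print(len(keep), "sites kept,", len(mask), "sites masked")

    # Toy TRUTH and IMPUTE genotypes at the masked sites
    truth = {("rs1", "S1"): "0/1", ("rs1", "S2"): "1/1", ("rs2", "S1"): "0/0"}
    imputed = {("rs1", "S1"): "1|0", ("rs1", "S2"): "0|1", ("rs2", "S1"): "0|0"}
    print("concordance:", concordance(truth, imputed))
```

In practice you would write the `keep` and `mask` ID lists to files and hand them to BCFtools/VCFtools to do the actual subsetting, and you would probably want a dosage correlation as well as plain concordance, since concordance is inflated at rare variants.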

Hope this helps.

ADD REPLY • link 3.3 years ago by tuan.vietnguyen90 ▴ 10

0

This is very helpful, thank you so much.

ADD REPLY • link 3.2 years ago by raalsuwaidi ▴ 100