Mask a SNP before imputation
3.5 years ago
raalsuwaidi ▴ 100

Hello all,

I am still experimenting with imputation, and I am currently using IMPUTE2.

My question is this:

How do I mask a SNP?

I have seen some research that says to use X or N. Is this done in the VCF or in the FASTA file? Will it still work with IMPUTE2 after I generate the GEN file to be imputed?

Isn't it easier to use BCFtools to filter out some SNPs? Would that be considered masking?

I am very sorry for all the questions, but I couldn't find many answers so far.

genotype bcftools impute2 mask VCF • 1.8k views

First off - don't use IMPUTE2 - it's nearly 10 years old. Please use a newer method such as IMPUTE5 or QUILT. There's a reason people spend a lot of time making new methods :) They are much faster, generally more convenient to use, and more memory efficient. Second - why do you want to 'mask' SNPs? Do you want to leave some positions unaffected by the imputation?

Thank you so much for the recommendation.

As for why I want to mask SNPs: it's to compare the imputed values to the ground truth, as a way to show whether the imputation is effective.

Ah right - so you want to set some SNPs as missing, impute them, and then compare the imputed genotypes to the truth? If so, I will write an answer underneath.

This is exactly what I need :)

I am also working on getting IMPUTE5 as you advised. Too bad it isn't available as an Anaconda package.

I got access to IMPUTE5. Waiting for your answer.

Masking is the correct word for it. Here is the how-to:

Step 1: Start from the original genotypes (from sequence-level data, perhaps). Use VCFtools/BCFtools to subset the SNPs you will impute from (this is the set that remains after you remove the mask set).

Step 2: In parallel, repeat Step 1, but this time keep the rest (the mask set) as a separate file. This is your TRUTH set, so to speak.

Step 3: Impute the genotypes from Step 1 against your reference panel (with IMPUTE5, for example). Then use VCFtools/BCFtools again on the result to extract the mask set. This is your IMPUTE set, so to speak.

Step 4: Use a comparison program (SnpSift or GATK GenotypeConcordance, for example) to compare the IMPUTE and TRUTH sets. You can then calculate concordance, correlation, imputation error rate, etc.
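To make the logic of Steps 1, 2 and 4 concrete, here is a minimal pure-Python sketch on plain VCF text lines. It is only an illustration under stated assumptions: the function names (`split_mask`, `genotypes`, `concordance`) are hypothetical, the mask set is chosen at random, and sites are matched on CHROM, POS, REF and ALT. On real data you would do the subsetting with VCFtools/BCFtools and the comparison with a dedicated tool, as described above.

```python
import random


def split_mask(vcf_lines, mask_fraction=0.1, seed=0):
    """Hypothetical helper: split VCF lines into (header, kept, masked).

    'kept' is the input for imputation (Step 1); 'masked' is the TRUTH
    set (Step 2). The mask is drawn at random, which is an assumption.
    """
    rng = random.Random(seed)
    header = [ln for ln in vcf_lines if ln.startswith("#")]
    records = [ln for ln in vcf_lines if ln and not ln.startswith("#")]
    n_mask = min(len(records), max(1, round(len(records) * mask_fraction)))
    masked_idx = set(rng.sample(range(len(records)), n_mask))
    kept = [r for i, r in enumerate(records) if i not in masked_idx]
    masked = [r for i, r in enumerate(records) if i in masked_idx]
    return header, kept, masked


def genotypes(record):
    """Return (site key, list of genotypes with phase ignored)."""
    f = record.rstrip("\n").split("\t")
    gt_i = f[8].split(":").index("GT")
    gts = []
    for sample in f[9:]:
        gt = sample.split(":")[gt_i].replace("|", "/")
        # Sort alleles so that 0/1, 1/0 and 1|0 compare as equal.
        gts.append("/".join(sorted(gt.split("/"))))
    return (f[0], f[1], f[3], f[4]), gts


def concordance(truth_records, imputed_records):
    """Fraction of non-missing truth genotypes that the imputation matched."""
    truth = dict(genotypes(r) for r in truth_records)
    imputed = dict(genotypes(r) for r in imputed_records)
    match = total = 0
    for site, t_gts in truth.items():
        i_gts = imputed.get(site)
        if i_gts is None:
            continue  # site missing from the imputed output
        for t, i in zip(t_gts, i_gts):
            if "." in t:
                continue  # skip missing truth genotypes
            total += 1
            match += (t == i)
    return match / total if total else float("nan")
```

In use, you would write `header + kept` to a file for imputation, run the imputation, extract the masked sites from the result, and pass those records together with `masked` to `concordance`. Plain genotype concordance can look inflated for rare variants, so a correlation-based measure is worth computing alongside it.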

Hope this helps.

This is very helpful, thank you so much.
