3.4 years ago
raalsuwaidi
Hello all,
I am still experimenting with imputation in bioinformatics, and at the moment I am using IMPUTE2.
My question is this:
How do I mask a SNP?
I have seen some research that says to use X or N. Is this done in the VCF or in the FASTA file? Will it work with IMPUTE2 after I generate the GEN file to be imputed?
Isn't it easier to use BCFtools to filter out some SNPs? Would that be considered masking?
I am very sorry for all the questions, but I couldn't find many answers so far.
First off, don't use IMPUTE2: it's nearly 10 years old. Please use a newer method such as IMPUTE5 or QUILT. There's a reason people spend a lot of time making new methods :) They are much faster, generally more convenient to use, and more memory efficient. Second, why do you want to 'mask' SNPs? Do you want to leave some positions unaffected by the imputation?
Thank you so much for the recommendation.
As for why I want to mask SNPs, it's to compare the imputed values to the ground truth, as a way to show whether the imputation is effective.
Ah right, so you want to set some SNPs as missing, impute them, and then compare the imputed genotypes to the truth? If so, I will write an answer underneath.
This is exactly what I need :)
I am also working on getting IMPUTE5, as you advised. Too bad it doesn't come with the Anaconda libraries.
I got access to IMPUTE5. Waiting for your answer.
Masking is the correct word for it. This is the how-to:
Step 1: Start from the original genotypes (from sequence-level data, perhaps). Use VCFtools/BCFtools to subset the starting SNPs to impute from (this is the set that remains after you remove the mask set).
Step 2: In parallel, repeat Step 1, but this time keep the rest (the mask set) as a separate file. This is your TRUTH set, so to speak.
Step 3: Impute the genotypes from Step 1 against your reference panel (with IMPUTE5, for example). Then use VCFtools/BCFtools again on the result to extract the mask set. This is your IMPUTE set, so to speak.
Step 4: Use a comparison program (SnpSift or GATK GenotypeConcordance, for example) to compare the IMPUTE and TRUTH sets. You can then calculate concordance, correlation, imputation error rate, etc.
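To make the logic of the steps above concrete, here is a minimal sketch in plain Python. It is not how SnpSift or GATK GenotypeConcordance work internally, and it does not call VCFtools/BCFtools or IMPUTE5. It assumes you have already pulled genotypes out of your VCFs into dictionaries keyed by (site, sample), and that all sites are biallelic. The function names (split_sites, dosage, compare) and the 10% mask fraction are made up for illustration.

```python
import random


def split_sites(site_ids, mask_fraction=0.1, seed=42):
    """Randomly split site IDs into (kept, masked) lists.

    'kept' is the set you impute from (Step 1);
    'masked' is the set you hold back as TRUTH (Step 2).
    """
    rng = random.Random(seed)
    ids = list(site_ids)
    n_mask = max(1, round(len(ids) * mask_fraction))
    masked = set(rng.sample(ids, n_mask))
    kept = [s for s in ids if s not in masked]
    return kept, sorted(masked)


def dosage(gt):
    """Turn a GT string such as '0/1' or '1|0' into an alt-allele count.

    Returns None for missing calls such as './.'.
    Assumes biallelic sites (alleles are only 0 or 1).
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(int(a) for a in alleles)


def compare(truth, imputed):
    """Compare TRUTH and IMPUTE genotypes at the masked sites (Step 4).

    Both arguments map (site, sample) -> GT string.
    Pairs where either call is missing are skipped.
    """
    pairs = []
    for key, truth_gt in truth.items():
        t = dosage(truth_gt)
        i = dosage(imputed.get(key, "./."))
        if t is not None and i is not None:
            pairs.append((t, i))

    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable genotype pairs")

    concordance = sum(1 for t, i in pairs if t == i) / n

    # Squared Pearson correlation between true and imputed dosages.
    mean_t = sum(t for t, _ in pairs) / n
    mean_i = sum(i for _, i in pairs) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_i = sum((i - mean_i) ** 2 for _, i in pairs)
    r2 = cov ** 2 / (var_t * var_i) if var_t > 0 and var_i > 0 else float("nan")

    return {
        "n": n,
        "concordance": concordance,
        "error_rate": 1 - concordance,
        "r2": r2,
    }
```

In practice you would write the 'kept' and 'masked' site lists to files and pass them to VCFtools/BCFtools for the actual subsetting, then feed the extracted TRUTH and IMPUTE genotypes to compare(). Comparing dosages ignores phase, so '0|1' and '1/0' count as a match.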
Hope this helps.
This is very helpful, thank you so much.