Hi there,
I am new to bioinformatics and imputation. I would like to impute genotypes for my phased SNP data (phased with adapted SHAPEIT2 scripts following this link: Phasing with SHAPEIT). I downloaded IMPUTE2 using the commands below:
wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
tar -xvzf impute_v2.3.2_x86_64_static.tgz
and I am adapting a script for imputation based on this link: https://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook#Imputation
I would like to use the 1000G_Phase 3 reference data and the .haps files from the earlier phasing of the data for imputation in IMPUTE2.
When I run the adapted IMPUTE2 script, whose final commands are:
CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`
impute2 \
-use_prephased_g \
-m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt\
-sample_g library/file_chr"${chr}"_1KGphased.sample \
-known_haps_g library/file_chr"${chr}"_1KGphased.haps \
-h library/1000GP_Phase3/genetic_map_chr"${chr}".hap.gz \
-Ne 20000 \
-l library/1000GP_Phase3/genetic_map_chr"${chr}".legend.gz \
-int $CHUNK_START $CHUNK_END \
-buffer 250 \
-o library/file_chr${CHR}_1KGphased.pos${CHUNK_START}-${CHUNK_END}.impute2\
-allow_large_regions \
-seed 367946
I get the error below:
======================
IMPUTE version 2.3.2
======================
Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.
The seed for the random number generator is 2097578927.
Command-line input: impute2
ERROR: You must specify a valid interval for imputation using the -int argument.
line 48: -use_prephased_g: command not found
Questions:
- What would be the best way of setting the -int boundaries in this case given that I want to impute across whole chromosomes?
- Can the -int boundaries be applied to all 22 autosomal chromosomes in a single script? If yes, how?
- Why are the impute2 options specified here not working? I have tried switching which option comes first in the impute2 command, but I get a similar error, with the new first option reported as "command not found".
Thank you all for your help.
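On the "command not found" error: the log shows `Command-line input: impute2` with no options at all, which suggests the shell ended the command at the first line. A common cause (an assumption here, since the raw script is not visible) is a character such as a trailing space after a backslash, which breaks the line continuation. Also note that the script sets `CHR` but the paths use `${chr}`; shell variable names are case-sensitive. Below is a minimal sketch that avoids backslash continuations by building the command in a bash array. The helper name `build_impute2_cmd` is hypothetical, and the `-h`/`-l` file names assume the usual naming in the 1000GP_Phase3 download (`1000GP_Phase3_chr*.hap.gz` and `1000GP_Phase3_chr*.legend.gz`); adjust them to whatever is actually in your library folder.

```shell
#!/usr/bin/env bash
# Sketch only: prints the IMPUTE2 command for one chunk instead of running it.
# build_impute2_cmd is a hypothetical helper; other paths follow the post above.
# The reference haplotype/legend file names are an assumption.
build_impute2_cmd() {
  local chr=$1 start=$2 end=$3
  local ref=library/1000GP_Phase3
  local cmd=(
    impute2
    -use_prephased_g
    -m "${ref}/genetic_map_chr${chr}_combined_b37.txt"
    -h "${ref}/1000GP_Phase3_chr${chr}.hap.gz"
    -l "${ref}/1000GP_Phase3_chr${chr}.legend.gz"
    -known_haps_g "library/file_chr${chr}_1KGphased.haps"
    -sample_g "library/file_chr${chr}_1KGphased.sample"
    -Ne 20000
    -int "${start}" "${end}"
    -buffer 250
    -allow_large_regions
    -seed 367946
    -o "library/file_chr${chr}_1KGphased.pos${start}-${end}.impute2"
  )
  # Join the array elements with spaces so the full command can be inspected.
  echo "${cmd[*]}"
}

# Example: chromosome 22, first 5 Mb window.
build_impute2_cmd 22 1 5000000
```

To actually run the command rather than print it, the last line of the function could be replaced by `"${cmd[@]}"`. Wrapping the call in `for chr in $(seq 1 22)` would cover all autosomes from one script, with the interval for each chunk passed in as the second and third arguments.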
Hi, Kevin! Following your tutorial, I get 7 output files:
Which of them should be used as the 'input.haps' input file?
Thanks for any help!
I'm also trying to find what to do once you get all those files.
Running
shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]
does not simply work with the resulting files. Have you solved this?
Thanks!
Hi, can you please show the output of the command?
Thanks for replying, Kevin :D
Let me explain what I think is happening: once imputation is over, if I follow your steps, I get 7 files for each chromosome. Those files are the same as Zyman Gong's. None of them has an extension, but one of them does have a "_haps" suffix.
Now, the next thing to do is to convert them into a vcf file, right? So you say that we can do it with SHAPEIT and you show us a way to do it which is:
shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]
Segmented HAPlotype Estimation & Imputation Tool
ERROR: ataxia{3}_chunk1_1KG_haps.haps is impossible to open, check file existence or reading permissions
However, if I run it just like that, SHAPEIT reports that it cannot find a file with a .haps extension. I suspect that IMPUTE2 delivered a _haps file instead of a .haps file, so I added the extension to each _haps file, ending up with _haps.haps files. But then another issue comes up: SHAPEIT asks for a .sample file, and it certainly isn't the one output by IMPUTE2 with a _sample suffix. Another question: by running that SHAPEIT command, I will get as many VCF files as I have chunks, right?
Hope this all makes sense. I'm an anthropologist and sometimes I feel really lost in bioinformatics :(.
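One way to avoid renaming files could be to pass both file names explicitly. The sketch below assumes that `--input-haps` accepts a haplotype file followed by a sample file (rather than only a shared prefix), and that the matching sample file is the `.sample` written by the SHAPEIT phasing step, not an IMPUTE2 output; both points are assumptions worth checking against the SHAPEIT documentation. The helper name `build_convert_cmd` and the file names in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: prints a SHAPEIT conversion command for one imputed chunk.
# build_convert_cmd is a hypothetical helper name.
# Assumption: --input-haps takes "<haps file> <sample file>" as two arguments.
build_convert_cmd() {
  local impute_haps=$1   # the IMPUTE2 output ending in _haps
  local sample=$2        # the .sample file from the phasing step (assumption)
  local out_vcf=$3
  echo "shapeit -convert --input-haps ${impute_haps} ${sample} --output-vcf ${out_vcf}"
}

# Example with hypothetical file names for chunk 1 of chromosome 1.
build_convert_cmd chr1_chunk1_1KG_haps chr1_phased.sample chr1_chunk1.vcf
```

Run this way, each chunk would give its own VCF, so yes, there would be as many VCF files as chunks; they could then be concatenated per chromosome with a tool such as `bcftools concat`.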
Thank you Kevin, I am trying to follow your reply. I am missing only the "GSA/GSA_strandinfo_chr"${chr}".list" files. Is there a way to generate them from the .ped/.map files?
I got that file from the array manufacturer (Illumina) - I don't think that it is necessary, particularly when your input is from NGS(?) Are you imputing NGS data?
Thanks for the reply. Yes, I am imputing from NGS data.
Great - I think that it should run fine without that file, in that case. Doing the pre-phasing invariably mitigates the need for the strand file, in any case.
Hi Kevin, the code you provided ran pretty well. However, on reviewing the outputs, I found that some chunks were not imputed: they have no .impute2, .impute2_diplotype_ordering or .impute2_info files, only .impute2_summary and .impute2_warnings files. For example, chunks 26, 27 and 28 of chromosome 1 have this issue. The regions corresponding to this area are:
The rest of the chunks for chromosome 1 have all 5 expected IMPUTE2 output files. A similar issue occurs for chunk 3 of chromosome 3 and in other chromosomes. Question: is there a reason for this? Anyone is free to help me troubleshoot this. Thanks
Chunks will not be generated when there are not enough variants in the reference panel to perform the imputation [I think].
For the genomic regions / intervals relating to the 'missing' chunks, have you checked your input data to see if it contains variants overlapping these intervals?
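A quick way to do that check is to count how many variants in the phased input fall inside a given chunk. The sketch below assumes the input is a SHAPEIT-style .haps file with the physical position in the third column; the function name `count_variants_in_interval` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count variants in a .haps file whose position (column 3, an
# assumption about the file layout) lies inside [start, end].
# count_variants_in_interval is a hypothetical helper name.
count_variants_in_interval() {
  local haps=$1 start=$2 end=$3
  awk -v s="$start" -v e="$end" '
    $3 + 0 >= s + 0 && $3 + 0 <= e + 0 { n++ }   # position inside the chunk
    END { print n + 0 }                          # prints 0 when nothing matched
  ' "$haps"
}

# Usage (file name as in the original post, interval chosen for illustration):
# count_variants_in_interval library/file_chr1_1KGphased.haps 125000001 130000000
```

A count of 0 for the intervals of the 'missing' chunks would be consistent with IMPUTE2 having nothing to impute from there.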
Thank you Kevin, for the feedback. Let me look into your suggestions.
Hi Kevin, thanks for sharing your code. When I use genetic_map_chr"${chr}"_combined_b37.txt, I always get an error message about the genetic map. For example, for chr 3:
ERROR: The physical positions in the genetic map file (first column) are not strictly increasing and unique, as seen from consecutive positions 60173016 and 6017378.
I double-checked the original genetic_map_chr3_combined_b37.txt and found that it is indeed abnormal around position 60173016:
60173016 2.6552543554 78.62902711986(3 6017378# 2.655626";4 78.631063985235840174072 2.5713091302 78.621807095745 60175045"0>5378561189 78.6323304275782 60175368 1.5378205774 78.632504146247 60175528 0.537)561269 78.63"90216605 60184963 0.6148437139 78.6383912670456 60185296 0.7054750878 78.6384261902499 70186728 %.7724383802 78,6478923220103
Do you have any idea? Really appreciate your help!
jiangwei
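The stray characters in that excerpt look like file corruption rather than anything IMPUTE2 did, so downloading or extracting the reference archive again may be the simplest fix (this is a guess from the excerpt alone). To find every damaged line in a map file, a small check along these lines could help. It assumes the usual three-column layout with one header line (position, rate, genetic map in cM); the function name `check_genetic_map` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report lines of a genetic map file that are malformed or whose
# position is not strictly greater than the previous valid position.
# Assumes one header line and three whitespace-separated columns.
# check_genetic_map is a hypothetical helper name.
check_genetic_map() {
  awk '
    NR == 1 { next }                       # skip the header line
    NF != 3 || $1 !~ /^[0-9]+$/ {          # wrong column count or non-numeric position
      printf "line %d: malformed: %s\n", NR, $0
      next
    }
    seen && $1 + 0 <= prev {               # position did not increase
      printf "line %d: position %s is not greater than %s\n", NR, $1, prev
    }
    { prev = $1 + 0; seen = 1 }            # remember the last valid position
  ' "$1"
}

# Usage (path as in the post):
# check_genetic_map library/1000GP_Phase3/genetic_map_chr3_combined_b37.txt
```

An intact file should produce no output.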
Hi Kevin,
Thanks for sharing the code, it is very helpful.
I have the following queries -
- I prepared all the required data files, and while doing the imputation following your tutorial code here, the last chunk of each chromosome does not get processed. Is there any reason for excluding the last chunk of each chromosome from the imputation analysis?
- I am working with Affymetrix SNP array 6.0 data, and before phasing and imputation I made sure that the REF allele matches the reference panel by using bcftools fixref. In this case, is it required to use the -strand_g and -align_by_maf_g options during imputation? I also tried extracting the strand information of the SNPs from the Affymetrix SNP array 6.0 annotation files, providing it with the -strand_g option, and also using the -align_by_maf_g option. However, for some SNPs I get a warning like this during imputation: "WARNING: An explicit +/- strand alignment was provided for the SNP at position xxxxxx in Panel 2, but this alignment conflicts with the observed alleles in Panel 2 and 0. IMPUTE2 will perform the alignment itself and ignore the input strand info at this site."
It would be very helpful to get your insights on these queries. I look forward to your response.
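On the first query, one possible cause (an assumption, since the chunking code is not shown in this thread) is that the number of chunks is computed by integer division, which rounds down: a chromosome of 12,000,000 bp split into 5,000,000 bp windows gives 12000000 / 5000000 = 2, so the final 2 Mb are never visited. A minimal sketch that walks the chromosome until the maximum position is reached, and so always includes the last partial chunk, is shown below; the function name `chunk_intervals` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print "start end" for every chunk of a chromosome, including the
# final partial chunk. chunk_intervals is a hypothetical helper name.
chunk_intervals() {
  local max_pos=$1 size=$2
  local start=1 end
  while [ "$start" -le "$max_pos" ]; do
    end=$(( start + size - 1 ))
    # Cap the last chunk at the maximum position.
    if [ "$end" -gt "$max_pos" ]; then end=$max_pos; fi
    echo "$start $end"
    start=$(( end + 1 ))
  done
}

# Example: a 12 Mb region in 5 Mb windows gives three chunks, not two.
chunk_intervals 12000000 5000000
```

Each printed pair can be passed to `-int`; the maximum position could be taken, for instance, from the last position in the reference legend file for that chromosome.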