ERROR: You must specify a valid interval for imputation using the -int argument, -use_prephased_g: command not found, in IMPUTE2
1
1
Entering edit mode
4.8 years ago
jmukisa90 ▴ 30

Hi there,

I am new to Bioinformatics and imputation. I would like to impute genotypes for my phased SNP data (Used adapted SHAPEIT2 scripts following this link, Phasing with SHAPEIT . I downloaded Impute2 using the commands below:

  wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
  tar -xvzf impute_v2.3.2_x86_64_static.tgz

and adapting a script for imputation based on the link: https://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook#Imputation

I would like to use the 1000G_Phase 3 reference data and the .haps files from the earlier phasing of the data for imputation in IMPUTE2.

when I run the adapted IMPUTE2 scripts :

with final commands in the script as

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`
impute2      \
    -use_prephased_g \
    -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt\
    -sample_g library/file_chr"${chr}"_1KGphased.sample \
    -known_haps_g  library/file_chr"${chr}"_1KGphased.haps \
    -h  library/1000GP_Phase3/genetic_map_chr"${chr}".hap.gz \
    -Ne 20000 \
    -l library/1000GP_Phase3/genetic_map_chr"${chr}".legend.gz \
     -int $CHUNK_START $CHUNK_END \
    -buffer 250 \
    -o library/file_chr${CHR}_1KGphased.pos${CHUNK_START}-${CHUNK_END}.impute2\
    -allow_large_regions \
    -seed 367946

I get the error below:

======================
 IMPUTE version 2.3.2
======================

Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

The seed for the random number generator is 2097578927.

Command-line input: impute2

ERROR: You must specify a valid interval for imputation using the -int argument.
line 48: -use_prephased_g: command not found

Questions:

  1. What would be the best way of setting the -int boundaries in this case given that I want to impute across whole chromosomes?
  2. Can the -int boundaries be applied to all the 22 autosomal chromosomes in this a single script?If yes, how?
  3. why are the impute2 options specified here not working? I have tried switching which option comes first in the impute2 command but I get similar errors of the new first option "command not found"?

Thank you all for your help.

software error impute2 • 5.3k views
ADD COMMENT
3
Entering edit mode
4.7 years ago

Edit June 7, 2020:

The code below is for phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs. For the initial pre-phasing process with SHAPEIT2, see my answer here: Phasing with SHAPEIT

So, the steps are usually:

  1. pre-phasing into pre-existing haplotypes available from HERE ( Phasing with SHAPEIT )
  2. phased imputation and generation of phased VCFs ( ERROR: You must specify a valid interval for imputation using the -int argument, -use_prephased_g: command not found, in IMPUTE2 )


Hi,

I gave the answer in the other thread, regarding the pre-phasing of data using SHAPEIT2 ( Phasing with SHAPEIT ). I can see that you are now a different user (?) who is doing the next step, i.e., the imputation, using the pre-phased haplotypes?

Unless you have a stick of RAM that's the size of the Sun, you will indeed have to do the imputation in chunks. You also need to therefore know the lengths of your chromosomes. Basically, this can be achieved via shell scripting. Here is how I did it for interrval ('chunk') sizes of 5 megabase (5 million bases):

for chr in {1..22}; do
  case "${chr}" in
    1)
      max=249250621
    ;;
    2)
      max=243199373
    ;;
    3)
      max=198022430
    ;; 
    4)
      max=191154276
    ;;
    5)
      max=180915260
    ;;
    6)
      max=171115067
    ;;
    7)
      max=159138663
    ;;
    8)
      max=146364022
    ;;
    9)
      max=141213431
    ;;
    10)
      max=135534747
    ;;
    11)
      max=135006516
    ;;
    12)
      max=133851895
    ;;
    13)
      max=115169878
    ;;
    14)
      max=107349540
    ;;
    15)
      max=102531392
    ;;
    16)
      max=90354753
    ;;
    17)
      max=81195210
    ;;
    18)
      max=78077248
    ;;
    20)
      max=63025520
    ;;
    19)
      max=59128983
    ;;
    22)
      max=51304566
    ;;
    21)
      max=48129895
    ;;
  esac

  chunk=1 ;
  interval=5000000 ;
  start=0 ;
  end="${interval}" ;

  while [ $end -lt $max ] ;
  do
    srun --mem=32 --cpus-per-task=32 --partition=serial \
      impute \
        -phase \
        \
        -use_prephased_g \
        -known_haps_g Prephased/GSA_QCd_chr"${chr}"_1KGphased.haps \
        -strand_g GSA/GSA_strandinfo_chr"${chr}".list \
        \
        -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        \
        -h library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz \
        -l library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz \
        \
        -align_by_maf_g \
        -int $((start+1)) "${end}" \
        -Ne 20000 \
        -o Imputed_Phased/GSA_chr"${chr}"_chunk"${chunk}"_1KG ;

    start=$(($start+$interval)) ;
    end=$(($end+$interval)) ;
    chunk=$(($chunk+1)) ;

    echo "${chr}" "${start}" "${end}" "${chunk}" ;
  done ;
done ;

I got the chromosome lengths from the fai file that's produced from samtools faidx for the GRCh37 1000 Genomes FASTA reference genome. You can see the link for this genome in step 3, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Also note that I add the -phase parameter, which will perform a phased imputation. With your code, an un-phased imputation will be performed. Some of your other parameters differ from mine, so, please check those.

Once your imputation is complete, you can convert the resulting haps files to vcf via:

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

After that, you'll need BCFtools commands to piece your data back together, and more time and RAM.

Trust that this assists you.

Kevin

ADD COMMENT
1
Entering edit mode

Hi, kevin! Following your tutorial, I get 7 output files:

Output files

           main output: imputed_chr1.phased.impute2
           SNP QC info: imputed_chr1.phased.impute2_info
         sample QC info: imputed_chr1.phased.impute2_info_by_sample
            run summary: imputed_chr1.phased.impute2_summary
            warning log: imputed_chr1.phased.impute2_warnings
   Panel 2 phased haps: imputed_chr1.phased.impute2_haps
   Panel 2 allele probs: imputed_chr1.phased.impute2_allele_probs

which file should be the 'input.haps' as -input file?

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

Thanks for any help!

ADD REPLY
0
Entering edit mode

I'm also trying to find what to do once you get all those files.

Running

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

does not simply works with the resulting files. Have you solved this?

Thanks!

ADD REPLY
0
Entering edit mode

Hola, ¿por favor se puede mostrar el resultado del comando? / Can you please show the output of the command?

ADD REPLY
0
Entering edit mode

Gracias por responder Kevin :D

Thanks for replying.

Let me explain what I think is happening: once imputation is over, if I follow your steps I got 7 files for each chromosome. Those files are the same as Zyman Gong's. None of them has a extension but one of them has indeed a "_haps" suffix.

Now, the next thing to do is to convert them into a vcf file, right? So you say that we can do it with SHAPEIT and you show us a way to do it which is:

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

$ shapeit \
  -convert \
  --input-haps ataxia{3}_chunk1_1KG_haps \
  --output-vcf ataxia_chunk1.vcf

Segmented HAPlotype Estimation & Imputation Tool

ERROR: ataxia{3}_chunk1_1KG_haps.haps is impossible to open, check file existence or reading permissions

However if I do this just like that SHAPEIT will tell me that there is no file with a .haps extension file. I suspect that somehow IMPUTE2 delivered a _haps instead a .haps file so what I did was to add the extension at the end of each _haps file so that I end up having a _haps.haps file. But here comes another issue: then SHAPEIT will ask for a .sample file but it certainly isn't the one outputted by IMPUTE2 with a _sample suffix. Another question that I have is that running that SHAPEIT command I will get as many vcf files as chunks I have, right?

Hope this all makes sense. I'm an anthropologist and sometimes I feel really lost in bioinformatics :(.

ADD REPLY
0
Entering edit mode

Thank you Kevin, I am trying to follow through with your reply. I am missing only the "GSA/GSA_strandinfo_chr"${chr}".list " .files. Is there a way it is generated from the .ped/.map files?

ADD REPLY
0
Entering edit mode

I got that file from the array manufacturer (Illumina) - I don't think that it is necessary, particularly when your input is from NGS(?) Are you imputing NGS data?

ADD REPLY
0
Entering edit mode

Thanks for the reply. Yes, I am imputing from NGS data.

ADD REPLY
0
Entering edit mode

Great - I think that it should run fine without that file, in that case. Doing the pre-phasing invariably mitigates the need for the strand file, in any case.

ADD REPLY
0
Entering edit mode

Hi Kevin, The code you provided ran pretty well. However, on review of the outputs, some chunks were not imputed (don't have .impute2 , .impute2_diplotype_ordering and .impute2_info files). They have .impute2_summary and .impute2_warnings files only. For example, for chromosome_1_chunk 26,chromosome_1_chunk27,chromosome1_chunk28 have this issue. The regions corresponding to this area are:

chr  start            end          chunk

1 130000000 135000000 26

1 135000000 140000000 27

1 140000000 145000000 28.

The rest of the chunks for chromosome 1 have all the 5 expected IMPUTE2 output files. A similar issue occurs for chromosome3_chunk3 and in other chromosomes. Qn: Is there a reason for this occurrence? Anyone is also free to help me troubleshoot this. Thanks

ADD REPLY
0
Entering edit mode

Chunks will not be generated when there are not enough variants in the reference panel to perform the imputation [I think].

For the genomic regions / intervals relating to the 'missing' chunks, have you checked your input data to see if it contains variants overlapping these intervals?

ADD REPLY
1
Entering edit mode

Thank you Kevin, for the feedback. Let me look into your suggestions.

ADD REPLY
0
Entering edit mode

Hi Kevin, Thanks for sharing your code. when I used genetic_map_chr"${chr}"_combined_b37.txt as reference panel , I always get error message about the genetic map. For example in chr 3.

ERROR: The physical positions in the genetic map file (first column) are not strictly increasing and unique, as seen from consecutive positions 60173016 and 6017378.

I double check the original genetic_map_chr3_combined_b37.txt , I did find it is abnormal around the position 60173016 . 60173016 2.6552543554 78.62902711986(3 6017378# 2.655626";4 78.631063985235840174072 2.5713091302 78.621807095745 60175045"0>5378561189 78.6323304275782 60175368 1.5378205774 78.632504146247 60175528 0.537)561269 78.63"90216605 60184963 0.6148437139 78.6383912670456 60185296 0.7054750878 78.6384261902499 70186728 %.7724383802 78,6478923220103

Do you have any idea? Really appreciate your help!

jiangwei

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thanks for sharing the code, it is very helpful.

I have the following queries -

  1. I prepared all the required data files and while doing the imputation following your tutorial code here, the last chunk of each chromosome do not get processed. Is there any reason for excluding the last chunk of each chromosome in the imputation analysis ?

  2. I am working with Affymetrix SNP array 6.0 data and before doing phasing and imputation, I made sure that the REF allele matches with the reference panel by utilizing bcftools fixref. In this case is it required to use the -strand_g and -align_by_maf_g options while doing imputation ?

  3. I also tried by extracting the strand information of the SNPs from the Affymetrix SNP array 6.0 annotation files and provided the strand details using the -strand_g option and also used the option -align_by_maf_g. However for some SNPs I get a warning message like this while doing the imputation - "WARNING: An explicit +/- strand alignment was provided for the SNP at position xxxxxx in Panel 2, but this alignment conflicts with the observed alleles in Panel 2 and 0. IMPUTE2 will perform the alignment itself and ignore the input strand info at this site."

It will be very helpful to get your insights on these queries. I look forward to your response.

ADD REPLY

Login before adding your answer.

Traffic: 1691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6