ShapeIT check problematic SNPs
0
0
Entering edit mode
3.6 years ago
SHN ▴ 40

Hello All,

I am using shapeit to phase my genotype data using 1000G phase3 as a reference genome. In order to phase my genotypes, I am using SHAPEIT. in the first step, I am using the command shapeIT to remove the problematic SNPs as has been mentioned in the link below: Phasing with SHAPEIT

However, when in ran the second round of SHAPEIT -check to exclude the problematic variants using --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude (the link above), I still get error and files with alignment.snp.strand.exclude.

So here are my questions:

  • Is it correct not to get all the problematic SNPs/variants in the first round of shapeit --check?
  • Should I continue this step (below) again to remove the newly found problematic SNPs/variants until I get no error?

Thank you for your responses,

for chr in X {1..22}; do 

  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]

  then

    srun --mem=8 --cpus-per-task=4 --partition=serial \

      shapeit \

        -check \

        -B temp \

        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \

        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \

        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
        -T 8 ;
  fi

done ;

rm temp.* ;
genotypes imputev2 shapeit problematic • 2.1k views
ADD COMMENT
0
Entering edit mode

Hi, I wrote the code in the other threads. Can you show the first command for step 1? What is the error message(s) that you are receiving?

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thanks for your message. I actually am doing the same steps. here is the code for the first step:

for i in {1..22};do
echo "* Chromosome: "$i
touch $data_path/checktemp
plink --bfile $start_file_name --chr "$i" --make-bed --bp-space 1 --out $result_unpahsed/temp  &> $data_path/checktemp && echo "  -- First step Run Checks!"
echo "*Reference haplotype: "$reference_data_path/1000GP_Phase3_chr"$i".hap.gz

if [ "$i" != "X" ]; then
    $SHAPEIT -check \
        -B $result_unpahsed/temp \
        -M $reference_data_path/genetic_map_chr"$i"_combined_b37.txt \
        --input-ref $reference_data_path/1000GP_Phase3.sample $reference_data_path/1000GP_Phase3_chr"$i".hap.gz $reference_data_path/1000GP_Phase3_chr"$i".legend.gz \
        --output-log $result_unpahsed/data_qced_chr"$i"_alignments \
        -T 8;
fi
done

After this step, I get files data_qced_chr2_alignments.snp.strand.exclude" and "data_qced_chr2_alignments.snp.strand".

I use this file "_alignments.snp.strand.exclude" for the second round of shapeit -checks and I still get missing sites with variants saved in ".snp.strand.exclude".

and the output files is as below:

MODE ESC[33m-summariseESC[0m : GENERATING SUMMARY STATISTICS OF THE INPUT DATA
  • Autosome (chr1 ... chr22)
  • Reference panel of haplotypes used

Parameters :

  • Seed : 1620163254
  • Parallelisation: 1 threads
  • Ref allele is NOT aligned on the reference genome

Reading SNPs to exclude from input file in [path/data_qced_chr5_alignments.snp.strand.exclude]

  • 96128 snps found in the exclude list

Reading site list in [PATH/qc_plink/unphased_chr/temp.bim]

  • 21882 sites included
  • 12 sites excluded

Reading sample list in [PATH/qc_plink/unphased_chr/temp.fam]

  • 328 samples included
  • 328 unrelateds / 0 duos / 0 trios in 328 different families

Reading genotypes in [/PATH/qc_plink/unphased_chr/temp.bed]

  • Plink binary file SNP-major mode

Reading sample list [/PATH/1000GP_Phase3/1000GP_Phase3_chr5.legend.gz]

  • 10587830 reference haplotypes included

Reading SNPs in [PATH/1000GP_Phase3/1000GP_Phase3_chr5.hap.gz]

  • 0 reference panel sites included
  • 5293914 reference panel sites excluded

ERROR: Reference and Main panels are not well aligned:

  • #Missing sites in reference panel = 21882
  • #Misaligned sites between panels = 0
  • #Multiple alignments between panels = 0

Should I continue excluding variants with another round of shapeit -check and then prephasing? Or are these the new variants that should be used for the prephasing step (step3 in your codes) mentioned here: Phasing with SHAPEIT

Thanks,

ADD REPLY
0
Entering edit mode

Hello again. hmmm, the --exclude-snp does not actually remove variants; so, if you run it again, it will just produce the same result.

I wonder have you done QC on your input data? Which array is it? Prior to doing the imputation, the code for which I shared here on Biostars in the other threads, I first performed QC, and then successfully pre-phased and imputed ImmunoChip and Illumina GSA again 1000 Genomes Phase III.

By QC, I mean filtering out variants with high missingness, multi-allelic records, duplicates, etc



You can also just try to proceed with the pre-phasing - not sure.

ADD REPLY
1
Entering edit mode

Thanks for your input. Yes I did the QC on the genotypes, ran shapeit -check, running step2! and I stopped here!

I thought that when we exclude variants in the second step, then the program should not produce additional variants.

My array is Infinium omni2.5, Illumina and I want to prophase and impute it against 1000G phase III.

un-alignments can affect imputation accuracy. I should probably redo everything just to make sure everything is running correctly.

ADD REPLY

Login before adding your answer.

Traffic: 1504 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6