Edit June 7, 2020:
The code below is for pre-phasing with SHAPEIT2. For phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs, see my answer here: A: ERROR: You must specify a valid interval for imputation using the -int argument,
So, the steps are usually:
- pre-phasing into pre-existing haplotypes available from HERE (
C: Phasing with SHAPEIT )
- phased imputation and generation of phaed VCFs (
A: ERROR: You must specify a valid interval for imputation using the -int argument, )
----------------
Thanks - good to know that there is now a GRCh38 version of that data! - I utilise GRCh37, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
If your Illumina microarray is a special design, then it may only target very specific regions and have minimal overlap.
Otherwise, as mentioned, the use of SHAPEIT involves a 3 step process:
- checking overlap of your data with the reference panel (via
-check
)
- removal of problematic variants (also via
check
)
- pre-phase using filtered input data
1, first run a QC check (will throw error)
for chr in X {1..22}; do
plink --bfile MyData --chr "${chr}" --make-bed --out temp
if [ "${chr}" != "X" ]
then
srun --mem=8 --cpus-per-task=4 --partition=serial \
shapeit \
-check \
-B temp \
-M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
--input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
--output-log Prephased/MyData_chr"${chr}"_alignments \
-T 8 ;
fi
done ;
rm temp.* ;
This should generate files with extensions _alignments.snp.strand.exclude. Use these in the next step via --exclude-snp
:
2, exclude problematic variants that were found
for chr in X {1..22}; do
plink --bfile MyData --chr "${chr}" --make-bed --out temp
if [ "${chr}" != "X" ]
then
srun --mem=8 --cpus-per-task=4 --partition=serial \
shapeit \
-check \
-B temp \
-M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
--input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
--exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
-T 8 ;
fi
done ;
rm temp.* ;
This should now run to completion and not return any error.
NB - this does not actually remove the variants from your data. It just excludes them when SHAPEIT is trying to determine the alignment between your data and the reference. If this command runs to completion without error, then you can proceed to the next step, #3
3, now perform pre-phasing
Here, we again instruct SHAPEIT to not include the problematic variants. Ultimately, these will therefore be lost from the dataset from this point.
for chr in X {1..22}; do
plink --bfile MyData --chr "${chr}" --make-bed --out temp
if [ "${chr}" != "X" ]
then
srun --mem=12 --cpus-per-task=8 --partition=serial \
shapeit \
-B temp \
-M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
--input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
--exclude-snp Prephased/GSA_QCd_chr"${chr}"_alignments.snp.strand.exclude \
-O Prephased/MyData_chr"${chr}"_1KGphased \
-T 8 ;
fi
done ;
rm temp.* ;
Have you tried to flip the SNPs that are discordant between your data and your reference (e.g.: --flip function in PLINK) and see if fewer SNPs are excluded?
thank you I will try this
What is the source of the data that you are imputing, and what is the reference panel? Also, can you share the command(s) that you are using? I have recently completed 2 imputations - for each, I ran 3 separate SHAPEIT commands in order to ensure that the data was correctly pre-phased.
Hi Kevin, It's a genotyping data from Illumina chip, Its 37 buit but I did the liftover to 38. Firstly I run the check command with shapeit to compare my data to reference panel (1000 genomes) and this step infom me that a lot of SNPs (for example for chr1 more 30000 will be excluded...) Thank you for your help
I see - thank you! From where did you obtain the 1000 Genomes data? - most data that is available is GRCh37.
This link : http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/