Haplotype Phasing with Shapeit2
8.3 years ago
SOHAIL ▴ 410

Hi everyone,

I am new to haplotype phasing. I have WGS data for 14 individuals from one human population and I want to phase them with SHAPEIT2. The supplementary information of the 1000 Genomes paper says that phasing was performed in two steps:

  1. Creation of Haplotype scaffolds from microarray genotypes
  2. Joint phasing of biallelic SNPs, Indels and high-confidence deletions onto the haplotype scaffold.

My questions are:

  1. My starting point is a set of VCF files containing both SNPs and INDELs, called with the standard GATK pipeline. I don't have genotype array data for the sequenced individuals. Which recent haplotype reference panel should I use for phasing?

  2. Is haplotype phasing population specific (i.e. can individuals from different populations have different haplotype structures)? In other words, can variant sets from individuals of different populations be phased together?

  3. Is there a comprehensive online tutorial that walks through SHAPEIT2 step by step, starting from VCF files?

I would be very happy to read any suggestions on how to get started with haplotype phasing with SHAPEIT.

Thanks in advance!

ngs • 12k views

I am also interested in your question 2, about whether populations should be phased together or separately. This is the information I found (see figure 4):

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3415548/

Note: I haven't confirmed whether the findings are applicable to other phasing algorithms.

8.3 years ago
Shab86 ▴ 310

The 1KG approach to phasing is a bit different from how everyone else does it. It leverages samples that were both genotyped on arrays and sequenced at low coverage to create the haplotype scaffolds. However, you don't need to follow the same approach, as you don't have both genotype and sequence data for your samples.

1). The simplest approach is therefore to use SHAPEIT2 to phase your dataset without a reference panel: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gettingstarted What you need for this is a human genetic map (HapMap phase II b37, for example): https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gmap

On the other hand, if you want to use a reference panel, you can use the 1KG one or the HapMap one: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_pilot_plus_hapmap3.html They work quite well on admixed population data and give quite accurate phased haplotypes.

2). These can be a good starting point: http://www.nature.com/nrg/journal/v12/n10/full/nrg3054.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4888899/pdf/nihms757128.pdf

3). A tutorial is available on the SHAPEIT2 website itself, and I believe it is quite detailed: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gettingstarted One way to start is to convert the VCF to PLINK binary format by importing it with plink --vcf and writing it out with --make-bed (note that --recode vcf goes the other way, from PLINK to VCF): https://www.cog-genomics.org/plink2/data#recode Once you have that, the SHAPEIT website is more than enough to begin phasing.
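As a rough sketch of the conversion and reference-free phasing described above, something like the following could work. The file names (cohort.vcf.gz, the genetic map name), the chromosome and the thread count are placeholders I made up, and the script only prints the commands unless DRY_RUN=0 is set. It assumes PLINK 1.9 and SHAPEIT2 are on your PATH, and it phases one chromosome at a time, which is how SHAPEIT2 expects its input.

```shell
#!/bin/sh
# Dry-run sketch: VCF -> PLINK binary files -> SHAPEIT2 phasing without a reference panel.
# All file names below are hypothetical placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

CHR=20
VCF="cohort.vcf.gz"                              # assumed multi-sample GATK VCF
PREFIX="cohort.chr${CHR}"
GMAP="genetic_map_chr${CHR}_combined_b37.txt"    # assumed map on the same build as the VCF

# Import the VCF and write .bed/.bim/.fam for a single chromosome.
# --double-id uses the VCF sample ID as both family and individual ID.
run plink --vcf "$VCF" --double-id --chr "$CHR" --make-bed --out "$PREFIX"

# Phase that chromosome using only the genetic map (no reference panel).
run shapeit --input-bed "$PREFIX.bed" "$PREFIX.bim" "$PREFIX.fam" \
            --input-map "$GMAP" \
            --output-max "$PREFIX.phased" \
            --thread 4
```

Running it as is just echoes the two commands, so you can check paths before launching the real job with DRY_RUN=0.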


Thank you very much @Shab86 for explaining everything so comprehensively. I have a few more questions:

  1. About the genetic map and reference panel you suggested above: the human genetic map (HapMap phase II b37): https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gmap and the 1KGP reference panel: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_pilot_plus_hapmap3.html

Both look rather old, especially the reference panel. Could you please suggest a more recent genetic map and reference panel that can be used with SHAPEIT2? I read about a recent one here: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#reference https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference

What do you think?

  2. Secondly, I read in the SHAPEIT tutorial that VCF files can also be used as the input format. What do you think of that approach? https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#input

    shapeit --input-vcf gwas.vcf \
            -M genetic_map.txt \
            -O gwas.phased

Read-aware phasing might be another option to look at.

  3. Again, the reference panel contains haplotypes for individuals from multiple populations. Should I use only individuals from one continental group (for instance the 1KGP "EUR" group) as the reference, or should I use the complete reference panel?

  4. Do you know anything about the "--no-mcmc" option in SHAPEIT? I mean, when should it be used and when should it be avoided?

Thank you very much!


Yes, go ahead with the 1KG 2014 reference panel from the IMPUTE2 website. You can also use the hg38 genetic map (that's UCSC's most recent build, I believe), as long as the map, the reference panel and your VCF are all on the same genome build. And yes, use VCF files as SHAPEIT input; there's no difference in the output whether you use PLINK or VCF.
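For the VCF route, a minimal sketch might look like this. The file names are hypothetical, the script only prints commands unless DRY_RUN=0 is set, and it assumes the VCF holds a single chromosome. The second step uses SHAPEIT2's -convert mode to write the phased haplotypes back out as a VCF.

```shell
#!/bin/sh
# Dry-run sketch: phase directly from a VCF, then export the result as a phased VCF.
# All file names below are hypothetical placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

VCF="cohort.chr20.vcf"                 # assumed single-chromosome VCF
GMAP="genetic_map_chr20.txt"           # assumed map on the same build as the VCF
OUT="cohort.chr20.phased"              # prefix for the .haps/.sample output

# Phase from VCF input (same flags as the gwas.vcf example in the thread).
run shapeit --input-vcf "$VCF" --input-map "$GMAP" --output-max "$OUT"

# Convert the phased haplotypes to VCF for downstream tools.
run shapeit -convert --input-haps "$OUT" --output-vcf "$OUT.vcf"
```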

Lastly, it depends on which population you have. The HapMap/1KG reference panels are quite diverse and represent populations of European origin quite well. However, if you have, say, a really isolated population, an admixed reference panel might not be a good idea. So, where are the samples from?
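If you do decide to restrict the reference panel to one group, a sketch along these lines could be a starting point. It assumes a reference panel in IMPUTE2 hap/legend/sample format whose sample file has a group column containing "EUR"; all file names are placeholders, and nothing is executed unless DRY_RUN=0 is set. The first SHAPEIT2 call is the alignment check against the panel, the second phases while skipping the sites flagged by that check.

```shell
#!/bin/sh
# Dry-run sketch: phase against a reference panel restricted to one group.
# All file names below are hypothetical placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

PREFIX="cohort.chr20"
GMAP="genetic_map_chr20.txt"
REF_HAP="ref_chr20.hap.gz"
REF_LEGEND="ref_chr20.legend.gz"
REF_SAMPLE="ref.sample"
GROUP_LIST="group.list"

# One group label per line; it must match a label in the panel's sample file.
printf 'EUR\n' > "$GROUP_LIST"

# Step 1: check strand and site alignment between the study data and the panel.
# In a real run, inspect the log before continuing.
run shapeit -check \
            --input-bed "$PREFIX.bed" "$PREFIX.bim" "$PREFIX.fam" \
            --input-map "$GMAP" \
            --input-ref "$REF_HAP" "$REF_LEGEND" "$REF_SAMPLE" \
            --output-log "$PREFIX.alignments"

# Step 2: phase, excluding flagged sites and keeping only the chosen group.
run shapeit --input-bed "$PREFIX.bed" "$PREFIX.bim" "$PREFIX.fam" \
            --input-map "$GMAP" \
            --input-ref "$REF_HAP" "$REF_LEGEND" "$REF_SAMPLE" \
            --exclude-snp "$PREFIX.alignments.snp.strand.exclude" \
            --include-grp "$GROUP_LIST" \
            --output-max "$PREFIX.phased.with.ref"
```

Dropping the --include-grp line gives the "use the whole panel" variant, so it is easy to run both and compare.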

Also, don't use the --no-mcmc option: it is meant only for very, very small samples, and although it speeds up phasing, it increases haplotype estimation errors!
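Rather than switching the MCMC off, you can set the iteration counts and model parameters explicitly. The sketch below is only an illustration: the numbers are example values, not a recommendation, so check them against the SHAPEIT2 documentation for your data. File names are placeholders and nothing runs unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch: keep the MCMC on and spell out its settings.
# File names and parameter values below are illustrative assumptions.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

VCF="cohort.chr20.vcf"
GMAP="genetic_map_chr20.txt"
OUT="cohort.chr20.phased.mcmc"

# --burn/--prune/--main set the number of burn-in, pruning and main iterations;
# --states is the number of conditioning haplotypes; --window is in Mb.
run shapeit --input-vcf "$VCF" \
            --input-map "$GMAP" \
            --burn 7 --prune 8 --main 20 \
            --states 100 \
            --window 0.5 \
            --thread 4 \
            --output-max "$OUT"
```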
