Hello everybody,
I need some advice on reconstructing haplotypes from SNP genotypes for different sets of individuals. Here is the situation:
I am using fastPHASE. I have two levels of analysis: the global level, looking at a set of 940 individuals, and the regional level, looking at subsets of these 940 individuals (100 individuals for Africa, 64 for America, and so on).
I have filtered the SNPs for MAF >= 0.05 and a known genotype in >= 90% of the individuals, separately for each region and for the global level (giving different subsets of SNPs). So I am wondering whether I have to run fastPHASE for each region, or whether, for each region, I can extract the haplotypes for my subsets of SNPs and individuals from the phased data obtained at the global level. This is possible because the global SNP set contains all the SNPs of each regional subset.
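To make the filter concrete, here is a minimal sketch of it in Python. The data layout (a dictionary of genotype tuples, with `None` for a missing genotype), the function name `passing_snps`, and the toy individuals are all assumptions for illustration; none of this is fastPHASE's file format. It also checks, rather than assumes, whether a regional SNP set is contained in the global one.

```python
def passing_snps(genotypes, min_maf=0.05, min_call_rate=0.90):
    """Return the set of SNP indices passing the MAF and call-rate filters.

    genotypes: dict mapping individual ID -> list of genotypes, one per SNP.
    Each genotype is a 2-tuple of alleles, or None when it is missing.
    (Hypothetical in-memory layout, not a fastPHASE format.)
    """
    n_ind = len(genotypes)
    n_snps = len(next(iter(genotypes.values())))
    kept = set()
    for j in range(n_snps):
        counts = {}   # allele -> number of copies observed
        called = 0    # individuals with a known genotype at SNP j
        for calls in genotypes.values():
            g = calls[j]
            if g is None:
                continue
            called += 1
            for allele in g:
                counts[allele] = counts.get(allele, 0) + 1
        if called / n_ind < min_call_rate:
            continue
        # MAF = frequency of the second most common allele (0 if monomorphic)
        freqs = sorted(counts.values(), reverse=True)
        maf = freqs[1] / sum(freqs) if len(freqs) > 1 else 0.0
        if maf >= min_maf:
            kept.add(j)
    return kept


# Toy data: 4 individuals, 3 SNPs (made up for illustration)
toy = {
    "ind1": [("A", "G"), ("C", "C"), None],
    "ind2": [("A", "A"), ("C", "C"), ("T", "T")],
    "ind3": [("G", "G"), ("C", "C"), ("T", "C")],
    "ind4": [("A", "G"), ("C", "C"), ("T", "C")],
}
global_kept = passing_snps(toy)
africa_kept = passing_snps({k: toy[k] for k in ("ind1", "ind2")})

# SNPs that pass in the region but not globally; empty means containment holds
not_in_global = africa_kept - global_kept
print(sorted(global_kept), sorted(africa_kept), sorted(not_in_global))
```

If `not_in_global` is non-empty for some region, those SNPs would not be available in the globally phased data.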
Since fastPHASE is very time-consuming, extracting from the phased data obtained at the global level would save me a lot of time: I would not have to run fastPHASE for the 7 regions. On the other hand, I guess that fastPHASE does not behave the same way when intermediate SNPs are present (extracting from haplotypes phased at the global level) as when they are not (running fastPHASE for each region). How large would you expect the difference to be?
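The extraction step itself is only bookkeeping. A rough sketch, assuming the globally phased haplotypes have already been parsed into a dictionary of allele strings (the function `extract_region`, the rs identifiers, and the toy haplotypes are hypothetical, and this does not read fastPHASE output files):

```python
def extract_region(phased, snp_ids, region_individuals, region_snps):
    """Pull regional haplotypes out of globally phased data.

    phased: dict individual ID -> (hap1, hap2), each a string holding one
            allele per SNP, in the same order as snp_ids.
    snp_ids: list of SNP identifiers for the global SNP set.
    region_individuals: individuals belonging to the region.
    region_snps: SNP identifiers that passed the regional filter.
    """
    wanted = set(region_snps)
    # Column positions of the regional SNPs, kept in global order
    cols = [i for i, snp in enumerate(snp_ids) if snp in wanted]
    out = {}
    for ind in region_individuals:
        hap1, hap2 = phased[ind]
        out[ind] = (
            "".join(hap1[i] for i in cols),
            "".join(hap2[i] for i in cols),
        )
    return out


# Toy example: 3 individuals phased on 4 SNPs at the global level
snp_ids = ["rs1", "rs2", "rs3", "rs4"]
phased = {
    "ind1": ("ACGT", "ACGA"),
    "ind2": ("GCGT", "ACTT"),
    "ind3": ("ACTT", "GCGA"),
}
region = extract_region(phased, snp_ids, ["ind1", "ind3"], ["rs1", "rs3", "rs4"])
print(region)
```

This only subsets rows and columns; it says nothing about whether the phase estimated in the global run matches what a region-only run would give, which is the open question here.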
I do not know whether this is clear enough, or whether you have a definitive answer. Anyway, thanks for your help!
Cheers, Pierre
This is an interesting question. Can you add links for the software you are using, please?
Here it is: http://stephenslab.uchicago.edu/software.html There are two quite similar programs: PHASE and fastPHASE. The algorithm is slightly different in the second one, which makes it faster but more approximate.
Can you provide some more information? I think this is a very interesting topic, but also a very complex one. I would like to know about the input data for the software. I assume it takes the genotypes of each person as a vector, e.g. (C,C/A,T,T/A); is that correct? Does the software recognize the subgroups if you give it the whole population data? I mean, can you specify subgroups at all?
Did you contact the author already?
Oops, sorry, I was not alerted that there was a comment to answer. The input data is as follows: a genotype file, with two vectors of A/T/G/C for each individual (one per chromosome), and an optional population file, in which you specify the population code for each individual. And that's it! I do not know whether you know the HGDP panel: I have 39 populations in 7 regions. I specified membership in one of the 39 populations for each individual. Do you need more details?