Hi, I'm a beginner in bioinformatics. I'm trying to calculate iHS and have run into some problems.
I tried to run WHAMM, but I cannot find the input file formats on its website. Could anyone provide an example?
As for genetic distance: all I have so far are the physical positions of my SNPs, and the data are unphased. I'm now trying to calculate genetic distances with PHASE from here.
I also have a reference genetic map, but my data contain more SNPs than the reference does. What do you suggest I do?
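One common workaround when your data contain SNPs that are missing from the reference map is to interpolate their genetic positions linearly between the flanking reference markers. Below is a minimal sketch of that idea, not taken from WHAMM or any other tool. The function name `interpolate_cm` and the choice to clamp positions outside the map to the nearest map end are assumptions made for illustration.

```python
from bisect import bisect_left

def interpolate_cm(ref_bp, ref_cm, query_bp):
    """Linearly interpolate genetic positions (cM) for SNPs in query_bp.

    ref_bp: physical positions of the reference map, strictly increasing.
    ref_cm: genetic positions of the same markers, non-decreasing.
    SNPs outside the map are clamped to the nearest map end (an assumption).
    """
    out = []
    for bp in query_bp:
        if bp <= ref_bp[0]:
            out.append(ref_cm[0])
        elif bp >= ref_bp[-1]:
            out.append(ref_cm[-1])
        else:
            j = bisect_left(ref_bp, bp)  # first reference marker at or after bp
            x0, x1 = ref_bp[j - 1], ref_bp[j]
            y0, y1 = ref_cm[j - 1], ref_cm[j]
            out.append(y0 + (y1 - y0) * (bp - x0) / (x1 - x0))
    return out

# Hypothetical reference map with three markers.
ref_bp = [1000, 2000, 4000]
ref_cm = [0.0, 1.0, 2.0]
print(interpolate_cm(ref_bp, ref_cm, [500, 1500, 3000, 5000]))
```

Interpolation assumes a constant recombination rate between neighbouring reference markers, so it is only as good as the density of the reference map.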
This version of the script contains two example input files that show how the data must be formatted. Note that the input format for the iHS script is tricky to get right, because you have to reproduce exactly the same tabs and spaces as in the example files.
P.S. The same folder contains a script for XP-EHH, a cross-population neutrality test based on the same principle as iHS. It can be used to compare two populations, detecting sweeps that occurred in one population but not in the other.
PS. Do you know how to implement the calculation of XP-EHH? My problem is that all the scores I got were 'nan'.
The software I used is from http://hgdp.uchicago.edu/Software/
For each population, XP-EHH is computed roughly like this:
1. Calculate EHH values in a region surrounding the SNP.
2. Find the integral of the EHH curve. This value is called iHH.
3. XP-EHH is the log ratio of the two populations' iHH values.
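The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the code of Pritchard's program or any other tool: haplotypes are lists of 0/1 alleles, EHH is taken as the fraction of haplotype pairs that are identical from the core SNP out to each marker, and the integral is a plain trapezoid sum over all markers supplied (real implementations usually truncate the integral once EHH drops below a cutoff, and standardize the result). All function names are made up for the example.

```python
from collections import Counter
from math import comb, log

def ehh_curve(haps, core, step):
    """EHH at each marker, walking from `core` in direction `step` (+1 or -1)."""
    n = len(haps)  # needs at least 2 haplotypes
    values = []
    i = core
    while 0 <= i < len(haps[0]):
        lo, hi = min(core, i), max(core, i)
        # Group haplotypes that are identical between the core and marker i.
        counts = Counter(tuple(h[lo:hi + 1]) for h in haps)
        values.append(sum(comb(c, 2) for c in counts.values()) / comb(n, 2))
        i += step
    return values

def ihh(haps, pos, core):
    """Trapezoid-rule integral of EHH over genetic position, both directions."""
    total = 0.0
    for step in (1, -1):
        ehh = ehh_curve(haps, core, step)
        for k in range(1, len(ehh)):
            a, b = core + step * (k - 1), core + step * k
            total += 0.5 * (ehh[k - 1] + ehh[k]) * abs(pos[b] - pos[a])
    return total

def xpehh_unstandardized(haps_a, haps_b, pos, core):
    """Log ratio of the two populations' iHH values at the core SNP."""
    return log(ihh(haps_a, pos, core) / ihh(haps_b, pos, core))

# Toy data: population A carries one long haplotype, population B does not.
pos = [0.0, 1.0, 2.0]
pop_a = [[0, 0, 0]] * 4
pop_b = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [1, 0, 1]]
print(xpehh_unstandardized(pop_a, pop_b, pos, core=1))
```

A positive value here means haplotypes around the core SNP are longer in the first population than in the second.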
Possible reasons for 'nan':
1. Somehow the ratio is negative (something is off in the program; areas shouldn't be negative here).
2. The population in the numerator has an iHH of 0. This doesn't seem probable (is it even possible?), although I know Pritchard's program only considers SNPs with a certain derived allele frequency.
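If you can get at the per-population iHH values, a small check like the one below may help narrow down which case you are hitting. It is only a sketch with a made-up function name. Note that under ordinary floating-point rules the log of a negative number and 0/0 give nan, whereas a zero numerator with a positive denominator gives log(0) = -inf rather than nan, so the function reports that case separately.

```python
import math

def log_ratio_or_reason(ihh_a, ihh_b):
    """Return (score, reason) for the log ratio of two iHH values."""
    if math.isnan(ihh_a) or math.isnan(ihh_b):
        return math.nan, "an iHH value is already nan (check how the input was parsed)"
    if ihh_b == 0:
        return math.nan, "denominator iHH is 0"
    ratio = ihh_a / ihh_b
    if ratio < 0:
        return math.nan, "ratio is negative (an iHH value is negative)"
    if ratio == 0:
        return -math.inf, "numerator iHH is 0, so the log is -inf"
    return math.log(ratio), "ok"

for a, b in [(2.0, 1.0), (1.0, -2.0), (0.0, 1.0), (1.0, 0.0)]:
    print(a, b, log_ratio_or_reason(a, b))
```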
Are you sure your input files are correctly formatted?
There's a new (2014) implementation of most of the 'EHH' algorithms that looks promising. The main caveat is that I haven't actually used it yet.
Haplotype-based scans to detect natural selection are useful to identify recent or ongoing positive selection in genomes. As both real and simulated genomic datasets grow larger, spanning thousands of samples and millions of markers, there is a need for a fast and efficient implementation of these scans for general use. Here we present selscan, an efficient multi-threaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score (iHS), and Cross-population Extended Haplotype Homozygosity (XPEHH). selscan accepts phased genotypes in multiple formats, including TPED, and performs extremely well on both simulated and real data and over an order of magnitude faster than existing available implementations. It calculates iHS on chromosome 22 (22,147 loci) across 204 CEU haplotypes in 353s on one thread (33s on 16 threads) and calculates XPEHH for the same data relative to 210 YRI haplotypes in 578s on one thread (52s on 16 threads). Source code and binaries (Windows, OSX and Linux) are available at this https URL.
I've used it. It's much faster than any other implementation, and very user-friendly. One thing to note is that the sign of the score in selscan is the opposite of the sign in the original iHS (Voight). In selscan, iHS > 2 suggests selection on the derived allele, and iHS < -2 suggests selection on the ancestral allele. In Voight's iHS, it's the other way around.
Thanks for the feedback. I just read the paper and that's true. I also want to estimate iHS on my data, but I don't know where to get the ancestral states for my SNPs. How did you do that?
From reading Voight et al. (the iHS creators), it seems that either positive or negative iHS scores can be a product of selection. So I am thinking of just running selscan and using the absolute iHS value in my region of interest. Would that be wrong?
Quote from Voight et al. about negative and positive iHS values:
In principle, we might expect that large negative iHS scores, indicating that a derived allele has swept up in frequency, are of the most interest. However, in simulations, a sweep can also produce large positive iHS values at nearby SNPs if ancestral alleles hitchhike with the selected site. Furthermore, it is plausible that selection may sometimes switch to favor an ancestral allele that has been segregating in the population. For these reasons, we will treat both extreme positive, and extreme negative iHS scores as potentially interesting.
It is true for both iHS (Voight) and selscan that negative and positive iHS scores are suggestive of selection. The sign tells you whether it is the ancestral or derived allele that underwent selection.
There are several options for determining the ancestral state; there is a 1000 Genomes ancestral alignment with ancestral alleles for each variant available using vcftools. There is another post covering this here.
Alternatively, you can use SeattleSeq annotation, which pulls the chimp allele for a list of SNPs that you provide in a variety of formats.
Hi,
What is the basis for choosing the threshold |iHS| > 2? Doesn't that depend on the dataset?
It would be great if you could clarify this for me.
When the rate of EHH decay is similar on the ancestral and derived alleles, iHH(A)/iHH(D) ~ 1, and hence the unstandardized iHS is ~ 0. Large negative values indicate unusually long haplotypes carrying the derived allele; large positive values indicate long haplotypes carrying the ancestral allele.
Note that unstandardized iHS is ln( iHH(A)/iHH(D) ), so a neutrally evolving region is expected to have an iHS of ~ 0. Standardized iHS is additionally normalized within allele-frequency bins so that it has mean 0 and variance 1. Therefore, |iHS| > 2 corresponds to a score more than 2 standard deviations away from the expectation under neutrality, already adjusted for allele frequency differences.
Yes, you could consider a lower threshold with the --crit-val flag (as far as I know, it belongs to norm, the normalization program shipped with selscan) if you expect that the population you are studying will not show such extreme extended haplotype homozygosity.
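To make the standardization concrete, here is a rough sketch of frequency-binned standardization followed by a critical-value cut. It is not selscan's or Voight's code. The function name, the number of bins, and the use of the population standard deviation are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_ihs(raw, freqs, n_bins=20, crit=2.0):
    """Standardize raw iHS within allele-frequency bins and flag extreme SNPs.

    raw:   unstandardized iHS per SNP, e.g. ln(iHH(A)/iHH(D)).
    freqs: derived allele frequency per SNP, in [0, 1].
    Returns (z, extreme), where extreme[i] is True when |z[i]| > crit.
    """
    bins = defaultdict(list)
    for i, f in enumerate(freqs):
        bins[min(int(f * n_bins), n_bins - 1)].append(i)

    z = [float("nan")] * len(raw)
    for idx in bins.values():
        vals = [raw[i] for i in idx]
        sd = pstdev(vals) if len(vals) > 1 else 0.0
        if sd == 0:
            continue  # bin too small or constant: leave its scores as nan
        mu = mean(vals)
        for i in idx:
            z[i] = (raw[i] - mu) / sd
    extreme = [abs(s) > crit for s in z]
    return z, extreme

# Toy example with two frequency bins.
z, extreme = standardize_ihs([0.0, 2.0, 1.0, 3.0], [0.1, 0.1, 0.6, 0.6], n_bins=2)
print(z, extreme)
```

Because the mean and standard deviation come from the dataset itself, the cutoff of 2 is relative to the genome-wide distribution of the population under study, which is why lowering `crit` can make sense for some datasets.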
Do you know if there is a faster implementation of iHS? The Evolution meeting is coming up and based on some napkin math I won't have my poster done because of iHS...
Split the data and run it in parallel. The ideal region size is about 2 Mb: large enough to calculate the EHH decay, but small enough to run in an hour or two.
I re-implemented iHS. My version runs several hundred times faster :-). The scores are comparable.
Would you please share your program with me? I'm trying to implement iHS but am wondering how to avoid disrupting the region.
Do you know C++?
I've learned some of the basics of C++.
Cool! If you post the code to GitHub and make it publicly available, you will be forever remembered as the saviour of all the people who tried to use iHS (and we may help you with the testing).
I am happy to share it with a few people who know the iHS metric well. Would you be willing to give it a test drive? I don't want to make it freely available until I am satisfied that it works for others, is easy to use, and produces sane values.
Cool! I'm willing to give it a test!
I would also like to test your program, if possible.
Cool, I'm very interested in trying your script, as the one I'm currently using (wiHS) gives me some weird results when I compare them to the Haplotter browser results. Do you also know whether it's better to use relative genetic positions instead of absolute ones? (i.e. absolute = constantly increasing, relative = distance to the previous SNP).
I know that Pritchard's iHS assumes monotonically increasing (absolute) genetic map values.
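If your map is stored as relative distances, converting it to absolute positions is just a cumulative sum. A small sketch with hypothetical helper names, assuming the first entry is the offset of the first SNP (0 if the map starts there):

```python
from itertools import accumulate

def relative_to_absolute(gaps):
    """Turn distance-to-previous-SNP values into cumulative map positions."""
    return list(accumulate(gaps))

def is_monotonic(positions):
    """True if the map positions never decrease from one SNP to the next."""
    return all(b >= a for a, b in zip(positions, positions[1:]))

gaps = [0.0, 0.5, 0.25, 1.0]
absolute = relative_to_absolute(gaps)
print(absolute, is_monotonic(absolute))
```

Running `is_monotonic` on the genetic map column before starting a long job is a cheap way to catch a map that was given in the wrong form.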
Would you please explain in detail how to split the data into 2 Mb regions? I'm wondering how to avoid disrupting a region that may carry a selection signal.
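One possible way to avoid cutting through a signal (an assumption on my part, not something described for any of the programs in this thread) is to give every chunk a flanking margin on both sides, run iHS on the padded window, and keep only the scores for SNPs inside the unpadded core. The sketch below generates such windows. The 500 kb flank and the function name are made up for illustration and should be tuned to how far EHH decays in your data.

```python
def overlapping_windows(start, end, core=2_000_000, flank=500_000):
    """Split [start, end) into core chunks, each padded by `flank` on both sides.

    Returns a list of (win_start, win_end, core_start, core_end) tuples.
    Run the scan on win_start..win_end, keep scores from core_start..core_end.
    """
    windows = []
    core_start = start
    while core_start < end:
        core_end = min(core_start + core, end)
        windows.append((max(start, core_start - flank),
                        min(end, core_end + flank),
                        core_start,
                        core_end))
        core_start = core_end
    return windows

for w in overlapping_windows(0, 5_000_000):
    print(w)
```

Each tuple can then be submitted as a separate job, so neighbouring jobs overlap by twice the flank but every SNP is scored exactly once.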
There was something wrong with my input files, which is why the XP-EHH scores came out as 'nan'...