iHS Score Calculation
11.5 years ago
zy041225 ▴ 70

Hi, I'm new to bioinformatics. I'm trying to calculate iHS and have run into some problems.

I'm trying to run WHAMM, but I cannot find the input file formats on its website. Could anyone provide an example?

As for genetic distance, all I have right now are the physical positions of the SNPs, and the genotypes are unphased. I'm trying to estimate the genetic distances with PHASE from here.

I also have a reference genetic map, but my data contain more SNPs than the reference does. What do you suggest I do?

THANK YOU!
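
One common way to handle SNPs that are missing from a reference genetic map is to interpolate their genetic positions linearly between the two flanking map markers. The sketch below is only an illustration of that idea, not something taken from WHAMM or PHASE: the function name, the toy map values, and the choice to clamp SNPs lying outside the map to the nearest map value are all assumptions.

```python
from bisect import bisect_right

def interpolate_genetic_positions(map_bp, map_cm, query_bp):
    """Linearly interpolate cM positions for SNPs missing from a reference map.

    map_bp and map_cm are parallel lists from the reference map, with map_bp
    strictly increasing. Queries outside the map are clamped to the first or
    last map value (an assumption made for this sketch).
    """
    result = []
    for bp in query_bp:
        if bp <= map_bp[0]:
            result.append(map_cm[0])
        elif bp >= map_bp[-1]:
            result.append(map_cm[-1])
        else:
            j = bisect_right(map_bp, bp)  # first map index with position > bp
            x0, x1 = map_bp[j - 1], map_bp[j]
            y0, y1 = map_cm[j - 1], map_cm[j]
            result.append(y0 + (y1 - y0) * (bp - x0) / (x1 - x0))
    return result

# Toy reference map (made-up numbers) and three SNPs, two of them not in the map.
ref_bp = [1000, 2000, 4000]
ref_cm = [0.0, 0.10, 0.50]
print(interpolate_genetic_positions(ref_bp, ref_cm, [1000, 1500, 3000]))
```

Because the interpolated values never decrease when the query positions are sorted, the result stays compatible with tools that expect a monotonically increasing genetic map.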

snp • 11k views
ADD COMMENT
2
11.5 years ago

Hi, the WHAMM project is discontinued, and the iHS script posted there is old and should not be used.

You can find a better version of the iHS script on this page: http://hgdp.uchicago.edu/Software/ (note that this page is a subfolder of the HGDP browser).

This version of the script comes with two example input files that you can use to see how the data must be formatted. Note that the input format for the iHS script is tricky to get right, because you have to reproduce exactly the same tabs and spaces as in the example files.

P.S. The same folder contains a script for XP-EHH, which is a cross-population neutrality test based on the same principle as iHS. It can be used to compare two populations, detecting sweeps that occurred in one population but not in the other.

ADD COMMENT
0

Do you know if there is a faster implementation of iHS? The Evolution meeting is coming up and based on some napkin math I won't have my poster done because of iHS...

ADD REPLY
0

Split the data and run it in parallel. The ideal region size is about 2 Mb: large enough to calculate the EHH decay, but small enough to run in an hour or two.
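
As a rough illustration of the splitting step, the sketch below groups sorted SNP positions into consecutive 2 Mb windows, each of which could then be submitted as a separate job. The function name and the `flank_bp` parameter are hypothetical: carrying extra flanking SNPs in each window is an assumption about how one might keep EHH decay near a window edge from being cut off, not something the iHS script itself provides.

```python
def split_into_windows(positions, window_bp=2_000_000, flank_bp=0):
    """Group sorted SNP positions (bp) into consecutive windows.

    Returns a list of (start, end, snps) tuples. Each window is meant to score
    the SNPs in [start, end), but it also carries SNPs within flank_bp on
    either side (hypothetical padding) so haplotypes can extend past the edge.
    Windows without any SNP in [start, end) are skipped.
    """
    if not positions:
        return []
    windows = []
    start = positions[0]
    last = positions[-1]
    while start <= last:
        end = start + window_bp
        snps = [p for p in positions if start - flank_bp <= p < end + flank_bp]
        if any(start <= p < end for p in snps):
            windows.append((start, end, snps))
        start = end
    return windows

# Toy positions: three windows, the first one padded by 600 kb on each side.
for start, end, snps in split_into_windows(
    [0, 1_000_000, 2_500_000, 5_000_000], flank_bp=600_000
):
    print(start, end, snps)
```

With flanks, only the scores of SNPs inside `[start, end)` should be kept from each job, so that every SNP is reported exactly once.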

ADD REPLY
2

I re-implemented iHS. My version runs several hundred times faster :-). The scores are comparable.

ADD REPLY
0

Would you please share your programme with me? I'm trying to implement iHS, but I'm wondering how to split the data without disrupting a region.

ADD REPLY
0

Do you know C++?

ADD REPLY
0

I've learned some of the basics of C++.

ADD REPLY
0

Cool! If you post the code to GitHub and make it publicly available, you will be forever remembered as the saviour of all the people who tried to use iHS (and we may help you with the testing).

ADD REPLY
0

I am happy to share it with a few people who know the iHS metric well. Would you be willing to give it a test drive? I don't want to make it freely available until I am satisfied that it works for others, is easy to use, and produces sane values.

ADD REPLY
0

Cool! I'm willing to give it a test!

ADD REPLY
0

I would also like to test your program, if possible.

ADD REPLY
0

Cool, I'm very interested in trying your script, as the one I'm currently using (wiHS) gives me some weird results when I compare them to the Haplotter browser results. Do you also know whether it's better to use relative genetic positions instead of absolute ones? (i.e. absolute = constantly increasing, relative = distance to the previous SNP).

ADD REPLY
0

I know that Pritchard's iHS assumes monotonically increasing (absolute) genetic map values.
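
If a map is stored as relative distances (each value being the distance to the previous SNP), a cumulative sum turns it into the absolute, monotonically increasing form. This is a minimal sketch with hypothetical function names; it assumes the first relative value is the offset of the first SNP from the map origin (0.0 in the toy data).

```python
from itertools import accumulate

def relative_to_absolute(relative_cm):
    """Convert per-SNP distances to the previous SNP into cumulative positions."""
    return list(accumulate(relative_cm))

def is_monotonic(absolute_cm):
    """Check that map values never decrease along the chromosome."""
    return all(b >= a for a, b in zip(absolute_cm, absolute_cm[1:]))

relative = [0.0, 0.5, 0.25]          # toy distances, in cM
absolute = relative_to_absolute(relative)
print(absolute, is_monotonic(absolute))
```

Running `is_monotonic` on a map file before starting a long job is a cheap way to catch a relative map passed in by mistake.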

ADD REPLY
0

Would you please explain in more detail how to split the data into 2 Mb regions? I'm wondering how to avoid disrupting a region that may carry a selection signal.

ADD REPLY
0

P.S. Do you know how to calculate XP-EHH? I ran into a problem where all the scores I got were 'nan'. The software I used is from http://hgdp.uchicago.edu/Software/

ADD REPLY
0
  1. Take some core SNP
  2. Calculate EHH values in a region surrounding the SNP
  3. Find the integral.

This value is called iHH. XP-EHH is the log ratio of the two populations' iHH values.
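
The three steps above can be sketched as follows. This is a simplified illustration, not the code from the HGDP scripts: all function names are hypothetical, haplotypes are plain lists of 0/1 alleles, the integral uses the trapezoid rule over every site supplied (published implementations typically stop integrating once EHH drops below a small cutoff, which is omitted here), and returning `nan` when an area is not positive is a choice made for this example.

```python
import math
from collections import Counter

def ehh(haplotypes, core, site):
    """Probability that two random haplotypes are identical from core to site."""
    lo, hi = min(core, site), max(core, site)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    n = len(haplotypes)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def ihh(haplotypes, positions, core):
    """Trapezoid-rule integral of EHH over the genetic positions of all sites."""
    values = [ehh(haplotypes, core, i) for i in range(len(positions))]
    area = 0.0
    for i in range(1, len(positions)):
        width = positions[i] - positions[i - 1]
        area += 0.5 * (values[i - 1] + values[i]) * width
    return area

def log_ratio(ihh_a, ihh_b):
    """Unstandardized log ratio of two iHH values; nan if either is not positive."""
    if ihh_a <= 0 or ihh_b <= 0:
        return float("nan")
    return math.log(ihh_a / ihh_b)

# Toy data: population A is fully homozygous, population B is not.
positions = [0.0, 0.1, 0.2]                      # genetic positions, made up
pop_a = [[0, 1, 0]] * 4
pop_b = [[0, 1, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1]]
print(log_ratio(ihh(pop_a, positions, 1), ihh(pop_b, positions, 1)))
```

In the toy data the longer shared haplotypes in population A give it the larger iHH, so the log ratio is positive.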

Possible reasons for 'nan':
  1. Somehow the ratio is negative (something is wrong in the program, since areas shouldn't be negative here).
  2. The population in the numerator has an iHH of 0. This doesn't seem probable (or even possible?), although I know Pritchard's program only considers SNPs with a certain derived allele frequency.

Are you sure your input files are correctly formatted?

ADD REPLY
0

It turns out there was something wrong with my input, which is what caused the 'nan'...

ADD REPLY
2
10.3 years ago
Nick Crawford ▴ 210

There's a new (2014) implementation of most of the 'EHH' algorithms that looks promising. The main caveat is that I haven't actually used it yet.

Haplotype-based scans to detect natural selection are useful to identify recent or ongoing positive selection in genomes. As both real and simulated genomic datasets grow larger, spanning thousands of samples and millions of markers, there is a need for a fast and efficient implementation of these scans for general use. Here we present selscan, an efficient multi-threaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score (iHS), and Cross-population Extended Haplotype Homozygosity (XPEHH). selscan accepts phased genotypes in multiple formats, including TPED, and performs extremely well on both simulated and real data and over an order of magnitude faster than existing available implementations. It calculates iHS on chromosome 22 (22,147 loci) across 204 CEU haplotypes in 353s on one thread (33s on 16 threads) and calculates XPEHH for the same data relative to 210 YRI haplotypes in 578s on one thread (52s on 16 threads). Source code and binaries (Windows, OSX and Linux) are available at this https URL.

ADD COMMENT
0

I've used it. It's much faster than any other implementation, and very user-friendly. One thing to note is that the sign of the scores in selscan is the opposite of that in Voight's iHS. In selscan, iHS > 2 is selection on the derived allele, and iHS < -2 is selection on the ancestral allele. In Voight's iHS, it's the opposite.

ADD REPLY
0

Thanks for the feedback. I just read the paper and that's true. I also want to estimate iHS on my data, but I don't know where to get the ancestral states for my SNPs. How did you do that?

From reading Voight et al. (the iHS creators), it seems that either positive or negative iHS scores can be a product of selection. So I am thinking of just running selscan and using the absolute iHS value in my region of interest. Would that be wrong?

Quote from Voight et al. about negative and positive iHS values:

In principle, we might expect that large negative iHS scores, indicating that a derived allele has swept up in frequency, are of the most interest. However, in simulations, a sweep can also produce large positive iHS values at nearby SNPs if ancestral alleles hitchhike with the selected site. Furthermore, it is plausible that selection may sometimes switch to favor an ancestral allele that has been segregating in the population. For these reasons, we will treat both extreme positive, and extreme negative iHS scores as potentially interesting.

ADD REPLY
0

It is true for both iHS (Voight) and selscan that negative and positive iHS scores are suggestive of selection. The sign tells you whether it is the ancestral or derived allele that underwent selection.

There are several options for determining the ancestral state. One is the 1000 Genomes ancestral alignment, which provides ancestral alleles for each variant and can be used with vcftools. There is another post covering this here.

Alternatively, you can use SeattleSeq annotation, which pulls the chimp allele for a list of SNPs that you provide in a variety of formats.

ADD REPLY
0

Hi, what is the basis for choosing the threshold |iHS| > 2? Doesn't that depend on the dataset? It would be really great if you could clarify this.

Thanks in advance!

ADD REPLY
0

@shreyajha From Voight et al. 2006:

When the rate of EHH decay is similar on the ancestral and derived alleles, iHH(A)/iHH(D) ~ 1, and hence the unstandardized iHS is ~ 0. Large negative values indicate unusually long haplotypes carrying the derived allele; large positive values indicate long haplotypes carrying the ancestral allele

Note that unstandardized iHS is ln( iHH(A)/iHH(D) ), so a neutrally evolving region is expected to have an iHS of ~ 0. Standardized iHS is adjusted for allele frequency and has mean 0 and variance 1. Therefore, |iHS| > 2 represents a score more than 2 standard deviations away from the neutral expectation, already adjusted for allele frequency differences.

Yes, you could consider a lower threshold with the --crit-val flag (in norm, the normalization program distributed with selscan) if you believe the population you are studying will not show such extreme extended haplotype homozygosity.
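
To make the standardization concrete, here is a minimal sketch of the idea: compute ln( iHH(A)/iHH(D) ) per SNP, then subtract the mean and divide by the standard deviation of the scores among SNPs with a similar derived allele frequency. The function names, the equal-width frequency bins, and the `n_bins` and `crit_val` parameters are assumptions for illustration; this is not selscan's or Voight's actual code.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def unstandardized_ihs(ihh_ancestral, ihh_derived):
    """ln( iHH(A) / iHH(D) ), following the sign convention of Voight et al."""
    return math.log(ihh_ancestral / ihh_derived)

def standardize_by_frequency(scores, derived_freqs, n_bins=20):
    """Z-score each SNP against SNPs in the same derived-allele-frequency bin."""
    bins = defaultdict(list)
    for i, freq in enumerate(derived_freqs):
        bins[min(int(freq * n_bins), n_bins - 1)].append(i)
    result = [float("nan")] * len(scores)   # nan where a bin has no spread
    for members in bins.values():
        values = [scores[i] for i in members]
        mu, sd = mean(values), pstdev(values)
        if sd > 0:
            for i in members:
                result[i] = (scores[i] - mu) / sd
    return result

def extreme_snps(standardized, crit_val=2.0):
    """Indices of SNPs whose |standardized iHS| exceeds the chosen threshold."""
    return [i for i, z in enumerate(standardized) if abs(z) > crit_val]

# Toy scores and frequencies (made up), standardized within two bins.
z = standardize_by_frequency([0.0, 2.0, 1.0, 3.0], [0.1, 0.1, 0.9, 0.9], n_bins=2)
print(z, extreme_snps(z, crit_val=0.5))
```

Because the scores are standardized within each dataset, the threshold is relative to that dataset's own genome-wide distribution, which is why a different cutoff can be reasonable for a different population.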

ADD REPLY
0
11.5 years ago
zy041225 ▴ 70

PHASE is too slow for calculating recombination rates. Are there any suggestions?

Also, I'm trying to use LDhat, but I've run into a problem with the lookup table.

ADD COMMENT
