Forum: A Database Of Signatures Of Selection In The 1000 Genomes Dataset

33

11.2 years ago

Giovanni M Dall'Olio 28k

The 1000 Genomes Selection Browser is a database of Signatures of Selection in the Human Genome, based on the 1000 Genomes Phase I data. It is freely accessible at http://hsb.upf.edu/

The browser, built on a custom UCSC Genome Browser installation, makes it easy to navigate the genome and visualize regions that are candidates for having undergone selection in any of the African, European, or Asian populations. The data can also be downloaded for further analysis here.

Our browser includes a total of 17 tests for selection. For each test of selection, we provide a raw score, plus a ranked score which compares each position to the rest of the genome.

Tajima’s D (Tajima, 1989): Comparison of estimates of the number of segregating sites and the mean pairwise difference between sequences.
CLR (Nielsen et al., 2005): Multilocus Composite Likelihood Ratio test.
Fay and Wu’s H (Fay & Wu, 2000): Comparison of the number of derived segregating sites at low and high frequencies and the number of variants at intermediate frequencies.
Fu and Li’s F* (Fu and Li, 1993): Comparison of the number of singleton mutations and the mean pairwise difference between sequences.
Fu and Li’s D* (Fu and Li, 1993): Comparison of the number of singleton mutations and the total number of nucleotide variants.
R2 (Ramos-Onsins and Rozas, 2002): Comparison of the difference between the number of singletons per sequence and the average number of nucleotide differences.
XP-EHH (Sabeti et al., 2007): Cross-population extended haplotype homozygosity.
Delta iHH (Voight et al., 2006; Grossman et al., 2010): Difference between two integrated haplotype homozygosity scores.
iHS (Voight et al., 2006): Log ratio between two integrated haplotype homozygosity scores.
EHH average (Sabeti et al., 2002): Extended haplotype homozygosity; weighted average for all core haplotypes of the position at which the haplotype homozygosity decays to <=0.5.
Wall’s B (Wall, 2000): Counts the number of pairs of adjacent segregating sites that are congruent (if the subset of the data consisting of the two sites contains only two different haplotypes)
Wall’s Q (Wall, 2000): Adds the number of partitions (two disjoint subsets whose union is the set of individuals in the sample) induced by congruent pairs to Wall’s B.
Fu’s Fs (Fu, 1997): Based on Ewens’ sampling distribution, taking into account the number of different haplotypes in the sample.
Dh (Nei, 1987): Summary statistic based on the number of different haplotypes in the sample.
Fst (Weir and Cockerham, 1984): Global and pairwise.
Delta DAF: Difference in derived allele frequency between two populations.
XP-CLR (Chen et al., 2010): Multilocus allele frequency differentiation between two populations.
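
To make the first entry in the list concrete, here is a minimal R sketch of Tajima's D for a single window, following the formula in Tajima (1989). It only illustrates the textbook statistic and is not the pipeline used to build the browser; the function name `tajimas_d`, its arguments, and the input values are hypothetical.

```r
# Hypothetical helper: Tajima's D for one genomic window.
#   n       number of sampled chromosomes
#   S       number of segregating sites in the window
#   pi_hat  mean number of pairwise differences between sequences
tajimas_d <- function(n, S, pi_hat) {
  i  <- seq_len(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)
  # Difference between the two estimators of theta, scaled by its
  # estimated standard deviation
  (pi_hat - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
}

# Made-up input: 10 chromosomes, 16 segregating sites,
# mean pairwise difference of 3.89
tajimas_d(n = 10, S = 16, pi_hat = 3.89)
```

A negative value means the mean pairwise difference is smaller than expected from the number of segregating sites, which corresponds to an excess of low-frequency variants.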

The database has been published in the NAR Database issue 2014:

Pybus M, Dall'olio GM, Luisi P, Uzkudun M, Carreño-Torres A, Pavlidis P, Laayouni H, Bertranpetit J, Engelken J. 1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans. Nucleic Acids Res. 2014 Jan 1;42(1):D903-9. doi: 10.1093/nar/gkt1188. Epub 2013 Nov 25. PubMed PMID: 24275494. Available at http://nar.oxfordjournals.org/content/42/D1/D903.short

For completeness, we also link to dbPSHP, a database of curated publications about positive selection in different human populations, which also presents the results of 15 tests for positive selection.

1000genomes human selection • 17k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 11.2 years ago by Giovanni M Dall'Olio 28k

1

Cool resource and nice post describing it!

ADD REPLY • link 11.1 years ago by Obi Griffith 20k

score 1 · Answer 1 · 2014-03-14

1

11.2 years ago

DG 7.3k

Awesome! I can already think of some interesting things I'd like to test out with this data!

ADD COMMENT • link 11.2 years ago by DG 7.3k

0

I am glad that you liked it :-) Feel free to ask any questions you may have, either here or on the website.

ADD REPLY • link 11.2 years ago by Giovanni M Dall'Olio 28k

score 1 · Answer 2 · 2014-03-28

1

11.1 years ago

Rubal7 ▴ 850

Great resource!

ADD COMMENT • link 11.1 years ago by Rubal7 ▴ 850

Ram · Answer 3 · 2015-01-05

1

10.3 years ago

yfwangbm ▴ 10

Hi Giovanni, great job. I checked the database and tried to download the iHS scores, but a few things are not clear to me. 1. Why are all the scores here positive? 2. Are these unstandardised or normalized iHS scores, and how is the normalization done? In addition, I tried to find the corresponding genetic distances for the 1000 Genomes variants. I saw that a genetic map is mentioned in the paper, so how do I add it?

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by yfwangbm ▴ 10

0

iHS scores are usually reported as absolute values, hence they are all positive. The sign of the statistic also depends on the equation used: in the original Voight et al. (2006) paper, for instance, negative values indicated selection on derived alleles. Also, unstandardised iHS values are largely useless; they should always be corrected for allele frequency, so standardised iHS values are what is usually reported.

ADD REPLY • link 10.3 years ago by JMR ▴ 160

0

From the supplemental information: Raw scores from ΔiHH, iHS and XP-EHH were standardized in bins of derived allele frequency (step size of 0.05) using the respective genome-wide distribution for each statistic. To capture signal from ancestral SNPs that have hitchhiked to high frequency along with a selected derived variant, absolute standardized iHS scores were chosen as the end result (24).
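
The binning described in that passage can be sketched in a few lines of base R. This is only an illustration, assuming each frequency bin is standardized to mean 0 and standard deviation 1 (the quoted text does not spell out the exact formula); the function name `standardize_by_daf`, its arguments, and the toy data are hypothetical.

```r
# Hypothetical helper: z-score raw selection scores within bins of
# derived allele frequency (DAF), with a step of 0.05 as in the quote.
standardize_by_daf <- function(score, daf, step = 0.05) {
  breaks <- seq(0, 1, by = step)
  # Bin index for each SNP; rightmost.closed puts DAF == 1 in the last bin
  bin <- findInterval(daf, breaks, rightmost.closed = TRUE)
  bin_mean <- ave(score, bin, FUN = mean)  # per-bin mean, repeated per SNP
  bin_sd   <- ave(score, bin, FUN = sd)    # per-bin standard deviation
  (score - bin_mean) / bin_sd
}

# Toy data (made up): raw scores whose mean and spread drift with DAF
set.seed(42)
daf <- runif(5000, min = 0.05, max = 0.95)
raw <- rnorm(5000, mean = 2 * daf, sd = 1 + daf)

z     <- standardize_by_daf(raw, daf)
abs_z <- abs(z)  # absolute standardized score, as reported for iHS
```

Standardizing within bins removes the dependence of the raw score on allele frequency, so that SNPs at different frequencies can be compared on the same scale.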

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by JMR ▴ 160

Ram · Answer 4 · 2015-10-21

1

9.6 years ago

Zev.Kronenberg 12k

I'm having trouble dumping CEU vs CHB XP-EHH. Are the tables limited to a single chromosome?

Also, I really dig the boosting.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.6 years ago by Zev.Kronenberg 12k

1

Hi Zev,

Are you downloading the file from the Table Browser? I think there is a limit on the number of rows that can be downloaded from there. For example, when I downloaded the whole file, I saw an error message towards the end saying "Reached output limit of 100000 data values".

The best way to get the data is to download them from this folder. The files contain both the scores and the log(pvalue).

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.6 years ago by Giovanni M Dall'Olio 28k

0

perfect. thank you very much.

ADD REPLY • link 9.6 years ago by Zev.Kronenberg 12k

0

One last question: are these data hg19? I ask because I know the original XP-EHH was hg18.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Zev.Kronenberg 12k

1

Yes, everything is hg19. Feel free to ask as many questions as you need ;-)

ADD REPLY • link 9.5 years ago by Giovanni M Dall'Olio 28k

0

Well, since you offered: the "p-values" for CEU-YRI are not bounded by zero and one. Are they Z-scores? I'm trying to get the joint probability of XP-EHH and dN/dS.

Thanks.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Zev.Kronenberg 12k

2

Hi Zev,

The p-values are simply the -log10 of the fraction of SNPs with an equal or higher score. For example, in R, using dplyr:

> library(dplyr)
> xpehh = read.table('XPEHH_CEU_vs_CHB.whole_genome.pvalues', header=TRUE, stringsAsFactors=FALSE, colClasses=c('character','integer', 'numeric', 'numeric', 'numeric'))
> xpehh %>%
    arrange(desc(score)) %>%   # Sort SNPs by XPEHH score (descending)
    mutate(
       rank=row_number(),           # rank of each SNP (1 = highest score)
       rank.perc=rank/n(),          # fraction of SNPs with an equal or higher score
       rank.log=-log10(rank.perc)   # the "p-value" reported in the files
    )
         snpID chromosome position    score   pvalue  rank    rank.perc rank.log
         (chr)      (int)    (dbl)    (dbl)    (dbl) (int)        (dbl)    (dbl)
1  rs116972803         15 48377866 8.018417 7.133374     1 7.355728e-08 7.133374
2   rs77517214         15 48377764 8.018099 6.832344     2 1.471146e-07 6.832344
3   rs75870250         15 48376241 7.942488 6.656253     3 2.206718e-07 6.656253
4  rs150960840         15 48379200 7.925932 6.531314     4 2.942291e-07 6.531314

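So the values are not bounded by zero and one because they are -log10-transformed rank fractions rather than raw p-values (strictly speaking, they are not Z-scores either). Sorry about the confusion :-)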
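
Because these scores are the -log10 of a rank fraction, they can be converted back to that fraction by exponentiating. A minimal sketch, assuming the `pvalue` column holds exactly the `rank.log` quantity computed in the snippet above; the helper name `score_to_fraction` is hypothetical.

```r
# Hypothetical helper: undo the -log10 transform to recover the
# fraction of SNPs ranked at or above a given SNP.
score_to_fraction <- function(neg_log10_score) 10^(-neg_log10_score)

# First row of the output above: pvalue = 7.133374
frac <- score_to_fraction(7.133374)
frac      # about 7.36e-08, matching rank.perc for that SNP
1 / frac  # approximate number of SNPs in the file, since that SNP has rank 1
```

The result is an empirical rank fraction, bounded by zero and one, rather than a p-value from a formal test. Whether it can be combined with another statistic depends on assumptions, such as independence, that are not covered here.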

ADD REPLY • link 9.5 years ago by Giovanni M Dall'Olio 28k

0

Giovanni M Dall'Olio, I've computed genome-wide iHS for the phase III 1000 Genomes Project data. Any interest in adding the data to the browser? It only took thousands of CPU hours ;-).

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Zev.Kronenberg 12k

0

Hi Zev, I think it would be amazing! Let me check with my colleagues at UPF to see how it can be done.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Giovanni M Dall'Olio 28k

0

Hey there, Any word on the phase III iHS values?

ADD REPLY • link 7.8 years ago by jglassbrook • 0

Ram · Answer 5 · 2016-01-04

Hi,

Great resource, thanks!

I have a quick question. Is there a way to determine an arbitrary FDR, say 2%, 2.5%, or 5% (instead of the default 1% in the downloads), for the boosting results? Or are there other values or methods to relax the significance thresholds derived from these datasets? For example, if the CEU Complete boosting threshold at 1% FDR is 0.40199, what would it be at 2.5% FDR? Would this require rerunning the analysis, or can a threshold be determined post hoc? Thanks again for this resource!

Ram · Answer 6 · 2015-10-21

0

9.6 years ago

Pierre ▴ 500

Glad that you dig the boosting.
About your question, I just tried it and could visualize all chromosomes. What problem do you face exactly? Is it with track visualization or tables?

Cheers

ADD COMMENT • link 9.6 years ago by Pierre ▴ 500

0

The viewing is fine. I'm trying to export the genome-wide XPEHH for CEU_VS_CHB. Every time I download the table it only has chr1.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.6 years ago by Zev.Kronenberg 12k

score 0 · Answer 7 · 2016-06-24

Hi Giovanni. Thanks, this is a great tool. I'm using it to look at some unlinked SNPs in genes with epistatic interactions. One of the SNPs is rs2906999; in hg19 it should be at chr7:76069811. However, in several tests focused on the interval around that SNP (iHS, Fst), that coordinate does not appear among the sites with a test statistic. That site is polymorphic in the 1000 Genomes phase 1 data and has a high minor allele frequency, so I expected it to show up in the test results (the other SNPs I am analyzing do appear in the test results).

I am having trouble figuring out where that one missing SNP might be. Can you help me figure that out? I realize that some rare polymorphisms were filtered out in developing the database, but these are common polymorphisms. Thanks for any help or suggestions (and I hope you are still monitoring this Biostars thread).

Cheers!