Hello.
I will have data where the ancestral and derived alleles are sometimes not encoded in the .ihshap file, i.e. the regions I'm interested in vary per individual. If I just take the regions of overlap, I lose a lot of data and am not sure the iHS score will be valid. Stitching the blocks of overlap together into one coherent .ihshap file would mean that SNPs on the borders of those blocks use SNPs from different blocks, sometimes physically very far away on the chromosome, when the EHH scores are integrated.
e.g.
Normal .ihshap input file for iHS:
1 0 1 0 1 0 0 0 0 1 1 1 0 1 1
0 1 1 0 1 0 1 0 1 0 0 0 0 1 0
0 1 0 1 0 1 0 0 0 0 1 0 0 0 1
1 0 0 0 1 0 1 1 1 1 1 0 0 0 1
My data:
? ? ? ? ? ? 0 0 0 1 1 1 0 1 1
0 1 1 0 ? ? ? ? ? ? 0 0 0 1 0
0 1 ? ? ? ? ? ? ? ? 1 0 0 0 1
1 0 0 0 1 0 ? ? ? ? ? ? 0 ? ?
Out of 100,000 SNPs I would have ~80-90% missing data.
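To make the overlap problem concrete, here is a minimal Python sketch run on the toy matrix above. The helper names (parse_haplotypes, missing_fraction, complete_columns) are my own and not part of any iHS tool, and I am assuming "?" marks a missing allele.

```python
# Toy version of "My data"; "?" is assumed to mean a missing allele.
MY_DATA = """\
? ? ? ? ? ? 0 0 0 1 1 1 0 1 1
0 1 1 0 ? ? ? ? ? ? 0 0 0 1 0
0 1 ? ? ? ? ? ? ? ? 1 0 0 0 1
1 0 0 0 1 0 ? ? ? ? ? ? 0 ? ?
"""


def parse_haplotypes(text):
    """Turn whitespace-separated rows into lists of 0/1, with None for '?'."""
    rows = [line.split() for line in text.strip().splitlines()]
    return [[None if tok == "?" else int(tok) for tok in row] for row in rows]


def missing_fraction(haps):
    """Fraction of all cells in the matrix that are missing."""
    total = sum(len(h) for h in haps)
    missing = sum(tok is None for h in haps for tok in h)
    return missing / total


def complete_columns(haps):
    """0-based indices of SNP columns observed in every haplotype."""
    n_sites = len(haps[0])
    return [j for j in range(n_sites) if all(h[j] is not None for h in haps)]


haps = parse_haplotypes(MY_DATA)
print(f"missing fraction: {missing_fraction(haps):.2f}")
print("columns observed in every haplotype:", complete_columns(haps))
```

On this toy example only one of the 15 columns is observed in all four rows, which is why restricting to the overlap throws away nearly everything.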
Thanks for your reply. In this case I have information on all genotypes, but am willing to exclude the majority of it so as to look only at the portions of each individual's genome that have the same ancestry (e.g. European). If I included all the data, iHS would be comparing SNPs from different ancestries, and my results wouldn't make sense in terms of "this SNP was positively selected for in Europeans". If my understanding is flawed, or if there is a reasonable way to implement missing-data support (short of writing my own iHS calculator, which I am considering), I would appreciate hearing it.
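For what it's worth, the kind of missing-data handling I have in mind would look roughly like the Python sketch below. It covers only the EHH step, and the rule it uses is my own assumption, not something any existing iHS program is known to do: a haplotype is dropped from the comparison as soon as the walk away from the core SNP hits a missing allele, and EHH is computed over whichever haplotypes are still observed. The function name ehh_one_side and the use of None for missing are hypothetical.

```python
from collections import Counter
from math import comb

# Toy "My data" matrix; None is assumed to mean a missing allele.
N = None
haps = [
    [N, N, N, N, N, N, 0, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, N, N, N, N, N, N, 0, 0, 0, 1, 0],
    [0, 1, N, N, N, N, N, N, N, N, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, N, N, N, N, N, N, 0, N, N],
]


def ehh_one_side(haps, core, allele, step):
    """EHH values walking away from the core SNP in direction step (+1 or -1).

    Assumption: a haplotype is dropped for good once it has a missing
    allele, and the walk stops when fewer than two haplotypes remain.
    """
    # Start with the haplotypes that carry the chosen allele at the core SNP.
    active = [i for i, h in enumerate(haps) if h[core] == allele]
    prefix = {i: () for i in active}  # alleles seen so far, per haplotype
    ehh = []
    pos = core + step
    while 0 <= pos < len(haps[0]):
        # Drop haplotypes with no data at this position.
        active = [i for i in active if haps[i][pos] is not None]
        if len(active) < 2:
            break
        for i in active:
            prefix[i] += (haps[i][pos],)
        # EHH = fraction of haplotype pairs that are still identical.
        classes = Counter(prefix[i] for i in active)
        same_pairs = sum(comb(c, 2) for c in classes.values())
        ehh.append(same_pairs / comb(len(active), 2))
        pos += step
    return ehh


right = ehh_one_side(haps, core=12, allele=0, step=+1)
left = ehh_one_side(haps, core=12, allele=0, step=-1)
print("EHH to the right:", right)
print("EHH to the left:", left)
```

With this much missing data the set of haplotypes being compared shrinks at almost every step, so I don't know whether the integrated score would still be comparable across SNPs, which is really my question.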