Hi,
I need to find the pairwise allelic distance between individuals from a PLINK (PED/MAP) file. I have written a Python code for it and it's taking a long time to compute the similarity distance matrix. My input file is huge (~25 GB) with 5000 individuals and ~2.6 million SNPs. I couldn't find any software package to compute the allelic distance and hence I decided to write my own code. This should be pretty straight forward in Matlab, but, because of the sheer volume of the data, I decided to go with Python (Though, I'm very new to Python and learning my trade around it). I am stating the problem here:
I have a file as such:
FAM001 1 0 0 1 2 A A G G A C
FAM002 2 0 0 1 2 A A A G 0 0
And the allelic distance metric is:
For each SNP count how many alleles differ between two samples at the same locus. If the first sample is AA and the second is AG, then the allelic distance is 1, if the first is AA and the second is GG, then it is two, if they are both the same, it is zero, etc. After finding this for all the locus which is about (~1.3 mil) Sum up over all SNPs and this should be the allelic distance. If there are missing entries in either sample, ignore that SNP. To normalize, we can divide the sum by the number of SNPS taken to compute this, to maintain generality.
My code is this:
import numpy as np
import difflib
lines = np.genfromtxt('data_ped.txt')
s = (len(lines),len(lines))
mat = np.zeros(s)
for I in np.arange(len(lines)):
val = np.split(lines[i],2)
j = i+1
for j in (np.arange(len(lines)-1)):
c_val = []
newval = np.split(lines[j],2)
for k in np.arange(len(val)):
matching = difflib.SequenceMatcher(None,a=val[k], b=newval[k])
match = matching.find_longest_match(0,len(matching.a),0,len(matching.b))
count = len(matching.a[match.a:match.a+match.size])
c_val.append(count)
(mat[i])[j] = sum(c_val)/len(val)
np.savetxt('opma.txt',mat)
This is taking awfully long to compute and I am not too sure whether this works as well. It'd be great if someone can interpret the problem and help me with this. Or, if anyone has any knowledge about a software that I can use to compute this, that'd be great as well.
Thanks!
Hello aritra90!
It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/31767307
This is typically not recommended as it runs the risk of annoying people in both communities.
Thanks for bringing my attention to this. I wasn't sure where to post this question as I haven't used the forums extensively before. Sorry about that. I'll remove the other post.