P.S: I need to use Python for this.
I have a text file which looks like this:
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
I want to compare the consecutive columns and return the number of matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but its very slow, as I have huge data to process. I believe Python would be a faster solution to this. But, I am very new to Python and I already have something like this:
for line in open("phased.txt"):
columns = line.split("\t")
for I in range(len(columns)-1):
a = columns[i+3]
b = columns[i+4]
for j in range(len(a)):
if a[j] != b[j]:
print j
which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This is code is completely wrong and I guess I could use difflib, etc. But, I have never proficiently coded in Python before, so, skeptical to proceed)
I want to compare and return the number of non matching elements in each column(starting from the third) to every other column in the file. I have 828 columns in totality. Hence I would need 828828 number of outputs. (You can think of a nn matrix where the (i,j)th element would be the number of non matching elements between them. My desired output in case of the above snippet would be:
3 4: 1
3 5: 3
3 6: 3
..
..
..
4 6: 3
..
Any help on this would be appreciated. Thanks.
0 elements are considered as NA values and so not included in non matching counts, right? And what about an R coding approach instead of python?