Compare consecutive columns of a file and return the number of non-matching elements
9.5 years ago
aritra90 ▴ 70

P.S: I need to use Python for this.

I have a text file which looks like this:

# sampleID  HGDP00511  HGDP00511   HGDP00512   HGDP00512   HGDP00513  HGDP00513   
M rs4124251       0       0            A            G          0          A
M rs6650104       0       A            C            T          0          0
M rs12184279      0       0            G            A          T          0

I want to compare consecutive columns and return the number of non-matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but it's very slow, as I have a huge amount of data to process. I believe Python would be a faster solution. But I am very new to Python, and so far I only have something like this:

for line in open("phased.txt"):
    columns = line.split("\t")

    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j

which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This code is completely wrong, and I guess I could use difflib, etc. But I have never really coded in Python before, so I am hesitant to proceed.)

I want to compare each column (starting from the third) with every other column in the file and return the number of non-matching elements. I have 828 columns in total, so I would need 828 × 828 outputs. (You can think of an n × n matrix where the (i, j)th element is the number of non-matching elements between columns i and j.) My desired output for the above snippet would be:

3 4: 1
3 5: 3
3 6: 3
..
..
..
4 6: 3
..

Any help on this would be appreciated. Thanks.
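
To make the n × n picture concrete, here is a minimal sketch on the three-row snippet above. The rows are hard-coded for illustration, and the names `columns` and `matrix` are just placeholders, not anything from the original script. Since the count for (i, j) equals the count for (j, i), only pairs with i < j need to be computed and the other half can be mirrored.

```python
rows = [
    "M rs4124251 0 0 A G 0 A",
    "M rs6650104 0 A C T 0 0",
    "M rs12184279 0 0 G A T 0",
]

# Drop the two label fields and transpose rows into columns.
columns = list(zip(*(row.split()[2:] for row in rows)))
n = len(columns)

# matrix[i][j] = number of rows where column i and column j differ.
matrix = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        diff = sum(a != b for a, b in zip(columns[i], columns[j]))
        matrix[i][j] = matrix[j][i] = diff  # symmetric, so fill both halves

for row in matrix:
    print(*row)
```

Index 0 here corresponds to file column 3, so `matrix[0][1]` is the "3 4" entry of the desired output.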


Elements equal to 0 are treated as NA values and so are not included in the non-matching counts, right? And what about an R approach instead of Python?
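
Note that the desired output in the question ("3 4: 1") only works out if 0 is counted like any other value. If 0 is instead meant as a missing call, a variant along these lines would skip those positions. This is only a sketch under that assumption; `count_mismatches` and its `missing` parameter are hypothetical names, not something from the thread.

```python
def count_mismatches(col_a, col_b, missing="0"):
    """Count positions where two columns differ, skipping missing calls."""
    return sum(
        a != b
        for a, b in zip(col_a, col_b)
        if a != missing and b != missing  # ignore the pair if either is missing
    )


# Columns 5, 6 and 8 of the snippet in the question.
col5 = ["A", "C", "G"]
col6 = ["G", "T", "A"]
col8 = ["A", "0", "0"]

print(count_mismatches(col5, col6))
print(count_mismatches(col5, col8))
```

Passing `missing=None` gives back the plain count, because no string element ever equals `None`.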

9.5 years ago
george.ry ★ 1.2k

If I understand what you're after correctly, then:

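with open('test.tsv') as f:
    # Header line: the first two fields are labels, the rest are sample columns
    line = f.readline().strip().split()
    num_samples = len(line)-2
    samples = [[] for _ in range(num_samples)]  # one list per sample column
    for line in f:
        line = line.strip().split()
        # Append each value in this row to the list for its column
        for s, sample in zip(line[2:], samples):
            sample.append(s)

# Compare every pair of columns once (i < j); +3 gives the 1-based file column number
for i, sample in enumerate(samples[:-1]):
    for j in range(i+1, num_samples):
        print(i+3, j+3, sum(a != b for a, b in zip(sample, samples[j])))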

If you're using Python 2, put these two lines at the very top of the file (the __future__ import has to come before any other import):

from __future__ import print_function
from itertools import izip as zip

George,

Can't thank you enough. You were spot on! This increased my interest in Python :)

9.5 years ago
Aerval ▴ 290

I would do something like this:

rows = []
with open("phased.txt") as f:
    for line in f:
        rows.append(line.strip().split("\t"))

# Compare every pair of data rows and count the positions where they agree
for i, rowi in enumerate(rows[1:]): # skipping the first row because it is the column header
    for j, rowj in enumerate(rows[1:]):
        matches = 0
        for n in range(len(rowi)-2): # skipping the first two columns
            if rowi[n+2] == rowj[n+2]:
                matches += 1
        print(i, j, matches)

Note that I am not sure whether this is much faster than Bash (especially because it's just printing and not writing to a file).
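
On the printing versus writing point, here is a minimal sketch that sends the pairwise counts to a file-like object instead of stdout. It compares columns (as the question asks) by transposing the rows first. The function name `write_pairwise_mismatches` and the `first_data_col` parameter are made up for illustration, and the demo uses `io.StringIO` in place of real files so that it runs on its own.

```python
import io
from itertools import combinations


def write_pairwise_mismatches(lines, out, first_data_col=2):
    """Write 'i j count' for every pair of data columns.

    `lines` is any iterable of text lines (an open file works);
    `out` is any object with a write() method.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header line
    # Split each non-empty row and drop the leading label fields.
    table = [line.split()[first_data_col:] for line in rows if line.strip()]
    columns = list(zip(*table))  # transpose: one tuple per column
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        count = sum(x != y for x, y in zip(a, b))
        # +1 for 1-based numbering, +first_data_col for the skipped fields
        out.write(f"{i + first_data_col + 1} {j + first_data_col + 1} {count}\n")


sample = io.StringIO(
    "# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513\n"
    "M rs4124251 0 0 A G 0 A\n"
    "M rs6650104 0 A C T 0 0\n"
    "M rs12184279 0 0 G A T 0\n"
)
result = io.StringIO()
write_pairwise_mismatches(sample, result)
print(result.getvalue())
```

With real files, the same call would look like `write_pairwise_mismatches(open("phased.txt"), open("out.txt", "w"))`, ideally wrapped in `with` blocks so both files get closed; "out.txt" is a placeholder name.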


Thanks for the help, much appreciated!
