Question

Compare consecutive columns of a file and return the number of non-matching elements


10.2 years ago

aritra90 ▴ 70

P.S: I need to use Python for this.

I have a text file which looks like this:

# sampleID  HGDP00511  HGDP00511   HGDP00512   HGDP00512   HGDP00513  HGDP00513   
M rs4124251       0       0            A            G          0          A
M rs6650104       0       A            C            T          0          0
M rs12184279      0       0            G            A          T          0

I want to compare consecutive columns and return the number of non-matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but it's very slow, as I have a huge amount of data to process. I believe Python would be a faster solution. However, I am very new to Python, and so far I have something like this:

for line in open("phased.txt"):
    columns = line.split("\t")

    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j

which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This code is completely wrong, and I guess I could use difflib, etc., but I have never coded proficiently in Python before, so I am hesitant to proceed.)

I want to compare each column (starting from the third) with every other column in the file and return the number of non-matching elements for each pair. I have 828 columns in total, so I would need 828×828 outputs. (You can think of an n×n matrix where the (i,j)th element is the number of non-matching elements between columns i and j.) My desired output for the above snippet would be:

3 4: 1
3 5: 3
3 6: 3
..
..
..
4 6: 3
..

Any help on this would be appreciated. Thanks.

Haplotype Beagle Python • 8.1k views



0 elements are considered NA values and so are not included in the non-matching counts, right? And what about an R approach instead of Python?
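Whether 0 marks a missing call is not confirmed in the question; its desired output counts 0 against a letter as a mismatch. A minimal Python sketch that supports both readings, where the function name count_mismatches and the skip_missing switch are hypothetical names introduced only for illustration:

```python
def count_mismatches(col_a, col_b, missing="0", skip_missing=False):
    """Count positions where two equal-length columns differ.

    With skip_missing=True, positions where either value equals `missing`
    are ignored. Treating "0" as missing is an assumption, not something
    the question states.
    """
    total = 0
    for a, b in zip(col_a, col_b):
        if skip_missing and (a == missing or b == missing):
            continue
        if a != b:
            total += 1
    return total


# Columns 3 and 5 of the sample data in the question
col3 = ["0", "0", "0"]
col5 = ["A", "C", "G"]
print(count_mismatches(col3, col5))                     # 3
print(count_mismatches(col3, col5, skip_missing=True))  # 0
```

With the default setting the result agrees with the "3 5: 3" line of the desired output; with skip_missing=True every position involving a 0 is dropped from the count.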

written 10.2 years ago by Nicola Casiraghi ▴ 500


10.2 years ago

Aerval ▴ 290

I would do something like this:

rows = []
with open("phased.txt") as f:
    for line in f:
        rows.append(line.strip().split("\t"))

# Compare every data row with every other data row and count matching fields
for i, rowi in enumerate(rows[1:]): # skipping the first row because it is the column description
    for j, rowj in enumerate(rows[1:]):
        matches = 0
        for n in range(len(rowi)-2): # skipping the first two columns
            if rowi[n+2] == rowj[n+2]:
                matches += 1
        print(i, j, matches)

Note that I am not sure whether this is much faster than Bash (especially because it's just printing and not writing to a file).
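The loop above pairs up rows (markers) and counts matches, whereas the question asks for mismatch counts between sample columns. A minimal sketch of the column-wise variant, assuming whitespace-separated fields and the same layout as the sample (one header row, two label columns); the function name column_mismatches and the inline SAMPLE_LINES data are introduced here for illustration only:

```python
def column_mismatches(lines):
    """Yield (col_i, col_j, mismatches) for every pair of sample columns.

    Column numbers are 1-based and start at 3, as in the question.
    """
    rows = [line.split() for line in lines if line.strip()]
    data = rows[1:]                    # drop the header row
    # zip(*data) turns rows into columns; [2:] drops the two label columns
    columns = list(zip(*data))[2:]
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            diff = sum(a != b for a, b in zip(columns[i], columns[j]))
            yield i + 3, j + 3, diff


# The three data rows from the question, inlined so the sketch is self-contained
SAMPLE_LINES = [
    "# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513",
    "M rs4124251 0 0 A G 0 A",
    "M rs6650104 0 A C T 0 0",
    "M rs12184279 0 0 G A T 0",
]

for i, j, diff in column_mismatches(SAMPLE_LINES):
    print("%d %d: %d" % (i, j, diff))
```

For a real file, passing the open file object (for example open("phased.txt")) in place of SAMPLE_LINES should work the same way, since the function only iterates over lines. On the sample data the first three lines printed are "3 4: 1", "3 5: 3" and "3 6: 3", which agrees with the desired output in the question.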



Thanks for the help, much appreciated!

written 10.2 years ago by aritra90 ▴ 70

Ram · Accepted Answer · 2015-06-05


10.2 years ago

george.ry ★ 1.2k

If I understand what you're after correctly, then:

with open('test.tsv') as f:
    line = f.readline().strip().split()
    num_samples = len(line)-2
    samples = [[] for _ in range(num_samples)]
    for line in f:
        line = line.strip().split()
        for s, sample in zip(line[2:], samples):
            sample.append(s)

for i, sample in enumerate(samples[:-1]):
    for j in range(i+1, num_samples):
        print(i+3, j+3, sum(a != b for a, b in zip(sample, samples[j])))

If you're using Python 2, put these lines at the top of the script first (the __future__ import has to come before any other import):

from __future__ import print_function
from itertools import izip as zip
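To collect the counts as the n×n matrix described in the question instead of printing one pair per line, something along these lines might work. It is only a sketch: the function name mismatch_matrix is hypothetical, and it assumes a samples list of per-column value lists like the one built in the code above.

```python
from itertools import combinations


def mismatch_matrix(samples):
    """Return an n x n list of lists; entry [i][j] is the number of
    positions at which sample i and sample j differ."""
    n = len(samples)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        diff = sum(a != b for a, b in zip(samples[i], samples[j]))
        matrix[i][j] = matrix[j][i] = diff  # the count is symmetric
    return matrix


# Columns 3, 4 and 5 of the sample data in the question
samples = [["0", "0", "0"], ["0", "A", "0"], ["A", "C", "G"]]
for row in mismatch_matrix(samples):
    print("\t".join(map(str, row)))
```

Because the count is symmetric and the diagonal is always zero, only n(n-1)/2 comparisons are computed; for 828 columns that is 342,378 pairs rather than 828×828.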



George,

Can't thank you enough. You were spot on! This increased my interest in Python :)

written 10.2 years ago by aritra90 ▴ 70