Question

Comparing two columns of one file with two columns of another file, print all the matching in one file and mismatching in another file. Python or Bash ?

0

Entering edit mode

5.8 years ago

Thanujay S • 0

Hello all! SO guys have two files, both of them are tab-separated text files, trying to merge both the files based on two columns. The two files are unsorted and doesn't contain a header. Another thing is that Final.tsv is large containing about 2 million rows.

final.tsv

ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000028  Cryptorchidism  MONDO:0010645   oculocerebrorenal syndrome 
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000083  Renal insufficiency MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000091  Abnormal renal tubule morphology    MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000093  Proteinuria MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000121  Nephrocalcinosis    MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000164  Abnormality of the dentition    MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000189  Narrow palate   MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000194  Open mouth  MONDO:0010645   oculocerebrorenal syndrome
ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000219  Thin upper lip vermilion    MONDO:0010645   oculocerebrorenal syndrome

om.tsv

309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000028  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000083  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000093  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000501  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000505  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000519  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000568  OMIM:309000 XLR
309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000718

So the task here is match the 6th and 3rd column of the final.tsv file with the 2nd and 3rd column of the om.tsv file. On matching of the two columns, the two files should be merged and saved in a match file. If there is not matched the entire line should be printed into another mismatch file. Also note that i need a case-insensitive approach where the match is based on keywords.

Acc to the above eg oculocerebrorenal syndrome should match LOWE OCULOCEREBRORENAL SYNDROME.

Output

  ClinVarVariant:208014 OCRL:exon 6-12 del  HP:0000028  Cryptorchidism  MONDO:0010645   oculocerebrorenal syndrome  309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000028  OMIM:309000 XLR

A lot of different approaches like awk, join and even a few pandas approaches were tried to solve this complexity. Any suggestions! Thanks in advance! :)

awk join python matching • 3.9k views

ADD COMMENT • link updated 5.8 years ago by massa.kassa.sc3na ▴ 650 • written 5.8 years ago by Thanujay S • 0

0

Entering edit mode

Hi, do you mean that the 6th and 3rd column to act like key (e.g. "HP:0000028 - oculocerebrorenal syndrome" should only match "LOWE OCULOCEREBRORENAL SYNDROME - HP:0000028" and not "LOWE OCULOCEREBRORENAL SYNDROME - HP:0000083").

Since in the om.tsv you have different key in the 2nd column then in the 6th column from final.tsv, what are other examples (or how do you define a match? Is the key in om.tsv always longer then in final.tsv)?

ADD REPLY • link 5.8 years ago by massa.kassa.sc3na ▴ 650

0

Entering edit mode

Hey! Yeah exactly. It should act like a key and only match where both the name and the HP ID matches. In om.tsv the combination of these columns are scattered, I wanted to find the combination and print it along with the final.tsv. So the two columns of final.tsv will act as a key and would search for the rows in om.tsv, print the matches along final.tsv row as a third file.

ADD REPLY • link 5.8 years ago by Thanujay S • 0

0

Entering edit mode

Hello Thanujay S,

In your example we do not really see where are the spaces and where are the tabs. I can suggest you to use semi colon to better represent your example.

Also,

Acc to the above eg oculocerebrorenal syndrome should match LOWE OCULOCEREBRORENAL SYNDROME.

Can you detail what you expect to be a match and what is not. If you have oculocerebrorenal syndrome and NOT OCULOCEREBRORENAL SYNDROME, you expect it to be a match ? Or oculocerebrorenal should match with OCULOCEREBRORENAL SYNDROME ?

There are some example here using semi colon as separator and the merge is done on 3 columns and also some there

ADD REPLY • link 5.8 years ago by Bastien Hervé 6.4k

0

Entering edit mode

Okay! According to the example ive put forward, I meant that oculocerebrorenal syndrome is a keyword and even if LOWE OCULOCEREBRORENAL SYNDROME is present, it should act as a match. And the suggestions you give must be case insensitive, as om.tsv has all uppercase in the name column and final.tsv has lowercase. Thank you for the links! :)

ADD REPLY • link 5.8 years ago by Thanujay S • 0

0

Entering edit mode

Hi @Thanujay S, but what should be considered as a match? Do om.tsv always contain the longer string (e.g. LOWE OCULOCEREBRORENAL SYNDROME) and final.tsv its substring (e.g. OCULOCEREBRORENAL SYNDROME)?

Without this information is hard to put up some code.

ADD REPLY • link 5.8 years ago by massa.kassa.sc3na ▴ 650

0

Entering edit mode

Hey! That's the thing, sometimes it's like a substring and sometimes it's an exact match. Both the files are formatted different on that particular column.

ADD REPLY • link 5.8 years ago by Thanujay S • 0

0

Entering edit mode

Ok, but the point is that the id in final.tsv is never longer then the corresponding id in om.tsv. Is that so?

ADD REPLY • link 5.8 years ago by massa.kassa.sc3na ▴ 650

0

Entering edit mode

Yeah! It's never longer!

ADD REPLY • link 5.8 years ago by Thanujay S • 0

score 1 · Answer 1 · 2019-10-22

1

Entering edit mode

5.8 years ago

massa.kassa.sc3na ▴ 650

Hi, based on the comments and if you files are truly tab separated (the stuff pasted in the question is not). The following should work, although further optimization for speed might be needed.

import pandas as pd

final = pd.read_table('final.tsv', sep='\t', header=None, names=['cv', 'ocrl', 'hp', 'd', 'mondo', 'shortkey'])
om = pd.read_table('om.tsv', sep='\t', header=None, names=['cv', 'longkey', 'hp', 'omim', 'xr'])

om['longkey'] = om['longkey'].str.lower()

short_keys = list(final['shortkey'].unique())


def findmatch(longkey):
    # expecting only words added
    # this could be optimized "a lot" using set operations 
    #  (but it would require mor knowledge about the data)
    for x in short_keys:
        if x in longkey:
            return x

om['matchkey'] = om['longkey'].apply(findmatch)

df = pd.merge(final, om, how='inner', left_on=['hp', 'shortkey'], right_on=['hp', 'matchkey'])
print(df)

ADD COMMENT • link 5.8 years ago by massa.kassa.sc3na ▴ 650

0

Entering edit mode

Thank you for the try! @massa.kassa.sc3na I'm getting an error in this particular line.

for x in short_keys:
        if x in longkey:
            return x

The error message :

TypeError: argument of type 'float' is not iterable

ADD REPLY • link 5.8 years ago by Thanujay S • 0

0

Entering edit mode

Hi, I've tested it with the provided data, is it possible that you have some 'NaN' value somewhere (that's float in python pandas)?

Also you can add

if not isinstance(longkey, str):
    return ''

But I thing that you should pre-process your data.

ADD REPLY • link 5.8 years ago by massa.kassa.sc3na ▴ 650