Hello all! SO guys have two files, both of them are tab-separated text files, trying to merge both the files based on two columns. The two files are unsorted and doesn't contain a header. Another thing is that Final.tsv is large containing about 2 million rows.
final.tsv
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000028 Cryptorchidism MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000083 Renal insufficiency MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000091 Abnormal renal tubule morphology MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000093 Proteinuria MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000121 Nephrocalcinosis MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000164 Abnormality of the dentition MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000189 Narrow palate MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000194 Open mouth MONDO:0010645 oculocerebrorenal syndrome
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000219 Thin upper lip vermilion MONDO:0010645 oculocerebrorenal syndrome
om.tsv
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000028 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000083 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000093 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000501 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000505 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000519 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000568 OMIM:309000 XLR
309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000718
So the task here is match the 6th and 3rd column of the final.tsv file with the 2nd and 3rd column of the om.tsv file. On matching of the two columns, the two files should be merged and saved in a match file. If there is not matched the entire line should be printed into another mismatch file. Also note that i need a case-insensitive approach where the match is based on keywords.
Acc to the above eg oculocerebrorenal syndrome should match LOWE OCULOCEREBRORENAL SYNDROME.
Output
ClinVarVariant:208014 OCRL:exon 6-12 del HP:0000028 Cryptorchidism MONDO:0010645 oculocerebrorenal syndrome 309000 LOWE OCULOCEREBRORENAL SYNDROME HP:0000028 OMIM:309000 XLR
A lot of different approaches like awk, join and even a few pandas approaches were tried to solve this complexity. Any suggestions! Thanks in advance! :)
Hi, do you mean that the 6th and 3rd column to act like key (e.g. "HP:0000028 - oculocerebrorenal syndrome" should only match "LOWE OCULOCEREBRORENAL SYNDROME - HP:0000028" and not "LOWE OCULOCEREBRORENAL SYNDROME - HP:0000083").
Since in the
om.tsv
you have different key in the 2nd column then in the 6th column fromfinal.tsv
, what are other examples (or how do you define a match? Is the key inom.tsv
always longer then infinal.tsv
)?Hey! Yeah exactly. It should act like a key and only match where both the name and the HP ID matches. In om.tsv the combination of these columns are scattered, I wanted to find the combination and print it along with the final.tsv. So the two columns of final.tsv will act as a key and would search for the rows in om.tsv, print the matches along final.tsv row as a third file.
Hello Thanujay S,
In your example we do not really see where are the spaces and where are the tabs. I can suggest you to use semi colon to better represent your example.
Also,
Can you detail what you expect to be a match and what is not. If you have
oculocerebrorenal syndrome
andNOT OCULOCEREBRORENAL SYNDROME
, you expect it to be a match ? Oroculocerebrorenal
should match withOCULOCEREBRORENAL SYNDROME
?There are some example here using semi colon as separator and the merge is done on 3 columns and also some there
Okay! According to the example ive put forward, I meant that oculocerebrorenal syndrome is a keyword and even if LOWE OCULOCEREBRORENAL SYNDROME is present, it should act as a match. And the suggestions you give must be case insensitive, as om.tsv has all uppercase in the name column and final.tsv has lowercase. Thank you for the links! :)
Hi @Thanujay S, but what should be considered as a match? Do
om.tsv
always contain the longer string (e.g.LOWE OCULOCEREBRORENAL SYNDROME
) andfinal.tsv
its substring (e.g.OCULOCEREBRORENAL SYNDROME
)?Without this information is hard to put up some code.
Hey! That's the thing, sometimes it's like a substring and sometimes it's an exact match. Both the files are formatted different on that particular column.
Ok, but the point is that the id in
final.tsv
is never longer then the corresponding id inom.tsv
. Is that so?Yeah! It's never longer!