Question

How do I match two files using one column as reference?

0

Entering edit mode

4.7 years ago

eennadi ▴ 40

I have two files with with two columns the. File 1 CBS11016_00000185-RA h. CBS11016_00000186-RA h. CBS11016_00000187-RA h. CBS11016_00000188-RA h. CBS11016_00000189-RA h. CBS11016_00000190-RA h. CBS11016_00000191-RA h. CBS11016_00000192-RA h. CBS11016_00000193-RA h. CBS11016_00000194-RA h. CBS11016_00000195-RA h. CBS11016_00000196-RA h. CBS11016_00000197-RA h. CBS11016_00000198-RA h. CBS11016_00000199-RA h. CBS11016_00000200-RA h. CBS11016_00000201-RA h. CBS11016_00000202-RA h. CBS11016_00000203-RA h. CBS11016_00000204-RA h. CBS11016_00000205-RA h. CBS11016_00000206-RA h. CBS11016_00000207-RA h. CBS11016_00000208-RA h. CBS11016_00000209-RA h. CBS11016_00000210-RA h. CBS11016_00000211-RA h. CBS11016_00000212-RA h. CBS11016_00000213-RA h. CBS11016_00000214-RA h. CBS11016_00000215-RA h. CBS11016_00000216-RA h. CBS11016_00000217-RA h. CBS11016_00000218-RA h. CBS11016_00000219-RA h. CBS11016_00000220-RA h. CBS11016_00000221-RA h. CBS11016_00000222-RA h. CBS11016_00000223-RA h. CBS11016_00000224-RA h.

File 2

CBS11016_00000185-RA Dip5p CBS11016_00000194-RA hypothetical protein CBS11016_00000189-RA hypothetical protein CBS11016_00000215-RA Threonine--tRNA ligase CBS11016_00000199-RA NAD-dependent protein deacylase CBS11016_00000190-RA Cellulase domain-containing protein CBS11016_00000223-RA 40S ribosomal protein S16 CBS11016_00000224-RA hypothetical protein CBS11016_00000225-RA hypothetical protein CBS11016_00000226-RA hypothetical protein CBS11016_00000230-RA 40S ribosomal protein S1 CBS11016_00000231-RA hypothetical protein CBS11016_00000232-RA 60S ribosomal protein L6 CBS11016_00000233-RA FAS1 domain-containing protein CaO19.3004 CBS11016_00000235-RA hypothetical protein CBS11016_00000239-RA Octanoyltransferase CBS11016_00000245-RA hypothetical protein CBS11016_00000247-RA SsDNA-dependent ATPase CBS11016_00000248-RA MRP-S28 domain-containing protein CBS11016_00000249-RA hypothetical protein

I want a file that can use the column one of file 1 as a reference and match it with the equivalence in column 2 of file 2 I had used awk 'NR==FNR{a[$1]=$2;next} {print $0,a[$1]}' file1 flie2 > newfile

But it rather would combine the column 2 and shorten the length of the strings in file 2. Can any help with a better expression?

awk command line • 1.4k views

ADD COMMENT • link updated 4.7 years ago by Yean ▴ 140 • written 4.7 years ago by eennadi ▴ 40

0

Entering edit mode

Please format this example data using code button. In present format it is illegible. You can only provide a couple of rows.

code_formatting

Thank you!

ADD REPLY • link 4.7 years ago by GenoMax 149k

0

Entering edit mode

Cross-posted: https://stackoverflow.com/questions/62262119

ADD REPLY • link 4.7 years ago by h.mon 35k

0

Entering edit mode

Check merge function in R.

ADD REPLY • link 4.7 years ago by grant.hovhannisyan ★ 2.6k

0

Entering edit mode

try sqldf in R.

ADD REPLY • link 4.7 years ago by cpad0112 21k

0

Entering edit mode

Take a look at the commandline tools sort and join.

ADD REPLY • link 4.7 years ago by Joe 22k

0

Entering edit mode

Check this post (in R)

https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right

ADD REPLY • link 4.7 years ago by Yean ▴ 140

score 0 · Answer 1 · 2020-06-08

0

Entering edit mode

4.7 years ago

gdaly9000 ▴ 10

I think Python Pandas offers a great interface to do these types of merges. The R Tidyverse has similar functionality (I believe dplyr). You can actually perform some relatively complicated merges with Pandas.

ADD COMMENT • link 4.7 years ago by gdaly9000 ▴ 10

score 0 · Answer 2 · 2020-06-08

Don't think I have ever seen so many people follow the thread where the question is unreadable. Or so many posters answer the unreadable question, for that matter. I will join in the fun.

This is a python solution, and it assumes that you have pandas installed. Let's say your two files are called file1.csv and file2.csv and that you want to merge them using the shared ID column.

import pandas as pd
f1 = pd.read_csv('file1.csv')
f2 = pd.read_csv('file2.csv')
f3 = f1.merge(f2, left_on='ID', right_on='ID', how='outer')
f3.to_csv('file3.csv', index=False)