Question

how to match column entries among files?

0

Entering edit mode

7.0 years ago

v27 ▴ 10

I have 3 files : File1

WP_082277722.10 gi|37527581|ref|NP_930925.1|    59.420  138 46  3   141 275 2895    3025    1.34e-41    157 49
WP_082277722.10 gi|91779924|ref|YP_555132.1|    49.254  134 56  3   151 276 2074    2203    6.02e-27    115 46
WP_082277722.10 gi|212640077|ref|YP_002316597.1|    33.981  103 57  5   180 275 403 501 0.008   42.7    35
WP_082277722.10 gi|384147149|ref|YP_005529965.1|    37.681  69  35  3   210 276 2019    2081    0.032   41.2    24
WP_082277722.10 gi|212640328|ref|YP_002316848.1|    32.323  99  56  5   184 275 469 563 0.052   40.0    33
WP_082277722.10 gi|338738658|ref|YP_004675620.1|    26.761  71  43  1   205 275 4   65  0.99    35.0    26

File 2

158333741,Acaryochloris_marina_MBIC11017_uid58167,158333741,432,1,432,COG0001,0,
158339504,Acaryochloris_marina_MBIC11017_uid58167,158339504,491,1,491,COG0001,0,
379012832,Acetobacterium_woodii_DSM_1030_uid88073,379012832,430,1,430,COG0001,0,
302391336,Acetohalobium_arabaticum_DSM_5501_uid51423,302391336,441,1,441,COG0001,0,
311103820,Achromobacter_xylosoxidans_A8_uid59899,311103820,425,1,425,COG0001,0,
332795879,Acidianus_hospitalis_W1_uid66875,332795879,369,1,369,COG0001,0,
332796307,Acidianus_hospitalis_W1_uid66875,332796307,416,1,416,COG0001,0,
37527581,Acetohalobium_arabaticum_DSM_5501_uid51423,302391336,441,1,441,COG0001,0

,

File3

    COG0001 H   Glutamate-1-semialdehyde aminotransferase
COG0002 E   N-acetyl-gamma-glutamylphosphate reductase
COG0003 P   Anion-transporting ATPase, ArsA/GET3 family
COG0004 P   Ammonia channel protein AmtB
COG0005 F   Purine nucleoside phosphorylase
COG0006 E   Xaa-Pro aminopeptidase
COG0007 H   Uroporphyrinogen-III methylase (siroheme synthase)
COG0008 J   Glutamyl- or glutaminyl-tRNA synthetase
COG0009 J   tRNA A37 threonylcarbamoyladenosine synthetase subunit TsaC/SUA5/YrdC
COG0010 E   Arginase family enzyme
COG0011 S   Uncharacterized conserved protein YqgV, UPF0045/DUF77 family
COG0012 J   Ribosome-binding ATPase YchF, GTP1/OBG family

Expected outcome :

WP_082277722.10 gi|37527581|ref|NP_930925.1|  37527581,Acetohalobium_arabaticum_DSM_5501_uid51423,302391336,441,1,441,COG0001,0  COG0001    H   Glutamate-1-semialdehyde aminotransferase

command line programming • 2.3k views

ADD COMMENT • link updated 7.0 years ago by Ram 45k • written 7.0 years ago by v27 ▴ 10

0

Entering edit mode

what have you tried ?

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

i can do this in excel ..... but i have 2500 files for which i need to repeat the procedure .

ADD REPLY • link 7.0 years ago by v27 ▴ 10

3

Entering edit mode

enter image description here

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

Stealing that.... :P

ADD REPLY • link 7.0 years ago by Joe 22k

1

Entering edit mode

Write a macro? You have to get into a programming mindset, end of the day it doesn't matter what tool you use, only matters how you use that tool and how well you know it. Excel is deceptively complex. You will encounter problems that you cannot explain, get to the root of or even reproduce at times. Excel does not preserve state or document logic. When it comes to problems such as yours, it's just a glorified calculator.

Use R or python, something that helps you learn the programming mindset and document your actions, as you will need to do this again in 6 months' time and programming means you'll have to spend only a fraction of this time then.

ADD REPLY • link 7.0 years ago by Ram 45k

0

Entering edit mode

Excel is a hammer. The problem before you is a screw. Yes, it's a big hammer, and if you swing hard enough, you can make Excel do what you want, but it would be smarter to find a screwdriver. R (or Rstudio might be a bit easier to work with) can do this.

ADD REPLY • link 7.0 years ago by swbarnes2 15k

0

Entering edit mode

This question is presented without a proper explanation in body of the post. Please edit the original question and add this information there.

Instead of just posting parts of three files you should spell out which parts of the three are supposed to contain a key that can be used to relate the records. I tried to match a couple three numbers from file 1 but could not do it visually.

WP_082277722.10 gi|384147149|ref|YP_005529965.1| Is the gi number 384147149 supposed to be in file 2? Then one needs to match the COG id from that line in file 2 with the list in file 3?

ADD REPLY • link 7.0 years ago by GenoMax 152k

0

Entering edit mode

The examples posted here seem incomplete, which OP should have mentioned. While OP does give an example of desired output, they should also point out the logic.

I suspect this is part of their learning the mindset though, so once they do learn they will no longer need any help with this sort of problem :-)

ADD REPLY • link 7.0 years ago by Ram 45k

0

Entering edit mode

Don't forget to follow up on your previous threads, multiple posts are without accepted answer:

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY • link 7.0 years ago by Ram 45k

score 2 · Answer 1 · 2018-07-26

2

Entering edit mode

7.0 years ago

Ram 45k

Use R, not the command line. You will need to read in the data files, preprocess them a little to clean and get your identifiers in separate columns, and then use functions such as merge or dplyr joins to achieve your goal.

This is a good exercise, so I will not be giving you the code.

ADD COMMENT • link 7.0 years ago by Ram 45k

3

Entering edit mode

Use linux join , not R , You will need to read in the data files, extract a column with awk -F '|' '{printf("%s\t%s\n",$2,$0);}' then a little to clean and get your identifiers in separate columns.

This is a good exercise, so I will not be giving you the code.

;o)

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

OP's going to need either a bunch of temporary files or a bunch of process substitutions. I'd go with R, this is sophisticated enough for a beginner to use a easier tool than plain linux, I think. I mean, you would obviously go for linux join and paste and whatnot, and I'd dabble too, but R is just easier here.

ADD REPLY • link 7.0 years ago by Ram 45k

0

Entering edit mode

of course, you're right, I'm just kidding :o)

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

could you please give me the code .. i have to repeat this for 2500 files. code would make it a lot easier for me

ADD REPLY • link 7.0 years ago by v27 ▴ 10

4

Entering edit mode

You should at least show some effort as Pierre pointed out above. He and Ram will no doubt be happy to help you, but not do your work for you ;)

You will never improve your skills if you don't try and get your hands dirty :)

ADD REPLY • link 7.0 years ago by Joe 22k

0

Entering edit mode

just running on a deadline right now...

ADD REPLY • link 7.0 years ago by v27 ▴ 10

0

Entering edit mode

OK, I will give you the code for file1 only. You should be able to do the rest yourself.

Let's say the file is called file1.txt. In R you'll need:

f1 <- read.table('file1.txt', header=FALSE, sep="\t");
f1 <- cbind(f1, t(data.frame(strsplit(f1[,2], split="|", fixed=TRUE))))

This will split the delimiters in file1, so you can merge objects.

step1 <- merge(x=f1, y=f2, by.x = <file1_colname>, by.y=<file2_column1name>, all=T)
step2 <- merge(x=step1, y=f3, by.x = <col_with_COGids>, by.y=<file3_column1name>, all=T)

step2 will have the data you need, which you can then use apply and paste(collapse=",") to get the remaining columns. Talk to your supervisor and see if you can get your deadline extended - this task might take a while.

Hint: Once you read file2.txt into f2, Just use apply and paste(collapse=",") on everything but the ID column before you merge. That way, your processing at the end can be avoided and you can just rearrange and omit columns to get to your desired output.