Question

Match And Merge Information By Id In Two Files

4

Entering edit mode

11.6 years ago

liupfskygre ▴ 210

Hi all,

I have two files (a.txt; b.txt), columns were separate by tab,

a.txt:

id value1 value2 value3

g1 v1 v2 good

g2 v3 v4 better

b.txt:

id value1 value2

g1 v1 v2

g3 v3 v4

I want to use id as the keywords on use perl to search the same id in the two files,

if id in a==id in b, like g1

then append value3 of g1 in a.txt to the end of line g1 in b.txt

I want to use this method to deal with my GO-gene association file and gff files, and I am just beginning to use Perl. I know it is not a good manner to ask for scripts, so just me some hints on where to begin. Any suggestions are appreciated!

perl id • 14k views

ADD COMMENT • link updated 8.2 years ago by Rashedul Islam ▴ 480 • written 11.6 years ago by liupfskygre ▴ 210

Ram · Answer 1 · 2014-01-19

1

Entering edit mode

11.6 years ago

Pavel Senin ★ 1.9k

if your files (sets) are not huge, you can handle this easily using R, without resorting to Perl:

# read data in
a = read.table("~/tmp/a.txt",header=T)
b = read.table("~/tmp/b.txt",header=T)

# merge datasets
merge(a,b,by="id")
  id value1.x value2.x value3 value1.y value2.y
1 g1       v1       v2   good       v1       v2


# extract results
c = merge(a,b,by="id")
res=c[,c(1,6,5,4)]
res
  id value2.y value1.y value3
1 g1       v2       v1   good

# produce output
write.table(res, file="~/tmp/c.txt",quote=F)

R is more interactive - easier to re-run and take care about other things (comparison constraints, etc), plus it allows to run script in batch mode. But Perl is more streamlined and command-line oriented if you after that.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 11.6 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

Thank you for your reply! I have a look at your profile, and found, wow, I have introduced several of your paper in my lab journal club about Verrucomicrobia!

ADD REPLY • link 11.6 years ago by liupfskygre ▴ 210

0

Entering edit mode

thank you, those are not exactly mine, but our team effort, I just performed some data processing

ADD REPLY • link 11.6 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

another questions, how could I keep the rows not include in the merge file, I mean data (a-c) and (b-c)? Thank you!

ADD REPLY • link 11.6 years ago by liupfskygre ▴ 210

0

Entering edit mode

I read Phil Spector's book Data Manipulation with R, I found 'merge' has several parameters which I could use of! Thanks!

ADD REPLY • link 11.6 years ago by liupfskygre ▴ 210

0

Entering edit mode

RStudio is a nice way to keep things tidy. If in R you type ?merge it will show you detailed help about the command.

ADD REPLY • link 11.6 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

Thanks you for this tip!

ADD REPLY • link 11.6 years ago by liupfskygre ▴ 210

0

Entering edit mode

Merging the two files and then parsing the results for this task is not a good solution--in any language. Also, your comments about Perl are rather puzzling.

ADD REPLY • link 11.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

Let me step back a bit. I think that Intersection) is one of the standard (basic) operations from sets theory, and can be handled in a variety of ways in any language. Not handling this, is the "suicide" for any language. R is interactive, Perl is not. edit: biostars doesnt handle that wikipedia link, sorry.

ADD REPLY • link 11.6 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

Add only what you need to add--and nothing more. Adding the informational 'noise' by merging everything subsequently required removing that 'noise.' I still have no idea what you mean by "R is interactive, Perl is not."

ADD REPLY • link 11.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

By default, R starts interactively, while Perl is not really designed that way. I might be wrong, but merging (joining) two sets in a deterministic way requires some sort of indexing or sorting and is computationally expensive probably not better than O(nlogn), unless you are ok with some errors. In my solution I read everything in the memory, but merge operation took care about sorting and selection.

ADD REPLY • link 11.6 years ago by Pavel Senin ★ 1.9k

Ram · Answer 2 · 2014-01-19

1

Entering edit mode

11.6 years ago

matted 7.8k

There are many ways to do this join task. To avoid small bugs from writing your own custom code, and the probable inefficiencies of a naive implementation, you could use the join utility on a Linux system. It should do what you want.

$ join -j 1 -t " " a.txt b.txt
id value1 value2 value3 value1 value2
g1 v1 v2 good v1 v2

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 11.6 years ago by matted 7.8k

0

Entering edit mode

good solution. you may want to add an awk call to shuffle columns in a way OP wants.

ADD REPLY • link 11.6 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

The OP's wanted output line pattern consists of four fields:

g1    v1    v2    good

Your routine's output adds two extra fields:

g1 v1 v2 good v1 v2

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

This is trivial to fix...

ADD REPLY • link 11.6 years ago by matted 7.8k

0

Entering edit mode

Why knowingly leave an erroneous answer for the OP to fix?

ADD REPLY • link 11.6 years ago by Kenosis ★ 1.3k

Ram · Answer 3 · 2017-05-26

1

Entering edit mode

8.2 years ago

Rashedul Islam ▴ 480

You can use dplyr in R as well.

a = read.table("~/tmp/a.txt",header=T)  
b = read.table("~/tmp/b.txt",header=T)

library(dplyr)  
inner_join(a, b)

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 8.2 years ago by Rashedul Islam ▴ 480

Ram · Answer 4 · 2014-01-19

0

Entering edit mode

11.6 years ago

Kenosis ★ 1.3k

This can be done in two steps: 1) split each line of a.txt and build a hash, where the id is a key and value3 is the key's associated value, and 2) split each line of b.txt and print the line plus value3 if the current id exists as a key in the hash:

use strict;
use warnings;

my ( @cols, %hash );

while (<>) {
    $hash{ $cols[0] } = $cols[-1] if @cols = split;
    last if eof;
}

while (<>) {
    print +( join "\t", @cols, $hash{ $cols[0] } ), "\n"
      if @cols = split and exists $hash{ $cols[0] };
}

Usage: perl script.pl a.txt b.txt [>outFile.txt]

The last, optional parameter directs output to a file.

The last if eof; is placed within the first while since both files would be completely read without it. The + after the print in the second while helps perl (the interpreter) disambiguate what to do.

This one-liner will produce the same results:

perl -ane '$h{$F[0]}=$F[-1];last if eof;END{while(<>){@F=split; s/\K$/\t$h{$F[0]}/ and print if$h{$F[0]}}}' a.txt b.txt [>outFile.txt]

Hope this helps!

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 11.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

I am astonished by all the discussions and suggestiones you brought. For a beginner, I really learned a lot from this. I have tried the merge way in R, use M=merge(a, b, all=TRUE), a, b is my gene list and DE list, they have a common id column. there about 3000 lines in a and b. after this merge, I think they will merge all things of a and b by the id column, and thing all indentical in a or b would be set NA, as the ?merge said. but the results showed that before 1095 lines all things were what I wanted, but after that all were messed！ I do not know whether merge has limitation for lines or file size. By the way, I think this link illustrate merge very nice http://dss.princeton.edu/training/Merge101R.pdf

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.6 years ago by liupfskygre ▴ 210

0

Entering edit mode

after use Rstudio to import things without using command, I found it worked:

import dataset-->import from txt-->heading, yes; tab, period, none

command showed as following:

> HZ254_phzH_Deseq.out <- read.delim("~/HZ254_phzH_Deseq out.txt", quote="")
> View(HZ254_phzH_Deseq.out)
> HZ254_genedata_ensembl <- read.delim("~/HZ254_genedata_ensembl.txt", quote="")
> View(HZ254_genedata_ensembl)
> M=merge(HZ254_phzH_Deseq.out,HZ254_genedata_ensembl,all=TRUE)
> write.table(M,file='HZ254_DEfull1.xls',row.names=F,quote=F,sep='\t')

command causing mess I mentioned above as following:

a= read.table("HZ254_phzH_Deseq out.txt",header=T,sep='\t')
b= read.table("HZ254_genedata_ensembl.txt",header=T,sep='\t') 
M=merge(a,b,all=TRUE)

and I also use the command derived from Rstudio, it also worked.

So, I think the problem is the way of data import, what is difference between the them?

Thanks!

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.6 years ago by liupfskygre ▴ 210

0

Entering edit mode

Hi liupfskygre.

My apologies for responding so late, but just saw your replies.

Am glad you've found the discussions here beneficial. I've seen so much expertise and have been quite impressed. Am not familiar with Rstudio, so I can't comment about the data import issue.

The best to you.

ADD REPLY • link 11.6 years ago by Kenosis ★ 1.3k