Question

Grep the first match for each line of a pattern file

0

8.3 years ago

cyril.noel10 • 0

Hi,

I have a little problem with the 'grep' tool. I have two files:

pattern_file:

id_gene_1
id_gene_2
id_gene_3

description_file:

id_gene_1 description_xxx
id_gene_2 description_yyy
id_gene_1 description_xxx
id_gene_3 description_zzz
id_gene_2 description_yyy

For each line of 'pattern_file', I would like to find the first match in 'description_file'. I tried grep's -f and -m options, but I only get the first match overall rather than the first match for each pattern.

Any ideas?

Thanks in advance

Grep • 12k views

ADD COMMENT • link updated 4.9 years ago by michael ▴ 10 • written 8.3 years ago by cyril.noel10 • 0

2

It's good practice to give an example of your expected output; it makes it much easier to get the answer you want. I would also recommend having a look at Stack Overflow, since I'm pretty sure this kind of question has been asked there before.

ADD REPLY • link 8.3 years ago by iraun 6.2k

0

I'm confused. You said you want the first match, but you only get the first match?

ADD REPLY • link 8.3 years ago by shenwei356 8.7k

0

To be more precise, the 'description_file' looks like this:

description_file:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_1 description_1-2
id_gene_3 description_3-1
id_gene_3 description_3-2
id_gene_2 description_2-2

So, as output, I would like to have:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_3 description_3-1

ADD REPLY • link updated 4.9 years ago by Ram 45k • written 8.3 years ago by cyril.noel10 • 0

0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

In this case you should have edited your original post and added the information there.

ADD REPLY • link 8.3 years ago by GenoMax 151k

0

8.3 years ago

Pierre Lindenbaum 166k

I would use a sort with option 'stable' and 'unique' followed by a join (here, using 'space' as the delimiter)

join -t ' ' -1 1 -2 1 <(sort pattern_file) <(sort -t ' ' -k1,1 --stable -u description_file)
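To see the sort-then-join idea on the sample data from the question, here is a minimal sketch. It assumes a sort that supports -s (the short form of --stable, available in GNU and BSD sort), and it uses intermediate files instead of process substitution so that it also runs under plain sh. The temporary directory, the *.sorted file names and the `result` variable are illustrative only.

```shell
# Work in a throwaway directory (illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > pattern_file
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > description_file

# sort and join must agree on collation order.
export LC_ALL=C

# Sort the patterns so join can consume them.
sort pattern_file > patterns.sorted

# Stable sort on field 1 with -u: lines with the same key keep their
# input order, and only the first line of each key is kept.
sort -t ' ' -k1,1 -s -u description_file > first_hits.sorted

# Join both files on their first field.
result=$(join -t ' ' -1 1 -2 1 patterns.sorted first_hits.sorted)
printf '%s\n' "$result"
```

Note that the output comes out in sorted key order, not necessarily in the order of the lines in pattern_file.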

ADD COMMENT • link 8.3 years ago by Pierre Lindenbaum 166k

0

4.9 years ago

michael ▴ 10

You were pretty close. This should do the trick.

xargs -I @ grep -w -m 1 @ description_file < pattern_file

You need xargs because grep -f pattern_file -m 1 on its own stops after the first matching line overall, not after the first match for each pattern. Let's break down the command.

We redirect pattern_file into xargs's standard input with < pattern_file. The way I generally read an xargs statement is "for each line, do X". In this case, X is grep -w -m 1 @ description_file. The -I @ bit tells xargs to insert the current line (here, the current pattern) wherever the character @ appears. For example, if the current line read from pattern_file is id_gene_2, then xargs executes grep -w -m 1 id_gene_2 description_file.

Lastly, the -w option tells grep to "select only those lines containing matches that form whole words." This is important because, without -w, the pattern id_gene_1 would also match id_gene_10, id_gene_11, id_gene_12 and so on, since the pattern is present in them too.
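As a quick check, the command can be run against the sample data from the question. This is only a sketch: the temporary directory and the `result` variable are there for illustration, and it assumes a grep that supports -m (GNU and BSD grep both do).

```shell
# Work in a throwaway directory (illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > pattern_file
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > description_file

# xargs runs one grep per pattern line; each grep stops at its
# first whole-word match in description_file.
result=$(xargs -I @ grep -w -m 1 @ description_file < pattern_file)
printf '%s\n' "$result"
```

Because grep is started once per pattern, the output follows the order of pattern_file. On very long pattern files, starting one process per line may be slow.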

ADD COMMENT • link 4.9 years ago by michael ▴ 10

score 1 · Accepted Answer · 2017-01-12

1

8.3 years ago

iraun 6.2k

This should work:

grep -f pattern_file description_file | awk -F" " '!_[$1]++'

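The grep command prints every line of description_file that matches any pattern in pattern_file. The awk command then keeps only the first line seen for each value of the first field. Note that -F" " is awk's default (fields are split on runs of whitespace); change -F if the field separator in your description file is something else.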
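Here is a sketch of the same pipeline on the sample data, with one assumed change: -w is added to grep so that id_gene_1 does not also match a longer identifier such as id_gene_10. The id_gene_10 line is hypothetical and only there to show the difference, and the awk array is named `seen` instead of `_` purely for readability.

```shell
# Work in a throwaway directory (illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > pattern_file
printf '%s\n' \
  'id_gene_10 description_10-1' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > description_file

# grep keeps lines matching any pattern as a whole word;
# awk prints a line only the first time its first field is seen.
result=$(grep -w -f pattern_file description_file | awk '!seen[$1]++')
printf '%s\n' "$result"
```

Here the output follows the order of description_file rather than pattern_file, and the file is read only once.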

ADD COMMENT • link 8.3 years ago by iraun 6.2k