Grep the first match for each line of a pattern file
7.9 years ago

Hi,

I have a little problem with the 'grep' tool. I have two files:


  • pattern_file:

id_gene_1

id_gene_2

id_gene_3

  • description_file:

id_gene_1 description_xxx

id_gene_2 description_yyy

id_gene_1 description_xxx

id_gene_3 description_zzz

id_gene_3 description_zzz

id_gene_2 description_yyy


For each line of the 'pattern_file', I would like to find the first match in the 'description_file'. I thought of using grep's -f and -m options, but I only get the first match overall, not the first match for each pattern.

Any ideas?

Thanks in advance


It's good practice to give an example of your expected output; it makes it much easier to give you the answer you want. I would also recommend having a look at Stack Overflow, since I'm pretty sure this kind of question has been asked there before.


I'm confused. You said you want the first match, but that you only get the first match?


To be more precise, the 'description_file' looks like this:

description_file:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_1 description_1-2
id_gene_3 description_3-1
id_gene_3 description_3-2
id_gene_2 description_2-2

So, as output, I would like to have:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_3 description_3-1

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

In this case you should have edited your original post and added the information there.

7.9 years ago
iraun 6.2k

This should work:

grep -f pattern_file description_file | awk -F" " '!_[$1]++'

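The grep command keeps every line of the description file that matches any pattern in the pattern file. The awk command then prints a line only the first time its first field is seen, so only the first match for each ID is kept. Please change the -F value if the field separator in your description file is not whitespace.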
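To see the pipeline at work, here is a small sketch that runs it on the sample data posted earlier in the thread. The scratch directory and the array name `seen` (a more readable stand-in for `_`) are choices made for this illustration only; the `-F" "` option is dropped because splitting on whitespace is awk's default.

```shell
# Build the two sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > "$workdir/pattern_file"
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > "$workdir/description_file"

# grep keeps every line that matches any pattern; awk prints a line only
# the first time its first field ($1) has been seen.
result=$(
  cd "$workdir" || exit 1
  grep -f pattern_file description_file | awk '!seen[$1]++'
)
printf '%s\n' "$result"
```

The output follows the order of the description file, not the order of the pattern file. As another answer in this thread points out, grep without -w would also let the pattern id_gene_1 match a line starting with id_gene_10, so adding -w to the grep call is probably worth considering if such IDs exist.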

7.9 years ago

I would use a sort with the 'stable' and 'unique' options, followed by a join (here, using a space as the delimiter). The process substitution syntax <( ... ) requires bash or a similar shell.

join -t ' ' -1 1 -2 1 <(sort pattern_file) <(sort -t ' ' -k1,1 --stable -u description_file)
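The same sort-then-join idea can be sketched with intermediate files instead of process substitution, which should also run in shells that lack `<( ... )`. The file names `patterns.sorted` and `first_hits.sorted` are made up for this example, `-s` is the short form of `--stable` in GNU sort, and `LC_ALL=C` is set only so that sort and join agree on the collation order.

```shell
# Sample data from the thread; the pattern file is deliberately unsorted.
workdir=$(mktemp -d)
printf '%s\n' id_gene_3 id_gene_1 id_gene_2 > "$workdir/pattern_file"
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > "$workdir/description_file"

joined=$(
  cd "$workdir" || exit 1
  export LC_ALL=C
  # join needs both inputs sorted on the join field.
  sort pattern_file > patterns.sorted
  # Stable sort on field 1, then keep only the first line of each run of
  # equal keys, i.e. the first description seen for each ID.
  sort -t ' ' -k1,1 -s -u description_file > first_hits.sorted
  join -t ' ' -1 1 -2 1 patterns.sorted first_hits.sorted
)
printf '%s\n' "$joined"
```

Note that the result comes out sorted by ID rather than in the original order of the pattern file.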
4.4 years ago
michael ▴ 10

You were pretty close. This should do the trick.

xargs -I @ grep -w -m 1 @ description_file < pattern_file

You have to use xargs as using grep -m 1 on its own will stop printing any matches after the first one. Let's break down the command.

We feed the pattern_file to xargs on standard input with < pattern_file. The way I generally read an xargs statement is "for each line, do X". In this case, X is grep -w -m 1 @ description_file.

The -I @ bit tells xargs that wherever I use the character @, it should insert the line (in this case the current pattern). As an example, if the current line being read from pattern_file was id_gene_2, then what xargs would execute is grep -w -m 1 id_gene_2 description_file.

Lastly, the -w option tells grep "Select only those lines containing matches that form whole words." This is important because if your pattern is id_gene_1, without -w, grep would also match this pattern to id_gene_10 or 11 or 12 etc., as the pattern is present in them too.
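A small sketch of the effect of -w, run on a cut-down version of the thread's data. The id_gene_10 line is not in the original files; it is added here purely to illustrate the whole-word issue described above.

```shell
# Scratch directory with a two-pattern file and a description file whose
# first line (id_gene_10) is a hypothetical addition for this demo.
workdir=$(mktemp -d)
printf '%s\n' id_gene_1 id_gene_2 > "$workdir/pattern_file"
printf '%s\n' \
  'id_gene_10 description_10-1' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' > "$workdir/description_file"

# With -w: one grep per pattern, whole-word matches only, stop after the
# first matching line.
with_w=$(
  cd "$workdir" || exit 1
  xargs -I @ grep -w -m 1 @ description_file < pattern_file
)

# Without -w: id_gene_1 also matches the id_gene_10 line, which comes first.
without_w=$(
  cd "$workdir" || exit 1
  xargs -I @ grep -m 1 @ description_file < pattern_file
)

printf 'with -w:\n%s\n' "$with_w"
printf 'without -w:\n%s\n' "$without_w"
```

Because xargs walks through the pattern file line by line, the output here follows the order of the pattern file. One grep process is started per pattern, so this may be slow for very long pattern files.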
