Question

grep unexpected behaviour

2

9.8 years ago

Illinu ▴ 110

I have a list of sequence names:

c32026_g2_i1
c43297_g1_i9
c45863_g2_i2
c43297_g1_i10
c35765_g2_i1
c44444_g3_i1
...

I want to take each sequence name from the list and extract its annotation from the annotations file.

The command I use (given at the end of this post) doesn't work well. For example, the list contains:

c38478_g1_i5
c38478_g1_i4
c38478_g1_i17
c38478_g1_i9
c38478_g1_i18
c38478_g1_i1

and it outputs:

c38478_g1_i5    1018    Q3B724    ...
c38478_g1_i4    1000    Q3B724    ...
c38478_g1_i17    887    Q3B724    ...
c38478_g1_i9    1007    Q3B724    ...
c38478_g1_i18    738    Q3B724    ...
c38478_g1_i1    496    -    -    -    - ...
c38478_g1_i10    950    -    -    -   ...
c38478_g1_i11    249    Q3B724    ...
c38478_g1_i12    706    -    -    -    ...
c38478_g1_i13    654    -    -    -    ...
c38478_g1_i14    809    -    -    -  ...
c38478_g1_i15    217    -    -    -  ...
c38478_g1_i16    788    Q3B724    ...
c38478_g1_i17    887    Q3B724    ...
c38478_g1_i18    738    Q3B724    ...
c38478_g1_i19    548    -    -    -    ...

for f in $(cat list.OE.txt); do grep $f Trinity_uniref_2015_02_filt_ann_out.txt; done > OE.annocript.txt

grep • 2.7k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Illinu ▴ 110

3

What are the expected input and output? If you are just grepping for the entries of a list in a file, and your list is stored in a file, let's say list.txt, then you can always do

grep -wf list.txt Trinity_uniref_2015_02_filt_ann_out.txt > OE.annocript.txt

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Sam ★ 4.8k

1

You need "-wFf". "-F" treats each pattern as a fixed string rather than a regular expression, which is much faster when list.txt is huge.
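To illustrate the suggestion above, here is a minimal sketch on made-up toy files (the scratch directory, file contents and the `hits` variable are assumptions for the demo, not the asker's data). It should print only the rows whose ID appears in the list as a whole word:

```shell
# Scratch directory with a tiny ID list and a tiny tab-separated annotation file.
tmp=$(mktemp -d)
printf 'c38478_g1_i1\nc38478_g1_i5\n' > "$tmp/list.txt"
printf 'c38478_g1_i1\t496\nc38478_g1_i10\t950\nc38478_g1_i5\t1018\nc38478_g1_i55\t10\n' > "$tmp/ann.txt"

# -f reads the patterns from list.txt, -F treats them as fixed strings,
# -w only accepts matches delimited by non-word characters (here: the tab).
hits=$(grep -wFf "$tmp/list.txt" "$tmp/ann.txt" | cut -f1)
printf '%s\n' "$hits"

rm -rf "$tmp"
```

Note that -w accepts a match anywhere on the line, not only in the first column, so an ID that also occurs in another field would still select that row.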

ADD REPLY • link 9.8 years ago by lh3 33k

0

Yes, that worked :) Thank you

ADD REPLY • link 9.8 years ago by Illinu ▴ 110

Ram · Answer 1 · 2015-03-17

3

9.8 years ago

5heikki 11k

This is expected behavior, since you didn't use the -w flag (see man grep).

For example,

grep c38478_g1_i1

would return lines such as:

c38478_g1_i11
c38478_g1_i1111111
c38478_g1_i122343454
as_long_as_c38478_g1_i1_is_on_the_line
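The difference is easy to check on a toy file. The sketch below uses made-up rows (the scratch directory and the `plain` and `whole` variables are assumptions for the demo) and counts matching lines with and without -w:

```shell
# Toy annotation file in a scratch directory.
tmp=$(mktemp -d)
printf 'c38478_g1_i1\t496\nc38478_g1_i10\t950\nc38478_g1_i11\t249\n' > "$tmp/ann.txt"

# Without -w the ID matches as a substring, so all three rows are counted.
plain=$(grep -c 'c38478_g1_i1' "$tmp/ann.txt")

# With -w the match must be delimited by non-word characters (here: the tab),
# so only the row with exactly that ID is counted.
whole=$(grep -cw 'c38478_g1_i1' "$tmp/ann.txt")

echo "without -w: $plain; with -w: $whole"   # without -w: 3; with -w: 1

rm -rf "$tmp"
```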

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by 5heikki 11k

0

Great, I see! thank you

ADD REPLY • link 9.8 years ago by Illinu ▴ 110

Ram · Answer 2 · 2015-03-17

2

9.8 years ago

tszn1984 ▴ 100

Avoid running grep in a loop: with M IDs and N annotation lines, the whole file is scanned once per ID, which is O(M×N) time.

A more general solution: use a hash (an awk associative array) to store the IDs, then check whether each line of the data file has one of those IDs in a specific column.

awk 'BEGIN{FS=OFS="\t";while((getline<"ids.lst")>0) ids[$1]=1}{if($1 in ids) print}' full_anno.txt
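The same hash lookup can also be written with awk's two-file idiom instead of getline. This is a sketch on made-up toy files (the scratch directory, file contents and the `picked` variable are assumptions for the demo), not a replacement for the command above:

```shell
# Scratch directory with a tiny ID list and a tiny tab-separated annotation file.
tmp=$(mktemp -d)
printf 'c38478_g1_i1\nc38478_g1_i5\n' > "$tmp/ids.lst"
printf 'c38478_g1_i1\t496\nc38478_g1_i10\t950\nc38478_g1_i5\t1018\n' > "$tmp/full_anno.txt"

# NR == FNR holds only while awk reads the first file: store each ID as an array key.
# For the second file, print the rows whose first column is one of the stored keys.
picked=$(awk -F'\t' 'NR == FNR { ids[$1]; next } $1 in ids' "$tmp/ids.lst" "$tmp/full_anno.txt")
printf '%s\n' "$picked"

rm -rf "$tmp"
```

Each file is read once, and because the comparison is an exact match on column 1, c38478_g1_i10 is not selected by c38478_g1_i1.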

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by tszn1984 ▴ 100

0

A more general description of this case:

Data file: a tab-separated table, with the ID in the Nth column
ID list file: a list of IDs, one per line

Usage:

> selectIn datafile IDfile N

selectIn source file:

###################
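#!/bin/sh
#Last-modified: 13 Apr 2013 03:02:05 PM
# Print the rows of Data.txt whose column `col` holds an ID listed in id.lst.
USAGE="Usage: $0 Data.txt id.lst [col=1]"
case $# in
    0|1) echo "$USAGE" >&2
         exit 1;;
    2) col=1;;
    *) col=$3;;
esac
data=$1
# Load the IDs as array keys, then print rows whose COL-th field is a key.
# The ">0" test stops the loop at end of file or if the list cannot be read.
awk -v LST="$2" -v COL="$col" 'BEGIN{FS=OFS="\t";while((getline<LST)>0) count[$1]=1}{if($(COL) in count) print}' "$data"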
###################

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by tszn1984 ▴ 100

score 0 · Answer 3 · 2015-03-17

0

9.8 years ago

Illinu ▴ 110

Just to update, this form:

for f in $(cat list.OE.txt); do grep -w $f Trinity_uniref_2015_02_filt_ann_out.txt; done > OE.annocript.txt

works A LOT faster than this one:

grep -wf list.OE.txt Trinity_uniref_2015_02_filt_ann_out.txt > OE.annocript.txt

ADD COMMENT • link 9.8 years ago by Illinu ▴ 110

0

If you have large files, it will be extremely slow anyway, since you're searching the entire file again and again. For such tasks, there's join (see man join).
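A rough sketch of the join approach on made-up toy files (the scratch directory, file contents and the `joined` variable are assumptions for the demo). join expects both inputs to be sorted on the join field, so both files are sorted first under the same locale:

```shell
# Scratch directory with a tiny ID list and a tiny tab-separated annotation file.
tmp=$(mktemp -d)
printf 'c38478_g1_i5\nc38478_g1_i1\n' > "$tmp/list.txt"
printf 'c38478_g1_i1\t496\nc38478_g1_i10\t950\nc38478_g1_i5\t1018\n' > "$tmp/ann.txt"
tab=$(printf '\t')

# Sort both inputs on the first field with the same collation (LC_ALL=C).
LC_ALL=C sort "$tmp/list.txt" > "$tmp/list.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/ann.txt" > "$tmp/ann.sorted"

# Keep the annotation rows whose first field equals an ID in the list.
joined=$(LC_ALL=C join -t "$tab" "$tmp/list.sorted" "$tmp/ann.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

The output follows the sorted order rather than the order of the original list, which may or may not matter for the annotation table.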

ADD REPLY • link 9.8 years ago by 5heikki 11k