In this forum, a frequent pitfall is to use grep with -f alone. I gave an earlier answer in Retrieve Single Vcf From List Of Snp Rs#, but I feel it is worth expanding that answer a little here.
grep has an option -f. According to its manpage, "-f FILE: Obtain patterns from FILE, one per line." Each line is interpreted as a regex pattern. I speculate that in this case grep -f loads all the patterns into RAM and, for each input line, tries each pattern in turn. Thus if there are 1000 lines in the pattern file, grep -f will be roughly 1000 times as slow as grepping a single pattern (i.e. the running time is linear in the number of patterns).
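For concreteness, the slow usage looks something like this (the file names are hypothetical: ids.txt holds ~1000 rsIDs, one per line, and calls.vcf is the file being filtered):

grep -f ids.txt calls.vcf > hits.vcf   # every ID is treated as a separate regex pattern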
grep also has an option -F, which is equivalent to fgrep. fgrep uses the Aho–Corasick algorithm, which builds one DFA for all of the pattern strings; when grep pushes an input line through this DFA, all patterns are tried at the same time. Thus the speed of grep -F is largely independent of the number of patterns (i.e. roughly constant in the number of patterns). For fixed-string matching we should use grep -Ff; otherwise it will be extremely slow (see the benchmark in Retrieve Single Vcf From List Of Snp Rs#). In addition, it is recommended to add the -w option to reduce false matches.
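Sticking with the same hypothetical files, the difference the two options make can be seen directly:

grep -Ff ids.txt calls.vcf    # fast, but an ID like rs123 would also match rs1234, rs12345, ...
grep -wFf ids.txt calls.vcf   # fast, and rs123 only matches where it forms a whole word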
I fell into this pitfall myself for many years. I used to tell my colleagues not to use grep -f because it is slow. But one day I suddenly felt something was wrong: the Aho–Corasick algorithm should be about as fast as grepping a single pattern, so why is grep -f so slow? When I reread the manpage, it became obvious that I had been misusing grep for a long time. That said, I have to say the design of both -f and -F is really confusing.
EDIT: the approximate answer is:
grep -wFf 2.txt 1.txt > 3.txt
which works most of the time, but may give false positives in rare cases where a pattern also appears in another column. The accurate answer is:
awk 'BEGIN{while((getline<"2.txt")>0)l[$1]=1}l[$1]' 1.txt > 3.txt
which works all the time.
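For readability, here is the same awk one-liner with the logic spelled out in comments (behavior unchanged):

awk 'BEGIN{
    # read 2.txt once and remember every value in its first column
    while ((getline < "2.txt") > 0) l[$1] = 1
}
# print a line of 1.txt only when its first column was seen in 2.txt
l[$1]' 1.txt > 3.txt

Because the comparison is restricted to the first column, a pattern that happens to appear elsewhere on a line cannot cause a false positive.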
grep would have a nicer design if -w only broke words on whitespace; as it is, '|' and other common symbols count as word breaks. I agree, it's easy to get burned. I just wish -w were a bit different, or configurable in some way. It would make scenarios like these easier.
You say here to add -w to reduce false positive matches, but in the linked answer you say it may create false positive matches. That is confusing; it seems like the former should be correct, and I don't understand how this changes the output if -F is treating the input as fixed strings. It seems like -F alone should do it, but it's hard to tell looking at one example.

The -F can match shorter strings, and the -w could fix that if it behaved more nicely. If the file has "hugabear papa" and you use something like
grep -F -e "huga" file
It'll match. The -w will enforce that an entire word matches: you'd need to specify -e "hugabear" (or -e "papa") to get a match. As I stated before, -w has the limitation that it breaks words on many non-whitespace characters, which in those cases essentially degenerates back to using just -F (with the same match-on-shorter-strings problem).
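To make that concrete, a small sketch (the throwaway files t1 and t2 are just for illustration):

printf 'hugabear papa\n' > t1
grep -wF -e "huga" t1 || echo "no match"   # -w blocks the partial match inside "hugabear"

printf 'huga|bear papa\n' > t2
grep -wF -e "huga" t2                      # matches: '|' is not a word character, so "huga" still counts as a whole word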
Good example, I can see clearly how -w matches differently now.

If I have a header in both 1.txt and 2.txt, does this command still work?