To have grep do exact-word matches against a file of strings, use -w and -F together:

$ grep -w -F -f 1.txt 2.gff > 3.gff

The -w option restricts matches to whole words. The -F option treats each pattern as a fixed string rather than a regular expression. Using -f on its own will consume a lot of memory, because every pattern in 1.txt is treated as a regular expression, and without -w it will also match strings from 1.txt that occur as substrings of longer strings in 2.gff.
In other words, if you have a string like 12345 in 1.txt, then grep without -w will match every string in 2.gff that contains 12345: the exact word 12345 (what you want), but also 123456, 1234567, 012345, etc. The first match is the one you are after; all the other matches that merely contain 12345, like 123456 and 1234567, are probably not what you want.
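For illustration, on a tiny made-up input:

$ printf '12345\n123456\n012345\n' | grep '12345'     # substring match: all three lines hit
$ printf '12345\n123456\n012345\n' | grep -w '12345'  # whole-word match: only the 12345 line hits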
So by combining -w and -F, you get an exact match for each string you provide: 12345 will only produce a hit on the word 12345, and not on any other string in which 12345 is a prefix, suffix, or substring.
As a bonus, using -F makes grep consume a great deal less memory: regular-expression matching needs lots of memory, while fixed-string matching does not.
grep -f can indeed consume a ton of memory. Count yourself lucky that you got a clean "out of memory" error; I have crashed a 500 GB RAM server with a nasty grep -f.
@jaqx008: Are you grep'ing something super secret? Can you provide an actual example so we can avoid this endless back and forth? As @ATPoint said, there may be an efficient way of doing whatever you are trying to do without using grep.
This is what I am trying to do: there are some gene IDs in a text file A, and lines possibly containing those gene IDs in a GFF file B. I am trying to identify the matches in B and output all the matching lines (this should output all the columns of B). A has only one column of gene IDs; B has multiple columns, and one of those columns holds the gene IDs. So my command should pull out the lines of B that correspond to the IDs in A. Below is the command
Can you give us an idea of the numbers we're talking about here? File sizes? List length?
For most cases, a solution like the one offered by arup (i.e., loop through your list and grep for each entry) will solve your issue.
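For illustration, a minimal version of that loop might look like this (A.txt and B.gff are placeholders for your ID list and your GFF file):

# grep for each ID in turn, as a whole, fixed word
$ while read -r id; do grep -w -F "$id" B.gff; done < A.txt > matches.gff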
OK. The text files range between 100 bytes and about 50 kB, with a word count around 1500, and in this format, while the .gff files range from about 600 kB to about 900 kB, with a wc of about 5000 to 7000, in the format below:
##gff-version 3
Also, I did try the loop, but it exits with the error
Where did you get your files from, and on which operating system? Did you open them in Windows or so? I had to look it up myself, but apparently this 'error' is related to the encoding of your file.
If you have dos2unix (or mac2unix) installed, you might run that on your files first to convert them to proper Unix line endings.
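For example (assuming your ID list is A.txt):

$ dos2unix A.txt
# or, if dos2unix is not installed, strip the carriage returns by hand:
$ tr -d '\r' < A.txt > A.unix.txt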
What is the output of ulimit -a?
Oh I see. Well, the ulimit -a output is
Your account does not have a limit, so the error is due to one of the possibilities enumerated by others.
What are you grep-ing? Maybe a more efficient way would be to use tabix, depending on what you want to retrieve.
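Note that tabix looks features up by coordinate rather than by gene ID, so it only helps if your query can be phrased as regions. A rough sketch (file names and the region are placeholders):

# sort the GFF by chromosome and start position, compress with bgzip, and index
$ (grep '^#' B.gff; grep -v '^#' B.gff | sort -t$'\t' -k1,1 -k4,4n) | bgzip > B.sorted.gff.gz
$ tabix -p gff B.sorted.gff.gz
# pull every feature overlapping a region
$ tabix B.sorted.gff.gz chr1:5000-10000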
Well, I can't post more than 5 times in 6 hours, and that's why I haven't responded. BTW, I am grep-ing a text file like
against a .gff file like
4634 - ID=2345
4353 + ID=3245
etc. It's working for some and giving a memory complaint for others.
Those patterns are too generic and likely generate many matches. That must be why you are running out of memory. Are you trying to pull out specific genes?
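If it is specific genes you are after, a memory-light alternative to grep is an exact lookup on the ID tag with awk. A sketch, assuming the IDs sit in an ID= tag in column 9 of B.gff and your ID list is A.txt (placeholder names):

# load the ID list into a hash, then print GFF lines whose ID= value is in it
$ awk -F'\t' 'NR==FNR { ids[$1]; next }
              match($9, /ID=[^;]+/) { if (substr($9, RSTART+3, RLENGTH-3) in ids) print }' A.txt B.gff > matches.gff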
Once you have posted a certain number of times (and gained rep points), that posting limit should go up.
Yes, I am trying to pull out certain gene IDs with their corresponding information. I posted in the comments above what my files look like. The thing is, it works fine for some and does not for others.
This section of the GNU parallel manual may help with automatically chunking one or both files: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallel-grep. The manual covers each situation (e.g., limited RAM, limited CPU, etc.).
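For example, splitting the ID list into chunks keeps each grep's pattern set small (placeholder file names again):

# feed the ID list in 1000-line chunks, one grep run per chunk
$ cat A.txt | parallel --pipe -L1000 grep -w -F -f - B.gff > matches.gff

A line of B.gff can match IDs in more than one chunk and then appear more than once, so the result may need a final sort -u.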