Question

How To Join Text Files By A Certain Column (Accession Id)

3

Entering edit mode

12.2 years ago

2011101101 ▴ 110

I don't know how to say it. I have an example and I hope everyone understand me. There two files,1.txt,2.txt . The 1.txt is like this.

gi|25254180|gb|CA674435.1|CA674435    gi|345895254|ref|YP_004841985.1|    96.30    108    4    0    22    345    16    123    4e-51     204
gi|93045967|gb|CJ563048.1|CJ563048    gi|225427470|ref|XP_002270678.1|    63.27    98    34    1    617    330    124    221    3e-48     141
gi|39557933|gb|CK195543.1|CK195543    gi|357139181|ref|XP_003571163.1|    87.95    83    10    0    739    491    64    146    1e-35     148

The 2.txt is this

 gi|93045967|gb|CJ563048.1|CJ563048

The result file is 3.txt ,it is this

gi|93045967|gb|CJ563048.1|CJ563048    gi|225427470|ref|XP_002270678.1|    63.27    98    34    1    617    330    124    221    3e-48     141

That's all .Thank you .

extraction • 6.8k views

ADD COMMENT • link updated 3.4 years ago by xiaoleiusc ▴ 140 • written 12.2 years ago by 2011101101 ▴ 110

score 13 · Answer 1 · 2013-02-04

13

Entering edit mode

12.2 years ago

lh3 33k

In this forum, it is a frequent pitfall to use grep only with -f. I used to give my answer Retrieve Single Vcf From List Of Snp Rs#, but I feel it would be better to expand that answer a little bit.

grep has an option -f. According to its manpage: -f FILE: Obtain patterns from FILE, one per line. Each line is interpreted as a regex pattern. I speculate in that case, grep -f will load all the patterns in RAM and for each input line it attempts each pattern in turn. Thus if there are 1000 lines in the pattern file, grep -f will be 1000 times as slow as grepping one pattern only (i.e. linear in the number of patterns).

grep also has an option -F, which is equivalent to fgrep. fgrep uses the Aho–Corasick algorithm, an algorithm that constructs one DFA for all the pattern strings. When grep push an input line through the DFA, all patterns are attempted at the same time. Thus the speed of grep -F is largely independent of the number of patterns (i.e. constant to the number of patterns). For fixed string matching, we should use grep -Ff; otherwise the speed will be extremely slow (see the benchmark in Retrieve Single Vcf From List Of Snp Rs#). In addition, it is recommended to add option -w to reduce false matching.

I have fallen to this pitfall for many years myself. I used to ask my colleagues not to use grep -f because it is slow. But one day, I suddenly felt something wrong: Aho–Corasick algorithm should be as fast as grepping one pattern; why grep -f is so slow? When I read the manpage, it became obvious that I was misusing grep for a long time. On the other hand, I have to say the design of both -f and -F is really confusing.

EDIT: the approximate answer is:

grep -wFf 2.txt 1.txt > 3.txt

which works most of time, but may have false positives in rare cases where the patterns appear in other columns. The accurate answer is:

awk 'BEGIN{while((getline<"2.txt")>0)l[$1]=1}l[$1]' 1.txt > 3.txt

which works all the time.

ADD COMMENT • link 12.2 years ago by lh3 33k

0

Entering edit mode

grep would be a nicer design if -w didn't break up words except on whitespace. The '|' and other common symbols are word breaks. I agree, it's easy to be burned. I just wish -w was a bit different, or configurable in some way. It would make scenarios like these easier.

ADD REPLY • link 12.2 years ago by sjneph ▴ 690

0

Entering edit mode

You say here to add -w to reduce false positive matches, but in the linked answer you say it may create false positive matches. That is confusing, it seems like the former would be correct and I don't understand how this changes the output if -F is treating input as fixed strings. Seems like -F alone should do it, but it's hard to tell looking at one example.

ADD REPLY • link 12.2 years ago by SES 8.6k

1

Entering edit mode

The -F can match shorter strings, and the -w could fix that if it behaved more nicely. If the file has "hugabear papa" and you use something like grep -F -e "huga" file

It'll match. The -w will enforce that an entire word matches: You'd need to specify -e "hugabear" (or -e "papa") to match. As I stated before, -w has limitations in that it breaks up words on many non-whitespace characters, which essentially degenerates back to using just the -F (with the same match-on-shorter-words problem).

ADD REPLY • link 12.2 years ago by sjneph ▴ 690

0

Entering edit mode

Good example, I can see clearly how the -w matches differently now.

ADD REPLY • link 12.2 years ago by SES 8.6k

0

Entering edit mode

if I have a header in both 1.txt and 2.txt, does this command still work?

ADD REPLY • link 3.4 years ago by xiaoleiusc ▴ 140

score 4 · Answer 2 · 2013-02-05

4

Entering edit mode

12.2 years ago

Sebastian Kurscheid ▴ 300

Just to add another potential solution [making the assumption that we are continuing to live in a UNIX world]:

Command:

sort -k 1 1.txt | join 2.txt -

Result:

gi|93045967|gb|CJ563048.1|CJ563048 gi|225427470|ref|XP_002270678.1| 63.27 98 34 1 617 330 124 221 3e-48 141

HTH

If 2.txt has more than one entry it is necessary to also sort it by its key (e.g. first column) by executing

sort -k 1 2.txt > 2.sorted.txt

and modifying above command to:

sort -k 1 1.txt | join 2.sorted.txt -

ADD COMMENT • link 12.2 years ago by Sebastian Kurscheid ▴ 300

0

Entering edit mode

This problem is so simple to solve with sort and join that I wouldn't be arsed messing around with grep and fgrep, personally

ADD REPLY • link 12.2 years ago by Daniel ★ 4.0k

0

Entering edit mode

Typically, 1.txt is a huge file or is already sorted by other fields. Sorting 1.txt back and forth is clumsy, inefficient and may be quite complicated (e.g. in case of VCF, you wouldn't want to sort the header). Unless 2.txt is too large to fit in RAM, the right solution is to load 2.txt in memory and filter 1.txt line by line. Join only becomes better when 2.txt is too large, which rarely happens. As to grep, that is the most efficient solution if you are sure there are no false positives.

ADD REPLY • link 12.2 years ago by lh3 33k

0

Entering edit mode

Good comment regarding the memory efficiency, but there are quite a lot of assumptions that you are making about the structure/organisation of the data. If every computational problem would follow "typical" cases we bioinformaticians would be pretty much out of work. We were given a very specific question, with very limited data, and several solutions to this particular problem exist. This is one of them.

ADD REPLY • link 12.2 years ago by Sebastian Kurscheid ▴ 300

1

Entering edit mode

Your answer is fine. I have no problems with that. I was responding to the previous comment. For this question, grep/awk is simpler and more efficient. Recommending sort/join for this question while criticizing grep is misleading. BTW, a better version of your answer is: sort -k1 1.txt | join <(sort 2.txt) -.

ADD REPLY • link 12.2 years ago by lh3 33k

0

Entering edit mode

Sorry, I obviously misread your comment - and yes, it's a better version :-)

ADD REPLY • link 12.2 years ago by Sebastian Kurscheid ▴ 300

0

Entering edit mode

I wasn't trying to criticise grep, only that in cases such as this I find it much easier with sort and join etc. as it's straight forward and does what it says on the tin. I wasn't considering efficiencies as I've never found myself waiting any prolonged time so as to consider an alternative.

ADD REPLY • link 12.2 years ago by Daniel ★ 4.0k

score 1 · Answer 3 · 2013-02-04

1

Entering edit mode

12.2 years ago

SES 8.6k

Here is one solution (edit: see also Retrieve Single Vcf From List Of Snp Rs# for an explanation of -F [fgrep]):

fgrep -f 2.txt 1.txt > 3.txt

ADD COMMENT • link 12.2 years ago by SES 8.6k

0

Entering edit mode

also add a -F to make things faster since you're matching on fixed strings (and not regular expressions) fgrep -F -f 2.txt 1.txt > 3.txt

ADD REPLY • link 12.2 years ago by sjneph ▴ 690

1

Entering edit mode

I think fgrep searches fixed strings from -f file. Same as grep -Ff, am I wrong?

ADD REPLY • link 12.2 years ago by SES 8.6k

0

Entering edit mode

Ah, yes, you are correct: fgrep is the same as grep -Ff. I often forget the various related grep-ish family programs. Sctoopid sjneph. Good catch.

ADD REPLY • link 12.2 years ago by sjneph ▴ 690

score 1 · Answer 4 · 2013-02-05

1

Entering edit mode

12.2 years ago

2011101101 ▴ 110

Thank you .I have get the answer fgrep -w -Ff 2.txt 1.txt > 3.txt

ADD COMMENT • link 12.2 years ago by 2011101101 ▴ 110

0

Entering edit mode

If you end up going with this specific solution you'll probably want to leave off the -w, as discussed above.

ADD REPLY • link 12.2 years ago by SES 8.6k

0

Entering edit mode

Why? I think we should use -w. If 1.txt contains a line abcgi|93045967|gb|CJ563048.1|CJ563048, grep -wFf will not match, but grep -Ff will give a false match.

ADD REPLY • link 12.2 years ago by lh3 33k

0

Entering edit mode

I was concerned that it would break up the words based on previous comments but I guess fgrep will enforce that it doesn't?

ADD REPLY • link 12.2 years ago by SES 8.6k