Alternative to grep in a for loop
15 months ago
Prangan ▴ 20

Hello Stars,

I have two files list1.txt and list2.txt which look like this:

cat list1.txt

AT4G38910 3:17541308-17542307
AT4G38910 3:17639717-17640716
AT4G24540 1:25400514-25401513
AT4G24540 1:3398359-3399358
AT1G27730 1:4463470-4463858
AT1G27730 1:10073550-10074358

cat list1.txt | wc -l
650000

and

cat list2.txt
MYB94 AT3G47600 3:17541308-17542307
VPS29 AT3G47810 3:17639717-17640716
GSTU17 AT1G10370 1:3398359-3399358
CYP71B29 AT1G13100 1:4463470-4463858
AT1G28660 AT1G28660 1:10073550-10074358
BPC5 AT4G38910 4:18147081-18147080
AGL24 AT4G24540 4:12674107-12675106
ZAT10 AT1G27730 1:9649324-9650323

cat list2.txt | wc -l
5000

I am trying to look up each column of list1.txt in list2.txt to get output like:

BPC5 AT4G38910 MYB94 AT3G47600 3:17541308-17542307
BPC5 AT4G38910 VPS29 AT3G47810 3:17639717-17640716
AGL24 AT4G24540 AT1G67750 AT1G67750 1:25400514-25401513
AGL24 AT4G24540 GSTU17 AT1G10370 1:3398359-3399358
ZAT10 AT1G27730 CYP71B29 AT1G13100 1:4463470-4463858
ZAT10 AT1G27730 AT1G28660 AT1G28660 1:10073550-10074358

For which I am doing:

for i in `cat list1.txt | awk '{print $1}'`; do grep $i list2.txt ; done | awk '{print $1,$2}' > l1.txt
for i in `cat list1.txt | awk '{print $2}'`; do grep $i list2.txt ; done > l2.txt
paste l1.txt l2.txt > results.txt

I know that running grep once per line is unsuitable for this, and it takes a long time to generate the output. I am looking for an alternative (maybe awk?), or a way to parallelize this with xargs or parallel. Any help is highly appreciated.

You don't want grep, you want join: https://linux.die.net/man/1/join

Thanks for the reply. But in my case, both the columns in list1.txt contain repetitions, and the column1-column2 elements (which are meant to be network edges) in list1.txt do not correspond to the elements in list2.txt (which is kinda like an alias file for the network). What I want is:

  1. for each entry of column1 (list1.txt), if entry matches with column2 (of list2.txt), then print column1 and column2 of list2.txt > output1
  2. for each entry of column2 (list1.txt), if entry matches with column3 (of list2.txt), then print all columns of list2.txt > output2
  3. merge output1 & output2 to produce a 5 column output3

I am not sure if join can make that happen, considering I have repetitions in list1. Again, thanks for the help.
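
For what it's worth, the three steps above could be sketched in a single awk pass that reads list2.txt into two lookup tables and then streams list1.txt. This is only a sketch under some assumptions: the files are whitespace-delimited, matches are exact (unlike grep, which matches substrings), and each ID or coordinate occurs at most once in list2.txt (otherwise the last row wins). The array names and the temporary directory are made up for illustration.

```shell
# Sample rows copied from the question; the real files are far larger.
workdir=$(mktemp -d)
cat > "$workdir/list1.txt" <<'EOF'
AT4G38910 3:17541308-17542307
AT4G24540 1:3398359-3399358
AT1G27730 1:4463470-4463858
EOF
cat > "$workdir/list2.txt" <<'EOF'
MYB94 AT3G47600 3:17541308-17542307
GSTU17 AT1G10370 1:3398359-3399358
CYP71B29 AT1G13100 1:4463470-4463858
BPC5 AT4G38910 4:18147081-18147080
AGL24 AT4G24540 4:12674107-12675106
ZAT10 AT1G27730 1:9649324-9650323
EOF

result=$(awk '
  # First file (list2.txt): index every row twice.
  NR == FNR {
    alias_by_id[$2] = $1 " " $2   # column 2 -> "name ID"
    row_by_coord[$3] = $0         # column 3 -> whole row
    next
  }
  # Second file (list1.txt): print only when both lookups succeed.
  ($1 in alias_by_id) && ($2 in row_by_coord) {
    print alias_by_id[$1], row_by_coord[$2]
  }
' "$workdir/list2.txt" "$workdir/list1.txt")

printf '%s\n' "$result"
```

Because each list1.txt row is resolved on its own, the two halves of every output line stay in sync, which the paste of two separate grep loops cannot guarantee when a lookup returns zero or several lines.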

join handles repeats:

$ join -t. -1 1 -2 1 <(echo -e "A.1\nA.2\nA.3\nB.4") <(echo -e "A.X\nA.Y\nA.Z\nB.X")
A.1.X
A.1.Y
A.1.Z
A.2.X
A.2.Y
A.2.Z
A.3.X
A.3.Y
A.3.Z
B.4.X
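
Applied to the files in the question, the same idea might look like the two chained joins below. Treat it as a sketch, not a tested pipeline for the full data: it assumes whitespace-delimited files, forces the C locale so that sort and join agree on ordering, and uses made-up intermediate file names. Note that the output comes back sorted by coordinate, not in the original list1.txt order.

```shell
export LC_ALL=C   # sort and join must use the same collation
workdir=$(mktemp -d)
cat > "$workdir/list1.txt" <<'EOF'
AT4G38910 3:17541308-17542307
AT4G24540 1:3398359-3399358
AT1G27730 1:4463470-4463858
EOF
cat > "$workdir/list2.txt" <<'EOF'
MYB94 AT3G47600 3:17541308-17542307
GSTU17 AT1G10370 1:3398359-3399358
CYP71B29 AT1G13100 1:4463470-4463858
BPC5 AT4G38910 4:18147081-18147080
AGL24 AT4G24540 4:12674107-12675106
ZAT10 AT1G27730 1:9649324-9650323
EOF

# join needs each input sorted on its own join field.
sort -b -k1,1 "$workdir/list1.txt" > "$workdir/l1.by_id"
sort -b -k2,2 "$workdir/list2.txt" > "$workdir/l2.by_id"
sort -b -k3,3 "$workdir/list2.txt" > "$workdir/l2.by_coord"

# Join 1: list1 col 1 against list2 col 2 -> name, ID, coordinate.
# Join 2: that coordinate against list2 col 3 -> the 5-column result.
result=$(join -1 1 -2 2 -o 2.1,0,1.2 "$workdir/l1.by_id" "$workdir/l2.by_id" |
  sort -b -k3,3 |
  join -1 3 -2 3 -o 1.1,1.2,2.1,2.2,0 - "$workdir/l2.by_coord")

printf '%s\n' "$result"
```

Repeated keys in either file produce every pairing, as in the A/B example above, so duplicates in list2.txt would multiply the output rows.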

Side note: backticks are a legacy form of command substitution. Prefer $(...), which is easier to read and can be nested without escaping. Also read up on UUoC (useless use of cat): awk '{print $1}' list1.txt does the same job as cat list1.txt | awk '{print $1}', and so does cut -f1 list1.txt if the file is tab-delimited (cut splits on tabs by default; use cut -d' ' -f1 for single spaces).
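
A small illustration of those points, as a sketch with made-up variable names and a throwaway temp directory:

```shell
workdir=$(mktemp -d)
printf 'AT4G38910 3:17541308-17542307\nAT4G24540 1:3398359-3399358\n' \
  > "$workdir/list1.txt"

# awk opens the file itself, so no leading cat is needed.
ids=$(awk '{print $1}' "$workdir/list1.txt")

# $() nests without escaping: the inner call runs first.
parent=$(basename "$(dirname "$workdir/list1.txt")")

# A while-read loop walks the file line by line instead of
# word-splitting the output of a command substitution.
count=0
while read -r id rest; do
  count=$((count + 1))
done < "$workdir/list1.txt"

printf '%s\n' "$ids" "$parent" "$count"
```

With backticks, the nested call would need escaped inner backticks, which is where most of the readability is lost.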

Why not do a left join in R or Python?

Why reinvent the wheel?

Unix join is good enough for simple use cases. R or Python is better for more complicated cases or for re-runnable pipelines.
