15 months ago
Prangan
Hello Stars,
I have two files, list1.txt and list2.txt, which look like this:
cat list1.txt
AT4G38910 3:17541308-17542307
AT4G38910 3:17639717-17640716
AT4G24540 1:25400514-25401513
AT4G24540 1:3398359-3399358
AT1G27730 1:4463470-4463858
AT1G27730 1:10073550-10074358
cat list1.txt | wc -l
650000
and
cat list2.txt
MYB94 AT3G47600 3:17541308-17542307
VPS29 AT3G47810 3:17639717-17640716
GSTU17 AT1G10370 1:3398359-3399358
CYP71B29 AT1G13100 1:4463470-4463858
AT1G28660 AT1G28660 1:10073550-10074358
BPC5 AT4G38910 4:18147081-18147080
AGL24 AT4G24540 4:12674107-12675106
ZAT10 AT1G27730 1:9649324-9650323
cat list2.txt | wc -l
5000
I am trying to grep the list1 entries columnwise from the list2 entries to get an output like:
BPC5 AT4G38910 MYB94 AT3G47600 3:17541308-17542307
BPC5 AT4G38910 VPS29 AT3G47810 3:17639717-17640716
AGL24 AT4G24540 AT1G67750 AT1G67750 1:25400514-25401513
AGL24 AT4G24540 GSTU17 AT1G10370 1:3398359-3399358
ZAT10 AT1G27730 CYP71B29 AT1G13100 1:4463470-4463858
ZAT10 AT1G27730 AT1G28660 AT1G28660 1:10073550-10074358
For which I am doing:
for i in `cat list1.txt | awk '{print $1}'`; do grep $i list2.txt ; done | awk '{print $1,$2}' > l1.txt
for i in `cat list1.txt | awk '{print $2}'`; do grep $i list2.txt ; done > l2.txt
paste l1.txt l2.txt > results.txt
But I am aware that grep is unsuitable for this operation and is taking a lot of time to generate the output. I am looking for an alternative for doing this (maybe awk?) or maybe parallelizing this using xargs or parallel. Any help is highly appreciated.
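One way the awk alternative asked about here could look is a single pass that loads list2.txt into lookup tables and then streams list1.txt through them. This is only a sketch under some assumptions: columns are whitespace-separated, each gene ID (column 2) and each coordinate (column 3) occurs once in list2.txt (with duplicates, the last occurrence wins), and the function name lookup_edges is made up for illustration.

```shell
# Sketch: hash lookups in awk instead of one grep per list1 line.
# lookup_edges is a hypothetical helper name, not an existing tool.
# Usage: lookup_edges list2.txt list1.txt > results.txt
lookup_edges() {
    awk '
        # First file (alias table): index it by gene ID and by coordinate.
        NR == FNR {
            alias[$2] = $1            # gene ID -> gene name
            region[$3] = $1 " " $2    # coordinate -> "name ID" listed for it
            next
        }
        # Second file (edge list): print only when both lookups succeed.
        ($1 in alias) && ($2 in region) {
            print alias[$1], $1, region[$2], $2
        }
    ' "$1" "$2"
}
```

The output keeps the order of list1.txt, and lines whose gene ID or coordinate is missing from list2.txt are skipped rather than printed with blanks. Because list2.txt is read once into memory, the cost is one pass over each file instead of one grep per list1 line.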
You don't want grep, you want join: https://linux.die.net/man/1/join
Thanks for the reply. But in my case, both columns in list1.txt contain repetitions, and the column 1-column 2 pairs in list1.txt (which are meant to be network edges) do not correspond to the elements in list2.txt (which is kind of an alias file for the network). What I want is:
I am not sure if join can make that happen, considering I have repetitions in list1. Again, thanks for the help.
join handles repeats:
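A small sketch of that behaviour, using made-up data rather than the poster's files (the scratch directory, the file names edges.txt and alias.txt, and the coordinates are all hypothetical). The key AT4G38910 appears twice in the first file and once in the second, and join pairs every occurrence with its alias:

```shell
# Keep sort and join on the same collation order.
export LC_ALL=C

# Hypothetical demo data in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' 'AT4G38910 3:100-200' 'AT4G38910 3:300-400' 'AT4G24540 1:500-600' > "$demo/edges.txt"
printf '%s\n' 'AT4G38910 BPC5' 'AT4G24540 AGL24' > "$demo/alias.txt"

# join expects both inputs sorted on the join field (field 1 here).
sort -k1,1 "$demo/edges.txt" > "$demo/edges.sorted"
sort -k1,1 "$demo/alias.txt" > "$demo/alias.sorted"

# Every line with a repeated key is paired with the matching alias line.
joined=$(join "$demo/edges.sorted" "$demo/alias.sorted")
printf '%s\n' "$joined"
# AT4G24540 1:500-600 AGL24
# AT4G38910 3:100-200 BPC5
# AT4G38910 3:300-400 BPC5
```

Note that the output comes back in sorted key order, not in the original order of the edge list.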
Side note: backticks are a legacy way of performing command substitution. It's time to move on to the $() form, which is a lot more elegant and, what's more, can be nested. Also, read up on UUoC (Useless Use of cat). You could literally just use cut -f1 list1.txt instead of cat list1.txt | awk '{print $1}' (add -d' ' if the columns are separated by single spaces rather than tabs, since cut splits on tabs by default).
Why don't you do a left join in R or Python?
Why reinvent the wheel?
Unix join is good enough for simple use cases. R/Python are better for more complicated cases or for re-runnable pipelines.