Question

How to write a script to grep information from another file without losing location information (keeping the original order)

7.3 years ago

Anny ▴ 30

Hi all

I am too new to programming to solve this problem myself. I have two files: file1 contains the index, and file2 contains the information I want.

 1. file1

CU_91
CU_495
CW_79
CU_22
CW_42

 2. file2

CW_79   protein1
CW_15  protein2
CW_16  protein3
CW_17   protein4
CW_42   protein5

I want to add the extra information from file2 to file1 without changing the order of file1, as follows. How could I do that?

CU_91
CU_495
CW_79  protein1
CU_22
CW_42 protein5

Thank you!

Alexie

linux script • 2.1k views

ADD COMMENT • link updated 7.3 years ago by Pierre Lindenbaum 164k • written 7.3 years ago by Anny ▴ 30

score 3 · Answer 1 · 2017-08-01

7.3 years ago

russhh 5.7k

What you've described is a left outer join of the data in file1 with the data in file2. Have a look at the join command (example). If your file2 is tab-separated, I think you can do the following:

join -t $'\t' -a 1 file1 file2

For example,

echo -e "A\nB\nC" > f1
cat f1
A
B
C

echo -e "A\tP1\nC\tP2\nD\tP3" > f2
cat f2
A    P1
C    P2
D    P3



join -t $'\t' -a 1 f1 f2
A    P1
B
C    P2

The syntax for specifying the separator is a bit awkward, IMO.

ADD COMMENT • link 7.3 years ago by russhh 5.7k

Hi russhh!

Thank you for your help.

I tried this method but failed to get the result. I think there are two problems: 1) I can't sort file1, since I need the order information. 2) For some reason, my system does not recognize "join -t $'\t'" and gives the error message "join: illegal tab character specification". I converted the spaces in file2 to tabs with the command sed 's/ /\t/g'.

ADD REPLY • link 7.3 years ago by Anny ▴ 30
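The error message suggests that join received more than one character after -t, which is what happens when the shell does not expand $'\t' into a tab. That is an assumption, since the thread does not say which shell or join implementation was in use. A minimal sketch of a more portable way to pass the tab, using the small f1/f2 example from the answer (the temporary directory and the variable names TAB and joined are only for illustration):

```shell
#!/bin/sh
# Work in a throwaway directory so the example is self-contained.
workdir=$(mktemp -d)
printf 'A\nB\nC\n' > "$workdir/f1"
printf 'A\tP1\nC\tP2\nD\tP3\n' > "$workdir/f2"

# Build a literal tab with printf instead of relying on $'\t',
# which not every shell understands.
TAB=$(printf '\t')

# Options go before the file names; -a 1 also prints unpaired lines of f1.
joined=$(join -t "$TAB" -a 1 "$workdir/f1" "$workdir/f2")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

This only addresses the separator problem. Both inputs still have to be sorted on the join field, so it does not by itself preserve the order of an unsorted file1.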

I believe 'join' requires the input to be sorted, but Alexie wants to maintain the order.

I don't know of a good way to do it that doesn't require writing a program and keeping stuff in memory (or something similar).

ADD REPLY • link 7.3 years ago by Malcolm ▴ 10
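The in-memory approach mentioned above can be quite short in awk. A minimal sketch, assuming file2 is tab-separated, has unique IDs in its first column, and fits in memory; the temporary directory, the array name info, and the variable result are only there to make the example self-contained:

```shell
#!/bin/sh
# Recreate the two files from the question in a throwaway directory.
workdir=$(mktemp -d)
printf 'CU_91\nCU_495\nCW_79\nCU_22\nCW_42\n' > "$workdir/file1"
printf 'CW_79\tprotein1\nCW_15\tprotein2\nCW_16\tprotein3\nCW_17\tprotein4\nCW_42\tprotein5\n' > "$workdir/file2"

# While reading the first file argument (file2), NR equals FNR:
#   store the annotation of each ID in an array.
# While reading the second file argument (file1):
#   print each ID in its original order, plus the annotation if there is one.
result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR    { info[$1] = $2; next }
    ($1 in info) { print $1, info[$1]; next }
                 { print $1 }
' "$workdir/file2" "$workdir/file1")

printf '%s\n' "$result"

rm -rf "$workdir"
```

Because file1 is streamed line by line and never sorted, its order is kept without any extra bookkeeping. Only file2 is held in memory.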

score 3 · Answer 2 · 2017-08-01

7.3 years ago

Pierre Lindenbaum 164k

Assuming the tab is the delimiter: the first awk adds the line number to each line of the first file, so that the original order can be restored after the join.

 join -t $'\t' -a 1 -1 2 -2 1  \
         <(awk '{printf("%d\t%s\n",NR,$1);}' file.1  | sort -t $'\t' -k2,2) \
         <(sort -t $'\t' -k1,1 file.2) |\
      sort -t $'\t' -k2,2n | cut -f 1,3

CU_91
CU_495
CW_79   protein1
CU_22
CW_42   protein5
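The pipeline above can also be read as four separate steps: number the lines, sort both inputs on the ID, join, then sort back by line number. A sketch of the same idea with intermediate files instead of process substitution, so that it does not depend on bash. The file names ids.sorted, annot.sorted and joined, and setting LC_ALL=C so that sort and join agree on the ordering, are choices made for this illustration:

```shell
#!/bin/sh
# Use one collation for both sort and join.
LC_ALL=C; export LC_ALL
TAB=$(printf '\t')

workdir=$(mktemp -d)
printf 'CU_91\nCU_495\nCW_79\nCU_22\nCW_42\n' > "$workdir/file.1"
printf 'CW_79\tprotein1\nCW_15\tprotein2\nCW_16\tprotein3\nCW_17\tprotein4\nCW_42\tprotein5\n' > "$workdir/file.2"

# 1. Prefix each ID with its line number, then sort on the ID (column 2).
awk '{ printf("%d\t%s\n", NR, $1) }' "$workdir/file.1" |
    sort -t "$TAB" -k2,2 > "$workdir/ids.sorted"

# 2. Sort the annotation file on the ID (column 1).
sort -t "$TAB" -k1,1 "$workdir/file.2" > "$workdir/annot.sorted"

# 3. Left join; each output line is: ID, line number, annotation (if any).
join -t "$TAB" -a 1 -1 2 -2 1 \
    "$workdir/ids.sorted" "$workdir/annot.sorted" > "$workdir/joined"

# 4. Restore the original order by line number, then drop that column.
ordered=$(sort -t "$TAB" -k2,2n "$workdir/joined" | cut -f 1,3)
printf '%s\n' "$ordered"

rm -rf "$workdir"
```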