Question

Compare between two columns of two different files and print common

0

Entering edit mode

7.5 years ago

bk11 ★ 3.1k

Hi I have two files,

File1

Chr1 NC_029516.1 19961 - LRRC16B
Chr1 NC_029516.1 23408 - LOC107314597
Chr1 NC_029516.1 31174 + LOC107305688
Chr1 NC_029516.1 156107 - ACR
Chr1 NC_029516.1 193750 + LOC107319936
Chr1 NC_029516.1 457876 + SHANK3
Chr1 NC_029516.1 549801 - DENND6B
Chr1 NC_029516.1 575842 - LOC107323953

File2

Chr1    NC_029516.1     967     5379    -       RNF212B
Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688
Chr1    NC_029516.1     34904   43887   -       LOC107312337
Chr1    NC_029516.1     49744   56456   -       LOC107317046
Chr1    NC_029516.1     57601   93908   +       GOLGB1

I want to get result like this in File3

Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688

awk bash • 6.4k views

ADD COMMENT • link updated 3.7 years ago by Kevin Blighe 89k • written 7.5 years ago by bk11 ★ 3.1k

score 5 · Answer 1 · 2017-10-17

5

Entering edit mode

7.5 years ago

Kevin Blighe 89k

Edit July 26, 2021: this seems to have been up-voted recently. As I look back at my own answer, I must add to be wary of field-separators (delimiters). Make use of -F or FS to appropriately specify the delimiter, like:

awk -F "\t" ...

or

awk '{...}' FS="\t" File1 FS=" " File2 # different field-separator for both files

Original answer:

This appears to work:

awk 'NR==FNR {end[$3]; next} ($4 in end)' File1 File2
Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688

ADD COMMENT • link 3.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, It worked! I am new to awk. Could you please explain how this code works? Thanks.

ADD REPLY • link 7.5 years ago by bk11 ★ 3.1k

1

Entering edit mode

Hi!

First, it stores all of File1, Column #3's values into an array called 'end'
Then, 'next' instructs it to move onto File2
Finally, we only print lines in File 2 where Column #4 values are found in the array ('end') that we created from File1.

You could also do it by matching the gene names.

ADD REPLY • link 7.5 years ago by Kevin Blighe 89k

score 1 · Answer 2 · 2017-10-17

Another working solution in awk:

 $ awk 'FNR==NR{a[$5]++;next}a[$6]' test1.txt test2.txt 
    Chr1    NC_029516.1     5341    19961   -       LRRC16B
    Chr1    NC_029516.1     20428   23408   -       LOC107314597
    Chr1    NC_029516.1     23722   31174   +       LOC107305688

bash solution:

$  tr -s " " < test2.txt | cut -f6 -d" " | while read CMD; do grep $CMD test1.txt; done
Chr1 NC_029516.1 19961 - LRRC16B
Chr1 NC_029516.1 23408 - LOC107314597
Chr1 NC_029516.1 31174 + LOC107305688

join solution:

$ join  -i -1 5 -2 6  <(sort -k5 test1.txt)  <(sort -k6 test2.txt) | sed 's/ /\t/g'
LOC107305688    Chr1    NC_029516.1 31174   +   Chr1    NC_029516.1 23722   31174   +
LOC107314597    Chr1    NC_029516.1 23408   -   Chr1    NC_029516.1 20428   23408   -
LRRC16B Chr1    NC_029516.1 19961   -   Chr1    NC_029516.1 5341    19961   -

score 0 · Answer 3 · 2017-10-17

0

Entering edit mode

7.5 years ago

pfs ▴ 280

What have you tried so far?

ADD COMMENT • link 7.5 years ago by pfs ▴ 280

0

Entering edit mode

I tried matching $5 of file1 with $6 of file2, but it didn't work.

 awk 'NR==FNR{a[$5]=$1;next}a[$6]{print $0 "\t" a[$6]}' file1 file2 >outfile

ADD REPLY • link 7.5 years ago by bk11 ★ 3.1k

score 0 · Answer 4 · 2017-10-17

0

Entering edit mode