Compare between two columns of two different files and print common
4
0
Entering edit mode
7.2 years ago
bk11 ★ 3.0k

Hi I have two files,

File1

Chr1 NC_029516.1 19961 - LRRC16B
Chr1 NC_029516.1 23408 - LOC107314597
Chr1 NC_029516.1 31174 + LOC107305688
Chr1 NC_029516.1 156107 - ACR
Chr1 NC_029516.1 193750 + LOC107319936
Chr1 NC_029516.1 457876 + SHANK3
Chr1 NC_029516.1 549801 - DENND6B
Chr1 NC_029516.1 575842 - LOC107323953

File2

Chr1    NC_029516.1     967     5379    -       RNF212B
Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688
Chr1    NC_029516.1     34904   43887   -       LOC107312337
Chr1    NC_029516.1     49744   56456   -       LOC107317046
Chr1    NC_029516.1     57601   93908   +       GOLGB1

I want to get result like this in File3

Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688
awk bash • 6.2k views
ADD COMMENT
5
Entering edit mode
7.2 years ago

Edit July 26, 2021: this seems to have been up-voted recently. As I look back at my own answer, I must add to be wary of field-separators (delimiters). Make use of -F or FS to appropriately specify the delimiter, like:

awk -F "\t" ...

or

awk '{...}' FS="\t" File1 FS=" " File2 # different field-separator for both files


Original answer:

This appears to work:

awk 'NR==FNR {end[$3]; next} ($4 in end)' File1 File2
Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688
ADD COMMENT
0
Entering edit mode

Hi Kevin, It worked! I am new to awk. Could you please explain how this code works? Thanks.

ADD REPLY
1
Entering edit mode

Hi!

  1. First, it stores all of File1, Column #3's values into an array called 'end'
  2. Then, 'next' instructs it to move onto File2
  3. Finally, we only print lines in File 2 where Column #4 values are found in the array ('end') that we created from File1.

You could also do it by matching the gene names.

ADD REPLY
1
Entering edit mode
7.2 years ago

Another working solution in awk:

 $ awk 'FNR==NR{a[$5]++;next}a[$6]' test1.txt test2.txt 
    Chr1    NC_029516.1     5341    19961   -       LRRC16B
    Chr1    NC_029516.1     20428   23408   -       LOC107314597
    Chr1    NC_029516.1     23722   31174   +       LOC107305688

bash solution:

$  tr -s " " < test2.txt | cut -f6 -d" " | while read CMD; do grep $CMD test1.txt; done
Chr1 NC_029516.1 19961 - LRRC16B
Chr1 NC_029516.1 23408 - LOC107314597
Chr1 NC_029516.1 31174 + LOC107305688

join solution:

$ join  -i -1 5 -2 6  <(sort -k5 test1.txt)  <(sort -k6 test2.txt) | sed 's/ /\t/g'
LOC107305688    Chr1    NC_029516.1 31174   +   Chr1    NC_029516.1 23722   31174   +
LOC107314597    Chr1    NC_029516.1 23408   -   Chr1    NC_029516.1 20428   23408   -
LRRC16B Chr1    NC_029516.1 19961   -   Chr1    NC_029516.1 5341    19961   -
ADD COMMENT
0
Entering edit mode
7.2 years ago
pfs ▴ 280

What have you tried so far?

ADD COMMENT
0
Entering edit mode

I tried matching $5 of file1 with $6 of file2, but it didn't work.

 awk 'NR==FNR{a[$5]=$1;next}a[$6]{print $0 "\t" a[$6]}' file1 file2 >outfile
ADD REPLY
0
Entering edit mode
7.2 years ago
pfs ▴ 280

Untested but something like this should work:

cut -f3 file1 > positions.txt; grep -wF position.txt file2 > File3

ADD COMMENT

Login before adding your answer.

Traffic: 2084 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6