Question

Join two files by matching columns

0

Entering edit mode

3.4 years ago

er.doug.ragnar ▴ 30

Hi

I am trying to join two files, one has 4 columns the other 3. I want to match the third column of the first to the second column on the second file and add the 3rd column of the second file to the first, when the data from each column match. I managed to do it but only when I output all columns from second file. Like this:

join -t' ' -1 3 -2 2 -o 1.1,1.2,1.3,1.4,2.1 join_2.txt join_1.txt > join_try_3.txt

If I run this, it doesn't work:

join -t' ' -1 3 -2 2 -o 1.1,1.2,1.3,1.4,2.3 join_2.txt join_1.txt > join_try_3.txt

Any ideas why? And suggestions to only add the the 3rd column of the second file would be appreciated.

Thanks in advance.

join linux • 1.5k views

ADD COMMENT • link 3.4 years ago by er.doug.ragnar ▴ 30

0

Entering edit mode

how is it related to bioinformatics ? (and BTW how did you sort join_* ?)

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

The 2 files are blast results in output format 6, I sorted each file by the columns I wanted to match.

ADD REPLY • link 3.4 years ago by er.doug.ragnar ▴ 30

0

Entering edit mode

The 2 files are blast results in output format 6, I sorted each file by the columns I wanted to match.

ADD REPLY • link 3.4 years ago by er.doug.ragnar ▴ 30

0

Entering edit mode

I sorted each file by the columns I wanted to match.

show us the command line please.

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

The previous values of the columns were of test files I created to practice on them before doing it on my files. My actual is the following: I checked my files and the blast results I think all lines are correct for join, lets call this file A. That file is like this:

 WP_002438140.1  1|3440624|6|Catalog-(metahit-v2)|MC3.MG6.AS1.GP1.C47040.G4|aac6    40.506  158     93      1       2       158     3       160     4.02e-38        126
 WP_002438140.1  1|1186628|6|Catalog-(metahit-v2)|MC3.MG2.AS1.GP1.C18992.G2|aac6    39.623  159     94      2       1       158     17      174     8.48e-29        102 
WP_002438140.1  1|1998910|6|Catalog-(metahit-v2)|MC3.MG293.AS1.GP1.C1957.G20|aac6    38.889  162     97      2       1       161     1       161     6.13e-28        100 
WP_002438140.1  1|1715055|6|Catalog-(metahit-v2)|MC3.MG262.AS1.GP1.C7747.G3|aac6    35.625  160     98      4       2       158     7       164     4.24e-26        95.9
WP_002438140.1  1|1459548|6|Catalog-(metahit-v2)|MC3.MG232.AS1.GP1.C19886.G1|aac6    34.320  169     103     2       2       163     6       173     1.92e-24        92.0 
WP_002438140.1  1|1444795|6|Catalog-(metahit-v2)|MC3.MG229.AS1.GP1.C15873.G17|aac6    35.152  165     105     1       2       164     14      178     2.56e-24        91.7 
WP_002438140.1  1|2947214|6|Catalog-(metahit-v2)|MC3.MG373.AS1.GP1.C76048.G3|aac6    30.380  158     109     1       1       158     1       157     1.41e-23        89.0 
WP_002438140.1  1|3816016|6|Catalog-(metahit-v2)|MC3.MG96.AS1.GP1.C24897.G23|aac6    31.013  158     108     1       1       158     1       157     8.44e-23        87.0 
WP_002438140.1  1|735451|6|Catalog-(metahit-v2)|MC3.MG148.AS1.GP1.C6414.G2|aac6    32.515  163     101     3       1       158     1       159     1.15e-22        87.0 
WP_002438140.1  1|3779723|6|Catalog-(metahit-v2)|MC3.MG94.AS1.GP1.C579.G4|aac6    31.707  164     105     2       1       161     32      191     1.41e-22        87.0

The other file, lets call it file B, some lines do not have all the columns :

    GCF_902459825.1 chromosome      1       1188    WP_156068365.1  WP_156068365.1  1187
    GCF_902459825.1 chromosome      1716    2924    WP_010950321.1  WP_010950321.1  1208
    GCF_902459825.1 chromosome      2944    4101    WP_003400273.1  WP_003400273.1  1157
    GCF_902459825.1 chromosome      4098    4661    WP_003899769.1  WP_003899769.1  563
    GCF_902459825.1 chromosome      4904    6931    WP_031652105.1  WP_031652105.1  2027
    GCF_902459825.1 chromosome      6966    9482    WP_003400286.1  WP_003400286.1  2516
    GCF_902459825.1 chromosome      9578    10492   WP_003400291.1  WP_003400291.1  914
    GCF_902459825.1 chromosome      10889   11080   WP_003400294.1  WP_003400294.1  191
    GCF_902459825.1 chromosome      11219   11356   WP_003400297.1  WP_003400297.1  137
    GCF_902459825.1 chromosome      11538   11975   WP_003400307.1  WP_003400307.1  437
    GCF_902459825.1 chromosome      12132   12680   WP_003400321.1  WP_003400321.1  548
    GCF_902459825.1 chromosome      12797   13221                   424
    GCF_902459825.1 chromosome      13377   13658   WP_003400344.1  WP_003400344.1  281
    GCF_902459825.1 chromosome      13752   14540   WP_003902799.1  WP_003902799.1  788
    GCF_902459825.1 chromosome      14577   15275   WP_003400352.1  WP_003400352.1  698
    GCF_902459825.1 chromosome      15253   17133   WP_003400356.1  WP_003400356.1  1880
    GCF_902459825.1 chromosome      17130   18425   WP_003400358.1  WP_003400358.1  1295

I want to match the fifth column in file B to the 1st column in file A and when it matches add column 7 to file A. Should I use an AWK instead of join?

ADD REPLY • link 3.4 years ago by er.doug.ragnar ▴ 30

0

Entering edit mode

show us the command line please.

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

S530FN:~$ join -t' ' -1 1 -2 5 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.7 join_2.txt join_1.txt > join_try_3.txt

It gives this error:

join: invalid file number in field spec: ‘sample1_blast.out

ADD REPLY • link 3.4 years ago by er.doug.ragnar ▴ 30

0

Entering edit mode

my question 'show us the command line' was about how you sorted both files. Furthermore, from what you showed above, it's hard to say if you used a space or a tab. At this point, I'm leaving this thread.

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Well I did not understand, I thought it was the command I was having issues with .

But here is the sort commands:

sort -k 1 sample1_blast.out 

sort -k 5 sample2_my_files.txt

ADD REPLY • link 3.4 years ago by er.doug.ragnar ▴ 30