How to join two files into one in multiple folders using specific columns and add file names where they originated from
1
0
Entering edit mode
2.8 years ago
Roland • 0

Is it possible to join multiple files into one, using a specific column, and name it by the originating folder. Let me explain: I have the following files, having multiple columns and rows:

File one in path: /Bins/Bin_1/Bin_1-gene-names.txt (some of the rows are empty in specific columns):

gene_callers_id contig  start   stop    direction   COG20_CATEGORY  COG20_CATEGORY (ACCESSION)  KOfam   KOfam (ACCESSION)   Pfam    Pfam (ACCESSION)    COG20_FUNCTION  COG20_FUNCTION (ACCESSION)  KEGG_Class  KEGG_Class (ACCESSION)  KEGG_Module KEGG_Module (ACCESSION) COG20_PATHWAY   COG20_PATHWAY (ACCESSION)   dna_sequence
348 c_000000001641  92  542 r                                                           ATGG
349 c_000000001641  722 1607    f   Coenzyme transport and metabolism!!!Defense mechanisms  H!!!V   S-adenosyl-L-methionine hydrolase (adenosine-forming) [EC:3.13.1.8] K22205  S-adenosyl-l-methionine hydroxide adenosyltransferase   PF01887 Stereoselective (R,S)-S-adenosylmethionine hydrolase (adenosine-forming) (PDB:2F4N) (PUBMED:32776704)   COG1912                         ATGA
350 c_000000001641  1634    2201    f   Coenzyme transport and metabolism   H   energy-coupling factor transport system substrate-specific component    K16924  ECF-type riboflavin transporter, S component    PF07155 ECF-type riboflavin transporter, membrane (S) component (ECF-S) (PDB:4HZU) (PUBMED:23584587)    COG4720                         ATGA
351 c_000000001641  2237    4868    f   Coenzyme transport and metabolism!!!Posttranslational modification, protein turnover, chaperones    H!!!O   energy-coupling factor transport system permease/ATP-binding protein [EC:3.6.3.-]   K24040  ABC transporter PF00005 ECF-type transporter transmembrane protein EcfT (EcfT) (PDB:4HZU)!!!ABC-type glutathione transport system ATPase component, contains duplicated ATPase domain (GsiA)    COG0619!!!COG1123                           ATGG

File two in path: /Bins/Bin_1/Bin_1-coverages.txt (all columns and rows are filled):

gene_callers_id G_C_D   G_C_R   G_N_D   G_N_R   KMET_A_D    KMET_A_R    KMET_JA_D   KMET_JA_R   KMET_JU_D   KMET_JU_R   KMET_O_D    KMET_O_R    SZBU_A_D    SZBU_A_R    SZBU_JA_D   SZBU_JA_R   SZBU_JU_D   SZBU_JU_R   SZBU_O_D    SZBU_O_R    SZVT_A_D    SZVT_A_R    SZVT_JA_D   SZVT_JA_R   SZVT_JU_D   SZVT_JU_R   SZVT_O_D    SZVT_O_R
    348 0.0 0.0 0.43333333333333335 0.0 11.533333333333333  0.0 1.8844444444444444  0.0 5.253333333333333   0.6266666666666667  02.máj  0.0 0.19333333333333333 0.0 0.0 0.0 1.808888888888889   0.0 28.márc 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    349 0.0 1.12090395480226    0.3638418079096045  0.0 15.984180790960451  0.0 5.214689265536723   0.6779661016949152  13.99322033898305   0.27231638418079096 8.910734463276835   0.6779661016949152  1.5717514124293785  0.0 0.0 0.0 1.3740112994350282  0.0 0.7694915254237288  0.0 0.3389830508474576  0.0 0.33785310734463275 0.0 0.0 0.0 0.5649717514124294  0.0
    350 0.0 0.63668430335097    0.3527336860670194  0.0 12.28747795414462   0.0 6.4638447971781305  0.0 10.800705467372135  1.0652557319223985  5.084656084656085   0.0 0.4567901234567901  0.0 0.0 0.0 2.1164021164021163  0.0 1.7989417989417988  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    351 0.0 0.0 0.29228430254656024 0.0 20.935005701254276  0.11402508551881414 9.034207525655644   0.0 17.795515013302925  0.0 11.406689471683771  0.0 0.28506271379703535 0.0 0.10452299505891297 0.0 1.3010262257696694  0.0 0.6233371341695173  0.0 0.11402508551881414 0.0 0.05701254275940707 0.0 0.28506271379703535 0.0 0.6605853287723299  0.0

I have multiple of these in the following way:

/path/Bins/Bin_1/Bin_1-gene-names.txt and Bin_1-coverages.txt; /path/Bins/Bin_2/Bin_2-gene-names.txt and Bin_2-coverages.txt ...

I would like to join these files using the first gene_callers_id column. Moreover, I would like to add the originating file name where they come from in the following way:

Bin_name        gene_callers_id contig  start   stop    direction   COG20_CATEGORY  COG20_CATEGORY (ACCESSION)  KOfam   KOfam (ACCESSION)   Pfam    Pfam (ACCESSION)    COG20_FUNCTION  COG20_FUNCTION (ACCESSION)  KEGG_Class  KEGG_Class (ACCESSION)  KEGG_Module KEGG_Module (ACCESSION) COG20_PATHWAY   COG20_PATHWAY (ACCESSION)   dna_sequence    G_C_D   G_C_R   G_N_D   G_N_R   KMET_A_D    KMET_A_R    KMET_JA_D   KMET_JA_R   KMET_JU_D   KMET_JU_R   KMET_O_D    KMET_O_R    SZBU_A_D    SZBU_A_R    SZBU_JA_D   SZBU_JA_R   SZBU_JU_D   SZBU_JU_R   SZBU_O_D    SZBU_O_R    SZVT_A_D    SZVT_A_R    SZVT_JA_D   SZVT_JA_R   SZVT_JU_D   SZVT_JU_R   SZVT_O_D    SZVT_O_R
Bin_1   348 c_000000001641  92  542 r                                                           ATGGTT  0.0 0.0 0.43333333333333335 0.0 11.533333333333333  0.0 1.8844444444444444  0.0 5.253333333333333   0.6266666666666667  02.máj  0.0 0.19333333333333333 0.0 0.0 0.0 1.808888888888889   0.0 28.márc 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bin_1   349 c_000000001641  722 1607    f   Coenzyme transport and metabolism!!!Defense mechanisms  H!!!V   S-adenosyl-L-methionine hydrolase (adenosine-forming) [EC:3.13.1.8] K22205  S-adenosyl-l-methionine hydroxide adenosyltransferase   PF01887 Stereoselective (R,S)-S-adenosylmethionine hydrolase (adenosine-forming) (PDB:2F4N) (PUBMED:32776704)   COG1912                         ATGACGAA    0.0 1.12090395480226    0.3638418079096045  0.0 15.984180790960451  0.0 5.214689265536723   0.6779661016949152  13.99322033898305   0.27231638418079096 8.910734463276835   0.6779661016949152  1.5717514124293785  0.0 0.0 0.0 1.3740112994350282  0.0 0.7694915254237288  0.0 0.3389830508474576  0.0 0.33785310734463275 0.0 0.0 0.0 0.5649717514124294  0.0
Bin_1   350 c_000000001641  1634    2201    f   Coenzyme transport and metabolism   H   energy-coupling factor transport system substrate-specific component    K16924  ECF-type riboflavin transporter, S component    PF07155 ECF-type riboflavin transporter, membrane (S) component (ECF-S) (PDB:4HZU) (PUBMED:23584587)    COG4720                         ATGATA  0.0 0.63668430335097    0.3527336860670194  0.0 12.28747795414462   0.0 6.4638447971781305  0.0 10.800705467372135  1.0652557319223985  5.084656084656085   0.0 0.4567901234567901  0.0 0.0 0.0 2.1164021164021163  0.0 1.7989417989417988  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bin_1   351 c_000000001641  2237    4868    f   Coenzyme transport and metabolism!!!Posttranslational modification, protein turnover, chaperones    H!!!O   energy-coupling factor transport system permease/ATP-binding protein [EC:3.6.3.-]   K24040  ABC transporter PF00005 ECF-type transporter transmembrane protein EcfT (EcfT) (PDB:4HZU)!!!ABC-type glutathione transport system ATPase component, contains duplicated ATPase domain (GsiA)    COG0619!!!COG1123                           ATGG    0.0 0.0 0.29228430254656024 0.0 20.935005701254276  0.11402508551881414 9.034207525655644   0.0 17.795515013302925  0.0 11.406689471683771  0.0 0.28506271379703535 0.0 0.10452299505891297 0.0 1.3010262257696694  0.0 0.6233371341695173  0.0 0.11402508551881414 0.0 0.05701254275940707 0.0 0.28506271379703535 0.0 0.6605853287723299  0.0
…                                                                                                                                                                                               
Bin_2   43138   c_000000186101  380 1145    r   Translation, ribosomal structure and biogenesis J   16S rRNA (uracil1498-N3)-methyltransferase [EC:2.1.1.193]   K09761  RNA methyltransferase   PF04452 16S rRNA U1498 N3-methylase RsmE (RsmE) (PDB:1NXZ)  COG1385                 16S rRNA modification   COG1385 ATGCA   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6784313725490196  0.0 0.0 0.0 0.0 13.032679738562091  0.0 4.043137254901961   0.0 19.190849673202614  0.0 10.05359477124183   0.0
Bin_2   43139   c_000000186101  1144    2293    r   Posttranslational modification, protein turnover, chaperones    O   molecular chaperone DnaJ    K03686  DnaJ C terminal domain  PF01556 DnaJ-class molecular chaperone with C-terminal Zn finger domain (DnaJ) (PDB:1BQ0)   COG0484                         ATGGCC  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.46736292428198434 0.0 0.0 0.8703220191470844  0.0 12.056570931244561  0.0 3.352480417754569   0.0 14.157528285465622  0.0 10.736292428198434  0.0
Bin_2   43140   c_000000186101  2315    2954    r   Posttranslational modification, protein turnover, chaperones    O   molecular chaperone GrpE    K03687  GrpE    PF01025 Molecular chaperone GrpE (heat shock protein HSP-70) (GrpE) (PDB:1DKG)  COG0576                         ATGATTGAGA  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5086071987480438  0.0 0.0 0.5195618153364632  0.0 8.449139280125195   0.9389671361502347  5.841940532081377   0.0 13.726134585289515  0.4679186228482003  9.978090766823161   0.0
Bin_2   43141   c_000000186101  3190    4063    f   Cell motility!!!Intracellular trafficking, secretion, and vesicular transport!!!Extracellular structures    N!!!U!!!W           Prokaryotic N-terminal methylation motif    PF07963 Type II secretory pathway, pseudopilin PulG (PulG) (PDB:1AY2)   COG2165                         ATG 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.995418098510882   0.0 0.33104238258877433 1.4329896907216495  0.0 9.372279495990837   0.6872852233676976  3.104238258877434   0.0 16.521191294387172  0.0 6.592210767468499   0.0
…

The result should be in the "Bins" folder, named "Bins_summary.txt". Is it possible to do this with a simple command or bash?

join awk bash • 891 views
ADD COMMENT
4
Entering edit mode
2.8 years ago

and name it by the originating folder.

awk '{printf("%s\t%s\n",$0,FILENAME);}' file1.txt > file2.txt

I would like to join these files using the first gene_callers_id column

get the column index of 'gene_callers_id column'

head -n 1 file.txt | tr "\t" "\n" | cat -n | grep gene_callers_id

I would like to join

sort and use join https://shapeshed.com/unix-join/

ADD COMMENT
0
Entering edit mode

Thank you Pierre Lindenbaum! It works fine.

ADD REPLY

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6