Is it possible to join multiple files into one, using a specific column, and name it by the originating folder. Let me explain: I have the following files, having multiple columns and rows:
File one in path: /Bins/Bin_1/Bin_1-gene-names.txt (some of the rows are empty in specific columns):
gene_callers_id contig start stop direction COG20_CATEGORY COG20_CATEGORY (ACCESSION) KOfam KOfam (ACCESSION) Pfam Pfam (ACCESSION) COG20_FUNCTION COG20_FUNCTION (ACCESSION) KEGG_Class KEGG_Class (ACCESSION) KEGG_Module KEGG_Module (ACCESSION) COG20_PATHWAY COG20_PATHWAY (ACCESSION) dna_sequence
348 c_000000001641 92 542 r ATGG
349 c_000000001641 722 1607 f Coenzyme transport and metabolism!!!Defense mechanisms H!!!V S-adenosyl-L-methionine hydrolase (adenosine-forming) [EC:3.13.1.8] K22205 S-adenosyl-l-methionine hydroxide adenosyltransferase PF01887 Stereoselective (R,S)-S-adenosylmethionine hydrolase (adenosine-forming) (PDB:2F4N) (PUBMED:32776704) COG1912 ATGA
350 c_000000001641 1634 2201 f Coenzyme transport and metabolism H energy-coupling factor transport system substrate-specific component K16924 ECF-type riboflavin transporter, S component PF07155 ECF-type riboflavin transporter, membrane (S) component (ECF-S) (PDB:4HZU) (PUBMED:23584587) COG4720 ATGA
351 c_000000001641 2237 4868 f Coenzyme transport and metabolism!!!Posttranslational modification, protein turnover, chaperones H!!!O energy-coupling factor transport system permease/ATP-binding protein [EC:3.6.3.-] K24040 ABC transporter PF00005 ECF-type transporter transmembrane protein EcfT (EcfT) (PDB:4HZU)!!!ABC-type glutathione transport system ATPase component, contains duplicated ATPase domain (GsiA) COG0619!!!COG1123 ATGG
File two in path: /Bins/Bin_1/Bin_1-coverages.txt (all columns and rows are filled):
gene_callers_id G_C_D G_C_R G_N_D G_N_R KMET_A_D KMET_A_R KMET_JA_D KMET_JA_R KMET_JU_D KMET_JU_R KMET_O_D KMET_O_R SZBU_A_D SZBU_A_R SZBU_JA_D SZBU_JA_R SZBU_JU_D SZBU_JU_R SZBU_O_D SZBU_O_R SZVT_A_D SZVT_A_R SZVT_JA_D SZVT_JA_R SZVT_JU_D SZVT_JU_R SZVT_O_D SZVT_O_R
348 0.0 0.0 0.43333333333333335 0.0 11.533333333333333 0.0 1.8844444444444444 0.0 5.253333333333333 0.6266666666666667 02.máj 0.0 0.19333333333333333 0.0 0.0 0.0 1.808888888888889 0.0 28.márc 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
349 0.0 1.12090395480226 0.3638418079096045 0.0 15.984180790960451 0.0 5.214689265536723 0.6779661016949152 13.99322033898305 0.27231638418079096 8.910734463276835 0.6779661016949152 1.5717514124293785 0.0 0.0 0.0 1.3740112994350282 0.0 0.7694915254237288 0.0 0.3389830508474576 0.0 0.33785310734463275 0.0 0.0 0.0 0.5649717514124294 0.0
350 0.0 0.63668430335097 0.3527336860670194 0.0 12.28747795414462 0.0 6.4638447971781305 0.0 10.800705467372135 1.0652557319223985 5.084656084656085 0.0 0.4567901234567901 0.0 0.0 0.0 2.1164021164021163 0.0 1.7989417989417988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
351 0.0 0.0 0.29228430254656024 0.0 20.935005701254276 0.11402508551881414 9.034207525655644 0.0 17.795515013302925 0.0 11.406689471683771 0.0 0.28506271379703535 0.0 0.10452299505891297 0.0 1.3010262257696694 0.0 0.6233371341695173 0.0 0.11402508551881414 0.0 0.05701254275940707 0.0 0.28506271379703535 0.0 0.6605853287723299 0.0
I have multiple of these in the following way:
/path/Bins/Bin_1/Bin_1-gene-names.txt and Bin_1-coverages.txt; /path/Bins/Bin_2/Bin_2-gene-names.txt and Bin_2-coverages.txt ...
I would like to join these files using the first gene_callers_id column. Moreover, I would like to add the originating file name where they come from in the following way:
Bin_name gene_callers_id contig start stop direction COG20_CATEGORY COG20_CATEGORY (ACCESSION) KOfam KOfam (ACCESSION) Pfam Pfam (ACCESSION) COG20_FUNCTION COG20_FUNCTION (ACCESSION) KEGG_Class KEGG_Class (ACCESSION) KEGG_Module KEGG_Module (ACCESSION) COG20_PATHWAY COG20_PATHWAY (ACCESSION) dna_sequence G_C_D G_C_R G_N_D G_N_R KMET_A_D KMET_A_R KMET_JA_D KMET_JA_R KMET_JU_D KMET_JU_R KMET_O_D KMET_O_R SZBU_A_D SZBU_A_R SZBU_JA_D SZBU_JA_R SZBU_JU_D SZBU_JU_R SZBU_O_D SZBU_O_R SZVT_A_D SZVT_A_R SZVT_JA_D SZVT_JA_R SZVT_JU_D SZVT_JU_R SZVT_O_D SZVT_O_R
Bin_1 348 c_000000001641 92 542 r ATGGTT 0.0 0.0 0.43333333333333335 0.0 11.533333333333333 0.0 1.8844444444444444 0.0 5.253333333333333 0.6266666666666667 02.máj 0.0 0.19333333333333333 0.0 0.0 0.0 1.808888888888889 0.0 28.márc 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bin_1 349 c_000000001641 722 1607 f Coenzyme transport and metabolism!!!Defense mechanisms H!!!V S-adenosyl-L-methionine hydrolase (adenosine-forming) [EC:3.13.1.8] K22205 S-adenosyl-l-methionine hydroxide adenosyltransferase PF01887 Stereoselective (R,S)-S-adenosylmethionine hydrolase (adenosine-forming) (PDB:2F4N) (PUBMED:32776704) COG1912 ATGACGAA 0.0 1.12090395480226 0.3638418079096045 0.0 15.984180790960451 0.0 5.214689265536723 0.6779661016949152 13.99322033898305 0.27231638418079096 8.910734463276835 0.6779661016949152 1.5717514124293785 0.0 0.0 0.0 1.3740112994350282 0.0 0.7694915254237288 0.0 0.3389830508474576 0.0 0.33785310734463275 0.0 0.0 0.0 0.5649717514124294 0.0
Bin_1 350 c_000000001641 1634 2201 f Coenzyme transport and metabolism H energy-coupling factor transport system substrate-specific component K16924 ECF-type riboflavin transporter, S component PF07155 ECF-type riboflavin transporter, membrane (S) component (ECF-S) (PDB:4HZU) (PUBMED:23584587) COG4720 ATGATA 0.0 0.63668430335097 0.3527336860670194 0.0 12.28747795414462 0.0 6.4638447971781305 0.0 10.800705467372135 1.0652557319223985 5.084656084656085 0.0 0.4567901234567901 0.0 0.0 0.0 2.1164021164021163 0.0 1.7989417989417988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bin_1 351 c_000000001641 2237 4868 f Coenzyme transport and metabolism!!!Posttranslational modification, protein turnover, chaperones H!!!O energy-coupling factor transport system permease/ATP-binding protein [EC:3.6.3.-] K24040 ABC transporter PF00005 ECF-type transporter transmembrane protein EcfT (EcfT) (PDB:4HZU)!!!ABC-type glutathione transport system ATPase component, contains duplicated ATPase domain (GsiA) COG0619!!!COG1123 ATGG 0.0 0.0 0.29228430254656024 0.0 20.935005701254276 0.11402508551881414 9.034207525655644 0.0 17.795515013302925 0.0 11.406689471683771 0.0 0.28506271379703535 0.0 0.10452299505891297 0.0 1.3010262257696694 0.0 0.6233371341695173 0.0 0.11402508551881414 0.0 0.05701254275940707 0.0 0.28506271379703535 0.0 0.6605853287723299 0.0
…
Bin_2 43138 c_000000186101 380 1145 r Translation, ribosomal structure and biogenesis J 16S rRNA (uracil1498-N3)-methyltransferase [EC:2.1.1.193] K09761 RNA methyltransferase PF04452 16S rRNA U1498 N3-methylase RsmE (RsmE) (PDB:1NXZ) COG1385 16S rRNA modification COG1385 ATGCA 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6784313725490196 0.0 0.0 0.0 0.0 13.032679738562091 0.0 4.043137254901961 0.0 19.190849673202614 0.0 10.05359477124183 0.0
Bin_2 43139 c_000000186101 1144 2293 r Posttranslational modification, protein turnover, chaperones O molecular chaperone DnaJ K03686 DnaJ C terminal domain PF01556 DnaJ-class molecular chaperone with C-terminal Zn finger domain (DnaJ) (PDB:1BQ0) COG0484 ATGGCC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.46736292428198434 0.0 0.0 0.8703220191470844 0.0 12.056570931244561 0.0 3.352480417754569 0.0 14.157528285465622 0.0 10.736292428198434 0.0
Bin_2 43140 c_000000186101 2315 2954 r Posttranslational modification, protein turnover, chaperones O molecular chaperone GrpE K03687 GrpE PF01025 Molecular chaperone GrpE (heat shock protein HSP-70) (GrpE) (PDB:1DKG) COG0576 ATGATTGAGA 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5086071987480438 0.0 0.0 0.5195618153364632 0.0 8.449139280125195 0.9389671361502347 5.841940532081377 0.0 13.726134585289515 0.4679186228482003 9.978090766823161 0.0
Bin_2 43141 c_000000186101 3190 4063 f Cell motility!!!Intracellular trafficking, secretion, and vesicular transport!!!Extracellular structures N!!!U!!!W Prokaryotic N-terminal methylation motif PF07963 Type II secretory pathway, pseudopilin PulG (PulG) (PDB:1AY2) COG2165 ATG 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.995418098510882 0.0 0.33104238258877433 1.4329896907216495 0.0 9.372279495990837 0.6872852233676976 3.104238258877434 0.0 16.521191294387172 0.0 6.592210767468499 0.0
…
The result should be in the "Bins" folder, named "Bins_summary.txt". Is it possible to do this with a simple command or bash?
Thank you Pierre Lindenbaum! It works fine.