Hello,
I have hundreds of taxonomic annotation files. The 4th column of the files is the taxonomic rank. The first few rows of the files look like this:
6.46 387327 387327 U 0 unclassified
93.54 5610481 488 R 1 root
93.53 5609584 75743 R1 2 d__Bacteria
43.31 2597449 11790 R2 18 p__Actinobacteriota
23.04 1382144 149 R3 19 c__Actinobacteria
22.98 1378342 590 R4 20 o__Actinomycetales
22.55 1352503 35 R5 21 f__Bifidobacteriaceae
22.54 1352180 54264 R6 22 g__Bifidobacterium
9.43 565635 565635 R7 1797 s__Bifidobacterium adolescentis
R1=Domain, R2=Phylum, R3=Class, R4=Order, R5=Family, R6=Genus, and R7=Species.
I want to replace R2 to R7 with the uppercase first letter of the respective taxonomic rank. Additionally, I want to remove the prefixes ending in a double underscore (e.g., d__, p__) from the taxonomic names so that the output is usable with another tool. The desired output should look like this:
6.46 387327 387327 U 0 unclassified
93.54 5610481 488 R 1 root
93.53 5609584 75743 R1 2 Bacteria
43.31 2597449 11790 P 18 Actinobacteriota
23.04 1382144 149 C 19 Actinobacteria
22.98 1378342 590 O 20 Actinomycetales
22.55 1352503 35 F 21 Bifidobacteriaceae
22.54 1352180 54264 G 22 Bifidobacterium
9.43 565635 565635 S 1797 Bifidobacterium adolescentis
Please note that Bacteria (domain) should keep the R1 value. Since I have hundreds of such files, it is quite difficult to make the changes in Excel or a text editor.
Could you please suggest a better option?
Many thanks for your time and help!
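A minimal sketch of one way to do this with awk, which is not necessarily the command discussed elsewhere in this thread. It assumes the files are tab-separated with the rank code in column 4 and the name in column 6; the function name convert_ranks is hypothetical and only used for illustration.

```shell
# Hypothetical helper: reads one report on stdin, writes the converted report to stdout.
# Assumption: columns are tab-separated (rank code in column 4, name in column 6).
convert_ranks() {
  awk 'BEGIN {
         FS = OFS = "\t"
         # Map rank codes to single letters; R1 is deliberately left out.
         map["R2"] = "P"; map["R3"] = "C"; map["R4"] = "O"
         map["R5"] = "F"; map["R6"] = "G"; map["R7"] = "S"
       }
       {
         if ($4 in map) $4 = map[$4]
         # Remove the first "x__" style prefix from the name, if there is one.
         sub(/[a-z]__/, "", $6)
         print
       }'
}

# Small tab-separated sample taken from the rows above.
printf '93.53\t5609584\t75743\tR1\t2\td__Bacteria\n9.43\t565635\t565635\tR7\t1797\ts__Bifidobacterium adolescentis\n' | convert_ranks
```

On the sample, the first row keeps R1 and becomes "Bacteria", and the second row gets S and "Bifidobacterium adolescentis". For many files, the function could be called in a loop, for example `for f in *.txt; do convert_ranks < "$f" > "${f%.txt}.converted.txt"; done`, where the .txt extension and the output naming are assumptions.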
Matthias Zepper, thank you very much for the solution and for explaining it to me. I tried it on a Linux system. The command does its job, but after the conversion (i.e., R# to P/C/O, etc.) all columns from the 3rd row onwards become space-separated. Is there a way to use tab as the column separator (from the 3rd row onwards)?
Many thanks!
Yes. You can specify the output field separator in awk with OFS. Add a BEGIN statement to the awk command like so:
awk 'BEGIN{OFS="\t"}; ...}'
The rest of the command is unchanged.
Thanks a lot, Matthias Zepper, it solved my problem.
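For anyone wondering why only some rows changed: awk rebuilds a line whenever one of its fields is assigned, joining the fields with OFS (a single space by default), while lines that are not modified are printed as they were read. The sketch below illustrates this on one sample row. It assumes the input is tab-separated; in that case setting FS to tab as well may be worth considering, because with the default FS a two-word species name is split into two fields.

```shell
# One tab-separated sample row with a two-word species name.
line=$(printf '9.43\t565635\t565635\tR7\t1797\ts__Bifidobacterium adolescentis')

# Default FS and OFS: assigning to field 4 rebuilds the record with single spaces.
spaces=$(printf '%s\n' "$line" | awk '{ $4 = "S"; print }')

# OFS alone: tabs come back, but the species name is split into fields 6 and 7.
ofs_only=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "\t" } { $4 = "S"; print }')

# FS and OFS both set to tab: the species name stays in a single field.
both=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "\t" } { $4 = "S"; print }')

# Count the tab-separated fields in each result.
for v in "$spaces" "$ofs_only" "$both"; do
  printf '%s\n' "$v" | awk -F'\t' '{ print NF " tab-separated field(s)" }'
done
```

The three results contain 1, 7, and 6 tab-separated fields respectively, so only the last variant keeps the original six-column layout.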