Hello, I have 700 metagenome-assembled genomes (MAGs) that were taxonomically classified against the GTDB database with GTDB-Tk.
Each MAG therefore has a taxonomic assignment, but for downstream analysis I need the FASTA headers to contain the taxonomy that GTDB-Tk assigned.
This is what the FASTA headers of one of the MAGs look like:
cat cluster1_bin.101.fa | grep '>' | head
>k141_1192826
>k141_94001
>k141_1104537
>k141_375209
>k141_375646
>k141_742386
>k141_560036
>k141_12021
>k141_838926
>k141_1209697
I want to know if there is a way to extract the full taxonomy from the following table and add it to the corresponding FASTA headers of each MAG:
This is the desired output for each MAG's FASTA headers, using "cluster1_bin.101.fa" as an example:
>k141_1192826 Phylum Class Order Family Genus Species
>k141_94001 Phylum Class Order Family Genus Species
>k141_1104537 Phylum Class Order Family Genus Species
>k141_375209 Phylum Class Order Family Genus Species
>k141_375646 Phylum Class Order Family Genus Species
>k141_742386 Phylum Class Order Family Genus Species
>k141_560036 Phylum Class Order Family Genus Species
>k141_12021 Phylum Class Order Family Genus Species
>k141_838926 Phylum Class Order Family Genus Species
>k141_1209697 Phylum Class Order Family Genus Species
Is there any way to do that in any programming language?
This can be done in essentially any programming language. It is a simple FASTA header modification, which can be done with existing libraries (BioPerl, Biopython) or with awk/sed to find the header lines and append the extra information. You will most likely need to write that script yourself, though.
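As a starting point, here is a minimal Python sketch using only the standard library. Since the table was not posted as text, it rests on assumptions: the taxonomy is taken to come from a GTDB-Tk summary TSV (e.g. gtdbtk.bac120.summary.tsv) with `user_genome` and `classification` columns, and `user_genome` is assumed to equal the FASTA file name without its `.fa` extension. The function names and directory layout are hypothetical. Rank prefixes (`p__`, `c__`, ...) are kept and spaces in species names are replaced with underscores so that each rank stays a single space-separated field; adjust if you prefer bare names.

```python
import csv
from pathlib import Path

RANK_PREFIXES = ("p__", "c__", "o__", "f__", "g__", "s__")  # phylum .. species


def load_taxonomy(summary_tsv):
    """Return {genome name: GTDB classification string} from a summary TSV.

    Assumes tab-separated columns named 'user_genome' and 'classification'.
    """
    taxonomy = {}
    with open(summary_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            taxonomy[row["user_genome"]] = row["classification"]
    return taxonomy


def format_lineage(classification):
    """Turn 'd__X;p__Y;...;s__Genus species' into 'p__Y ... s__Genus_species'."""
    ranks = [r.strip() for r in classification.split(";")]
    wanted = [r.replace(" ", "_") for r in ranks if r.startswith(RANK_PREFIXES)]
    # Fall back to the raw string (e.g. 'Unclassified') if no ranks were found.
    return " ".join(wanted) or classification.strip().replace(" ", "_")


def annotate_fasta(fasta_in, fasta_out, lineage):
    """Copy a FASTA file, appending the lineage to every header line."""
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].strip()  # also tolerates '> id' with a space
                dst.write(f">{header} {lineage}\n")
            else:
                dst.write(line)


def annotate_all(summary_tsv, mag_dir, out_dir, ext=".fa"):
    """Annotate every MAG in mag_dir that has an entry in the summary table."""
    taxonomy = load_taxonomy(summary_tsv)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(mag_dir).glob(f"*{ext}")):
        genome = fasta.name[: -len(ext)]  # 'cluster1_bin.101.fa' -> 'cluster1_bin.101'
        if genome not in taxonomy:
            print(f"no taxonomy found for {genome}, skipping")
            continue
        annotate_fasta(fasta, out_dir / fasta.name, format_lineage(taxonomy[genome]))


# Example call (paths are placeholders):
# annotate_all("gtdbtk.bac120.summary.tsv", "mags", "mags_annotated")
```

The annotated files are written to a separate directory so the original MAGs are left untouched. If some of your MAGs are archaeal, their classifications would probably sit in a separate summary file, so you may need to run this once per table or concatenate the tables first.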
Please do not post images of the data.
You'll need to post the table in text form for us to be able to help easily.