I need to extract the headers from a multifasta file being used as a database for metagenomic analysis. The structure of the fasta file is as follows:
>NW_002197112.1 Penicillium marneffei ATCC 18224 scf_1107713384177, whole genome shotgun sequence
GCCTTAAAATGCCGCTTCCCAGATCTGCGCCGAAGAGCAATCCATCTCCTCTCCAGCCCCAATGCAGCAACTGCTAACGG
CAGTGCGACGTGCGGGGTGAATTTCAGCGGTTGCTATCGACTTGTGCCATCGCAGCGTTTTCGCGTCCACGGTCGCCGCC
GCATGCTCCATGCACGATATGGCTGGTCGGATGCTAGTTGTGCTC
>NW_002197111.1 Penicillium marneffei ATCC 18224 scf_1107713383857, whole genome shotgun sequence
TACTGCTTTGTGGAACATCGCCCTTGTGGAGATCTCCCTCACGCTGGATGTTGAAAGACGCAGAACAGTTGGCACAGCCA
ATTTAGAATGCCTGATCAAGACGCATCGCCACATCCAGGCAGGTGCGATTCCTCTCTTATAAATAAATATTTTCAACGGC
ATCTGGAGAACTCATCAACTTGCAGTTGCTCATCATTATCTCGGTCAT
What I need to do is extract only the identifiers and the taxonomy to create a taxonomy.txt file with the same structure as shown below, with taxonomy separated by level and the taxon identifier in the final column:
Saccharomycetaceae;Kluyveromyces;lactis;CR382121.1
Saccharomycetaceae;Kluyveromyces;lactis;CR382122.1
What have you tried so far? Did you attempt a solution with awk / perl / python / whatever? Did you try searching BioStars? Manipulating FASTA headers is a recurring theme; searching the site should give you enough material to at least get started.
I tried a couple of methods using grep, but I was extracting everything to a text file instead of just the taxonomy and ID number.
Do you have a file for mapping accession ids to taxonomy? If so, you can use this code
and then you can add more commands to the last grep to modify the output.
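If you would rather do the whole thing in one script, here is a minimal Python sketch of the same mapping idea. It is not the grep pipeline mentioned above, only an illustration under some assumptions: the mapping file is assumed to be tab-separated, with the accession in the first column and a semicolon-delimited lineage (e.g. Saccharomycetaceae;Kluyveromyces;lactis) in the second. That layout, the function names, and the file names in the usage line are all hypothetical, so adjust them to whatever your mapping file actually looks like.

```python
import csv


def read_accessions(fasta_path):
    """Yield the accession (first word after '>') of every FASTA record."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def load_mapping(mapping_path):
    """Read accession -> lineage from a two-column, tab-separated file.

    Assumed layout (hypothetical): accession<TAB>Family;Genus;species
    """
    mapping = {}
    with open(mapping_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping


def taxonomy_lines(fasta_path, mapping_path):
    """Yield 'lineage;accession' for every header found in the mapping."""
    mapping = load_mapping(mapping_path)
    for accession in read_accessions(fasta_path):
        lineage = mapping.get(accession)
        if lineage is None:
            # Accessions missing from the mapping file are skipped here;
            # you may prefer to log them instead.
            continue
        yield f"{lineage};{accession}"


def write_taxonomy(fasta_path, mapping_path, out_path):
    """Write one taxonomy line per mapped FASTA header to out_path."""
    with open(out_path, "w") as out:
        for line in taxonomy_lines(fasta_path, mapping_path):
            out.write(line + "\n")
```

With placeholder file names, usage would look like write_taxonomy("db.fasta", "acc2tax.tsv", "taxonomy.txt"). Only the header lines are inspected, so the sequence lines never end up in the output.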