I am trying to build a database for metagenomic analysis. I have all the genomic.fna.gz files for bacteria and viruses from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/ (release 77), which has 42080 species for bacteria and 5654 for viral.
The header for a viral sequence looks like this:
>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence
And for a bacterial sequence:
>gi|759427590|ref|NZ_CDDW01000001.1| Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036, contig F321_contig22, whole genome shotgun sequence
Is there a way I can modify all the FASTA files (n = 659 for bacterial and n = 2 for viral) so that the headers look like this for both viral and bacterial sequences:
>NZ_CDDW01000001.1|some_text|taxid Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036 contig F321_contig22, whole genome shotgun sequence
Thanks
Badri
Yes, I can write scripts, but the problem is that I have concatenated all the .fna files into one huge FASTA file, which is 235 GB. My awk script is not helping because it is too slow.
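To make the target format concrete, here is a minimal Python sketch of the header rewrite I am after. It streams the file line by line, so the 235 GB file is never held in memory. The function names, the `some_text` label, the `"NA"` fallback, and the accession-to-taxid dictionary are all assumptions for illustration. The taxid is not in the header, so it would have to come from a separate mapping. The description is copied unchanged.

```python
import io


def rewrite_header(line, taxids, label="some_text"):
    """Turn '>gi|N|ref|ACC| desc' into '>ACC|label|taxid desc'."""
    head, _, desc = line[1:].rstrip("\n").partition(" ")
    fields = head.split("|")
    # Old-style RefSeq headers split into: gi, <number>, ref, <accession>, ''
    if len(fields) >= 4 and fields[0] == "gi":
        acc = fields[3]
    else:
        acc = fields[0]  # assumption: otherwise the first field is the accession
    # "NA" is a hypothetical placeholder for accessions missing from the mapping
    taxid = taxids.get(acc, "NA")
    return f">{acc}|{label}|{taxid} {desc}".rstrip() + "\n"


def rewrite_fasta(src, dst, taxids, label="some_text"):
    """Stream src to dst, rewriting header lines and copying sequence lines."""
    for line in src:
        if line.startswith(">"):
            dst.write(rewrite_header(line, taxids, label))
        else:
            dst.write(line)


if __name__ == "__main__":
    demo = io.StringIO(
        ">gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, "
        "complete sequence\nACGT\n"
    )
    out = io.StringIO()
    # "0" is a made-up taxid used only for this demo
    rewrite_fasta(demo, out, {"NC_019944.1": "0"})
    print(out.getvalue(), end="")
```

On the real data, `src` and `dst` would be file handles, for example from `gzip.open(path, "rt")` for the .fna.gz files. I have not checked whether this would be faster than my awk script.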