How can I change the headers of several FASTA files with a bash script, using the large file nucl_wgs.accession2taxid.gz?
Here is an example. Every header in my FASTA files begins with the GenBank accession.version:
$ grep "^>" GCA_000002435.2_UU_WB_2.1_genomic.fna | head
>CM018789.1 Giardia intestinalis strain WB C6 chromosome 1, whole genome shotgun sequence
>CM018790.1 Giardia intestinalis strain WB C6 chromosome 2, whole genome shotgun sequence
>CM018791.1 Giardia intestinalis strain WB C6 chromosome 3, whole genome shotgun sequence
>CM018792.1 Giardia intestinalis strain WB C6 chromosome 4, whole genome shotgun sequence
>CM018793.1 Giardia intestinalis strain WB C6 chromosome 5, whole genome shotgun sequence
>AACB03000006.1 Giardia intestinalis strain WB C6 tig00000001, whole genome shotgun sequence
>AACB03000007.1 Giardia intestinalis strain WB C6 tig00000004, whole genome shotgun sequence
The same accession.version also appears in the large file nucl_wgs.accession2taxid.gz, available at https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz:
$ zcat nucl_wgs.accession2taxid.gz | head
accession accession.version taxid gi
AAAA00000000 AAAA00000000.2 39946 54362548
AAAA02000001 AAAA02000001.1 39946 54312315
AAAA02000002 AAAA02000002.1 39946 54312316
AAAA02000003 AAAA02000003.1 39946 54312317
AAAA02000004 AAAA02000004.1 39946 54312318
AAAA02000005 AAAA02000005.1 39946 54312319
AAAA02000006 AAAA02000006.1 39946 54312320
AAAA02000007 AAAA02000007.1 39946 54312321
AAAA02000008 AAAA02000008.1 39946 54312322
I want to change the headers of my FASTA files by replacing accession.version with taxid_accession.version (columns 3 and 2 of nucl_wgs.accession2taxid.gz, joined by an underscore).
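To make the requested change concrete, here is a minimal sketch of the header rewrite on its own. It assumes a small two-column, tab-separated lookup file (accession.version, then taxid) that already fits in memory; the function name rename_headers and the file names map.tsv and renamed.fna are hypothetical, not something from the question. Headers whose accession is not in the lookup file are left unchanged.

```shell
# rename_headers MAP FASTA
# MAP   : tab-separated lookup, column 1 = accession.version, column 2 = taxid
#         (hypothetical file; assumed small enough to hold in memory)
# FASTA : input FASTA file; the rewritten FASTA is printed to stdout
rename_headers() {
    awk -F'\t' '
        # First file: load accession.version -> taxid into an array.
        FILENAME == ARGV[1] { taxid[$1] = $2; next }
        # Header lines of the FASTA file.
        /^>/ {
            acc = substr($0, 2)          # drop the leading ">"
            sub(/[ \t].*/, "", acc)      # keep only the first word (the accession)
            if (acc in taxid)
                $0 = ">" taxid[acc] "_" substr($0, 2)
        }
        # Print every FASTA line, rewritten or not.
        { print }
    ' "$1" "$2"
}

# Example call (file names are placeholders):
#   rename_headers map.tsv GCA_000002435.2_UU_WB_2.1_genomic.fna > renamed.fna
```

Writing to a new file rather than editing in place keeps the original FASTA intact in case some accessions turn out to be missing from the lookup.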
I get an error: the process runs, but it is killed before it finishes.
I see, acc2taxid.txt is too big to fit into main memory. I've updated the answer so that it generates a subset of acc2taxid.txt instead.
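For readers who hit the same out-of-memory kill, here is one possible way to build such a subset. This is a sketch, not necessarily what the updated answer does. It keeps only the accessions that occur in the FASTA headers in memory and streams the compressed file past them, so memory use scales with the number of sequences rather than with the size of nucl_wgs.accession2taxid.gz. The function name build_subset is hypothetical, and the sketch assumes the accession2taxid file is tab-separated with accession.version in column 2 and taxid in column 3.

```shell
# build_subset ACC2TAXID_GZ FASTA [FASTA...]
# Prints "accession.version<TAB>taxid" for every accession found in the
# headers of the given FASTA files. Only those accessions are kept in memory.
build_subset() {
    gz=$1; shift
    wanted=$(mktemp)
    # Collect the first word of every header line, without the leading ">".
    awk '/^>/ { sub(/^>/, ""); print $1 }' "$@" | sort -u > "$wanted"
    # Stream the big file; keep rows whose column 2 is a wanted accession.
    gzip -dc "$gz" | awk -F'\t' '
        FILENAME == ARGV[1] { want[$1]; next }   # load the wanted accessions
        ($2 in want) { print $2 "\t" $3 }        # emit accession.version, taxid
    ' "$wanted" -
    rm -f "$wanted"
}

# Example call (output name follows the thread's acc2taxid.txt):
#   build_subset nucl_wgs.accession2taxid.gz *.fna > acc2taxid.txt
```

The resulting file has two columns instead of four, which also makes it a convenient lookup table for any later header-rewriting step.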