Question

How to change headers of fasta files with the help of another file (nucl_wgs.accession2taxid.gz) containing the desired names (taxID_ accession.version)

0

Entering edit mode

3.6 years ago

Sebastian • 0

How could I change the headers of different fasta files with the help of the large file nucl_wgs.accession2taxid.gz with bash scripting?

I show the following example. All fasta files in the header begin with the GenBank accession version:

 $ grep "^>" GCA_000002435.2_UU_WB_2.1_genomic.fna | head
>CM018789.1 Giardia intestinalis strain WB C6 chromosome 1, whole genome shotgun sequence
>CM018790.1 Giardia intestinalis strain WB C6 chromosome 2, whole genome shotgun sequence
>CM018791.1 Giardia intestinalis strain WB C6 chromosome 3, whole genome shotgun sequence
>CM018792.1 Giardia intestinalis strain WB C6 chromosome 4, whole genome shotgun sequence
>CM018793.1 Giardia intestinalis strain WB C6 chromosome 5, whole genome shotgun sequence
>AACB03000006.1 Giardia intestinalis strain WB C6 tig00000001, whole genome shotgun sequence
>AACB03000007.1 Giardia intestinalis strain WB C6 tig00000004, whole genome shotgun sequence

This accession version also appears in the large file nucl_wgs.accession2taxid.gz available in https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz:

 $ zcat nucl_wgs.accession2taxid.gz | head
accession       accession.version       taxid   gi
AAAA00000000    AAAA00000000.2  39946   54362548
AAAA02000001    AAAA02000001.1  39946   54312315
AAAA02000002    AAAA02000002.1  39946   54312316
AAAA02000003    AAAA02000003.1  39946   54312317
AAAA02000004    AAAA02000004.1  39946   54312318
AAAA02000005    AAAA02000005.1  39946   54312319
AAAA02000006    AAAA02000006.1  39946   54312320
AAAA02000007    AAAA02000007.1  39946   54312321
AAAA02000008    AAAA02000008.1  39946   54312322

I am interested in changing the headers of my fasta files in the way of changing accession.version to taxid_accession.version (column 3 and 2 of nucl_wgs.accession2taxid.gz file).

wgs GenBank fasta bash Taxonomy • 3.0k views

ADD COMMENT • link updated 3.6 years ago by shenwei356 8.7k • written 3.6 years ago by Sebastian • 0

score 0 · Answer 1 · 2021-04-23

0

Entering edit mode

3.6 years ago

shenwei356 8.7k

seqkit replace could edit fasta header with regular expression, matched patterns could be replaced with corresponding values via a tabular key-value file.

(updated)

Extracting accession.versions in the fasta file
```
 seqkit seq -n -i files.fna > acc.txt
```

Extracting accession.versions and their taxids with help of csvtk grep

 zcat nucl_wgs.accession2taxid.gz \
     | csvtk grep -t -f accession.version -P acc.txt \
     | csvtk cut -t -f accession.version,taxid \
     > acc2taxid.txt

Replacing accession.version with taxids

 seqkit replace -k acc2taxid.txt -p '^(\S+)' -r '{kv}_$1' \
     files.fna -o result.fna

The orignial solution produces a huge acc2taxid.txt that would exhaust the RAM.

Preparing the key-value file.

 zcat nucl_wgs.accession2taxid.gz \
     | cut -f 2,3 > acc2taxid.txt

Replacing accession.version with taxids

 seqkit replace -k acc2taxid.txt -p '^(\S+)' -r '{kv}_$1' \
     files.fna -o result.fna

ADD COMMENT • link 3.6 years ago by shenwei356 8.7k

0

Entering edit mode

I obtain an error, the process runs but at the end is killed.

$ seqkit replace -k acc2taxid.txt -p '^(\S+)' -r '{kv}_$1' \ GCA_000002435.2_UU_WB_2.1_genomic.fna -o result
.fna

[INFO] read key-value file: acc2taxid.txt


Killed

ADD REPLY • link 3.6 years ago by Sebastian • 0

0

Entering edit mode

I see, the acc2taxid.txt is too big to fill into the main memory. I've updated the answer which generates a subset acc2taxid.txt.

ADD REPLY • link 3.6 years ago by shenwei356 8.7k