Replace Accesion Headers by full ncbi Lineage Taxnomy
1
0
Entering edit mode
6.3 years ago
dllopezr ▴ 130

Hi everybody

I have several multifasta files with headers like this:

>NC_014760.1:c365195-363993 Mycoplasma bovis PG45 clone MU clone A2, complete genome
>NC_013511.1:c476217-475021 Mycoplasma hominis ATCC 23114 chromosome complete genome

I would like to replace this headers by ncbi full taxonomy:

>Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,Bacillus anthracis

Can you help me with that?

This question FASTA file with taxid, want to obtain lineage and put into header is almost the same that I have, but in there the user had ncbi tax ID instead of accesion number-

Thank you

gene sequence • 2.0k views
ADD COMMENT
0
Entering edit mode

Hello dllopezr,

could you please guide me where the full taxonomy is from and how it is connected to your current headers?

fin swimmer

ADD REPLY
0
Entering edit mode

The NCBI taxonomy always works in combination with the tax ID. I use the files merged.dmp and rankedlineage.dmp from here ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip and ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.

ADD REPLY
2
Entering edit mode
6.3 years ago

Dataset and tools:

Here we go:

  1. Few lines of nucl_gb.accession2taxid.gz

    $ zcat nucl_gb.accession2taxid.gz | head -n 3 | column -t
    accession  accession.version  taxid  gi
    A00002     A00002.1           9913   2
    A00003     A00003.1           9913   3
    
  2. Retrieving the accession from fasta file

    $ cat seqs.fa 
    >NC_014760.1:c365195-363993 Mycoplasma bovis PG45 clone MU clone A2, complete genome
    actg
    >NC_013511.1:c476217-475021 Mycoplasma hominis ATCC 23114 chromosome complete genome
    ACTN
    
    $ seqkit seq -n -i --id-regexp '^(.+):' seqs.fa | tee seqs.fa.id
    NC_014760.1
    NC_013511.1
    
  3. Searching the taxids

    $ time csvtk grep -t -f accession.version -P seqs.fa.id nucl_gb.accession2taxid.gz \
        | csvtk cut -t -f accession.version,taxid | tee seqs.fa.id2taxid
    
    NC_013511.1     347256
    NC_014760.1     289397
    
    real    0m50.546s
    user    1m44.892s
    sys     0m4.160s
    
    # faster using shell grep,
    $ time zcat nucl_gb.accession2taxid.gz \
        | grep -w  -f seqs.fa.id | cut -f 2,3 | tee seqs.fa.id2taxid
    NC_013511.1     347256
    NC_014760.1     289397
    
    real    0m19.034s
    user    0m23.868s
    sys     0m1.384s
    
  4. Computing lineage

    $ time cat seqs.fa.id2taxid \
        | taxonkit lineage -i 2 \
        | sed 1d | cut -f 1,3 | tee seqs.fa.id2taxid2lineage
    
    NC_013511.1     cellular organisms;Bacteria;Terrabacteria group;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma hominis;Mycoplasma hominis ATCC 23114
    NC_014760.1     cellular organisms;Bacteria;Terrabacteria group;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis;Mycoplasma bovis PG45
    
    real    0m3.200s
    user    0m5.392s
    sys     0m0.224s
    
  5. Replacing fasta header

    $ seqkit replace -k seqs.fa.id2taxid2lineage -p '^(.+):.+' -r '{kv}' seqs.fa \
        | seqkit replace -p ';' -r ','
    
    >cellular organisms,Bacteria,Terrabacteria group,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma bovis,Mycoplasma bovis PG45
    actg
    >cellular organisms,Bacteria,Terrabacteria group,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma hominis,Mycoplasma hominis ATCC 23114
    ACTN
    

You want the lineage in format of "kingdom, phylum ... species" instead of full lineage? Fine:

$ time cat seqs.fa.id2taxid \
    | taxonkit lineage -i 2 \
    | taxonkit reformat -i 3 
    | sed 1d | cut -f 1,4 | tee seqs.fa.id2taxid2lineage2

NC_013511.1     Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma hominis
NC_014760.1     Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis

real    0m5.183s
user    0m13.920s
sys     0m0.764s

$ seqkit replace -k seqs.fa.id2taxid2lineage2 -p '^(.+):.+' -r '{kv}' seqs.fa \
    | seqkit replace -p ';' -r ','

>Bacteria,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma bovis
actg
>Bacteria,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma hominis
ACTN

It's exact what you want.

ADD COMMENT

Login before adding your answer.

Traffic: 987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6