How to add taxonomic information for each fasta header ?
1
1
Entering edit mode
2.9 years ago

Hello I need to add taxonomic information for each of my metagneome assembled genomes (MAGs).

each fasta file is to a representative draft genome of a bacterial specie. I used GTDB-Tk to classify my MAGs. The issue is that I need to add to each fasta header of each contig the taxonomic assignation on genus taxonomic category.

for example the fasta headers of the following MAG looks like this:

cat bin.10.fna | grep '>' | head -3

k141_811263

k141_1446

k141_974775

now from the GTDB-Tk output file, each comma separated column is to a different taxonomic rank. Lets say I want to add genus text in a tab separated way to each fasta header of each MAG fasta file. I can retrieve all the genus information using the following:

awk -F "\t" '{print $7}' GTDB-tk_output.bac120.summary.csv | head -6

Genus

Clostridium_p

psychrobacter

Clostridium

Anaerobiospirillum_A

Sutterella

So I want to add the correspoding genus name to all the contigs fasta headers of a MAG , this is the desired output:

cat bin.10.fna | grep '>' | head -3

k141_811263 Clostridium_p

k141_1446 Clostridium_p

k141_974775 Clostridium_p

As input files I have the gtdb-tk sumamry csv output file whose columns are comma separated:

bin_id Domain Phylum Class Order Family Genus Species

The bin_id column is to the bin name , they are displayed in the folowing way :

bin.1

bin.2

bin.3

....

bin.75

And a directory in which are all the fasta of my MAGs. those files are named on the following way:

bin.#.fna where '#' is a number from 1 to 75.

both the bin_id column values and the ls output order of the MAGs directory are the same but for the ".fna" suffix of each of the ls output file, id est:

awk -F "\t" 'NR>1 {print $1}' GTDB-tk_output.bac120.summary.csv | head -5

bin.1

bin.2

bin.3

bin.4

bin.5

ls | head -5

bin.1.fna

bin.2.fna

bin.3.fna

bin.4.fna

bin.5.fna

So how can I do that?

awk taxonomy FASTA • 1.6k views
ADD COMMENT
0
Entering edit mode

Please format this post correctly using 101010 code button. Using > turns text into quote. See example below.

Using just > in front of a line turns it into quoted text.

This is a fasta header

Using 1010 to properly format the text displays it properly as fasta header.

> This is a fasta header

It is difficult to figure out what you want above since most of it looks like plain quoted text.

ADD REPLY
1
Entering edit mode
2.9 years ago

Strategy:

  1. search and get the genus name with bin_id
  2. rename the fasta header

Here's a solution with my tools. All can be replaced with GNU tools.

  • rush is for parallelizing processing and easily extracting bin_id with {%.} from a file's basename. GNU parallel works too, by {/.}.
  • csvtk is for extracting Genus for a bin_id, GNU grep + cut + sed can also work, just remember to add -w and use grep -w "^bin.1" to avoid matching bin.10.
  • seqkit is for editing fasta header, GNU sed works too.
  • I assume GTDB-tk_output.bac120.summary.csv is actually tab-delimited according to the description. If not, remove -t for csvtk.
ls *.fna \
    | rush 'g=$(csvtk grep -t -f bin_id -p {%.} GTDB-tk_output.bac120.summary.csv \
                | csvtk cut -t -f Genus \
                | csvtk del-header); \
            echo {%}: $g; \
            seqkit replace -p "$" -r " $g" {} -o {}.fasta'
ADD COMMENT
0
Entering edit mode

First of all thanks so much for your answer!

second I got this error form rush:

usage: rush [-h] [-d] <command> ...
rush: error: argument "<command>": Invalid choice: g=$(csvtk grep -t -f bin_id -p {%.} SLF_MAGs.bac120.summary.csv \
                | csvtk cut -t -f Genus \
                | csvtk del-header); \
            echo {%}: $g; \
            seqkit replace -p "$" -r " $g" {} -o {}.fasta'

(choose from [add, change, check, deploy, init, init-autoinstaller, init-deploy, install, link, list, publish, purge, scan, setup, unlink, update, update-autoinstaller, update-cloud-credentials, version, write-build-cache, build, rebuild, tab-complete])

why do I get this?

ADD REPLY
0
Entering edit mode

It seems to be another rush, check:

which rush
rush -h
ADD REPLY

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6