Question

Help me with bacteria classification


2.5 years ago

Giulia.cosenza ▴ 110

Hi, I have a FASTQ file obtained from untargeted metagenomic sequencing. I ran a taxonomic analysis on it with Kraken2, and my output looks like this:

[Image: screenshot of the Kraken2 report output]

The fields of the output, from left-to-right, are as follows:

-Percentage of fragments covered by the clade rooted at this taxon

-Number of fragments covered by the clade rooted at this taxon

-Number of fragments assigned directly to this taxon

-A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.

-NCBI taxonomic ID number

-Indented scientific name

I'd like to understand the nature of the different species obtained. In particular, I'd like to know their isolation source, whether they are pathogenic, and so on.

What is the best way to do that?

Someone suggested that I use Entrez Direct like this:

$ esearch -db biosample -query SAMN10026047 | efetch

1: Corallococcus genome_CA054A

Identifiers: BioSample: SAMN10026047; Sample name: Corallococcus CA054A

Organism: Corallococcus terminator

Attributes:

/strain="CA054A"

/isolation source="soil"

/collection date="2016-09-28"

/geographic location="United Kingdom"

/sample type="Bacterial Isolate"

/identified by="Aberystwyth University"

/type-material="type strain of Corallococcus terminator"

Accession: SAMN10026047 ID: 10026047

But I do not know the BioSample accession numbers for these species; I only have their names and their taxonomic IDs.

Bacteria sra NCBI • 687 views

updated 2.5 years ago by Istvan Albert 102k • written 2.5 years ago by Giulia.cosenza ▴ 110

score 1 · Answer 1 · 2022-05-19

The default Kraken2 database is built from RefSeq data, which NCBI indexes in a so-called assembly summary table. Each row of that table connects a TaxID to an assembly accession and a BioSample accession.

The structure of that table is described here:

https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt

The table covering all of RefSeq, bacteria included, can be found as assembly_summary_refseq.txt here:

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

From that you can figure out how the taxids are connected to other information.
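As a rough sketch of that lookup, the function below scans a local copy of the table for a TaxID and prints the matching assembly accession, BioSample accession, and organism name. The function name taxid_to_biosample is made up for illustration. The column positions (1 = assembly_accession, 3 = biosample, 6 = taxid, 7 = species_taxid, 8 = organism_name) are taken from the README linked above and should be checked against the header of the file you download.

```shell
# Hypothetical helper: look up a TaxID in an NCBI assembly summary table.
# Usage: taxid_to_biosample TAXID TABLE
# Assumed tab-separated columns, per README_assembly_summary.txt:
#   1 = assembly_accession, 3 = biosample, 6 = taxid,
#   7 = species_taxid,      8 = organism_name
taxid_to_biosample() {
    # Skip header lines (they start with '#'), then match the TaxID against
    # either the strain-level taxid or the species-level taxid.
    awk -F '\t' -v t="$1" \
        '!/^#/ && ($6 == t || $7 == t) { print $1 "\t" $3 "\t" $8 }' "$2"
}
```

One species TaxID can match many assemblies, so expect several rows for common organisms. The second column of the output is the BioSample accession (SAMN...), which can then be passed to the esearch -db biosample ... | efetch command shown in the question to retrieve the isolation source and other attributes.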