Formatting issues using DADA2 with the EukRibo database
10 months ago

I have been using DADA2 with the Silva and PR2 databases. Both provide a training set that can be used to assign taxonomy with the assignTaxonomy function:

taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v132_train_set.fa.gz")

I wish to use the EukRibo database (https://zenodo.org/records/6896896) but unfortunately the formatting is not the same.

For example, in Silva the training file and the species assignment file are formatted as follows.

The training file header lists the taxonomic ranks separated by semicolons, followed by the sequence:

zcat silva_nr_v132_train_set.fa.gz | head -3

>Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus_Regiella;
TTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCG

The species assignment file header gives the ID, genus and species:

zcat silva_species_assignment_v132.fa.gz | head -3

>AC201869.46386.47908 Regiella insecticola
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGCAGCGGGGAGTAGCTTGCTACTCTGC

However, in EukRibo the full sequences file is formatted like this:

zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | head -3

>AB000271 Eukaryota|Diaphoretickes|Sar|Alveolata|Myzozoa|AC-clade|Apicomplexa|CM-group|coccidiomorphea|hematozoans|HP-clade|Piroplasmorida|Theileridae|g:Theileria|Theileria+sergenti
AACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTTAAAGATTAAGCCATGCATGTCT

This appears to be the ID followed by the taxonomic ranks, but separated by "|" instead of ";" and with a "g:" prefix on the genus and a "+" inside the species name.
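Before converting anything, it may help to check how many "|"-separated lineage fields each header actually has. The following is only a rough sketch: the function name count_lineage_fields is made up for illustration, and it assumes every header uses "|" as the rank separator, as in the single header shown above.

```shell
# Hypothetical helper: reads FASTA on stdin and tallies how many
# "|"-separated fields each header line contains.
# Output columns: <number of headers> <number of fields>
count_lineage_fields() {
  grep '^>' | awk -F'|' '{ print NF }' | sort -n | uniq -c
}

# Possible usage (file name taken from the post):
# zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | count_lineage_fields
```

If the tally shows more than one field count, the same field position will not point to the same rank in every header.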

Has anyone used EukRibo with DADA2? Is there any way to convert this database for DADA2?

DADA2 Silva fasta metabarcoding EukRibo • 888 views
10 months ago

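Not a definitive answer, but an update. It is indeed possible to reformat the EukRibo database with the following one-liner, which keeps fields 1, 5, 9, 12, 13 and 15 of each header, replaces "+" with "_", and strips the sequence ID: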

zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | awk 'BEGIN { FS = "|" } /^>/ { print $1";"$5";"$9";"$12";"$13";"$15";"; next } 1' | tr "+" "_" | sed 's/[^[:space:]]*>\([^[:space:]]*\)[[:space:]]*/>/g' | gzip > 46346_EukRibo-02_full_seqs_2022-07-22_EDIT.fas.gz

But as mentioned in the EukRibo publication, the number of taxonomic ranks varies between entries in this database, so fixed field positions do not line up with the rank columns DADA2 outputs. In the following example, a genus ends up in the Order column and a binomial species name in the Family column, with a stray carriage return ("\r") attached:

     Kingdom     Phylum        Class            Order              Family                     Genus
[1,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[2,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[3,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[4,] "Eukaryota" "Nucletmycea" "Ascomycota"     "g:Tetrapisispora" "Tetrapisispora_blattae\r" NA   
[5,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[6,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA  
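One possible way around the shifting columns, sketched here and not tested against the full database, is to locate the genus by its "g:" prefix instead of by position, take the species from the last field, and keep only a fixed number of leading ranks. Everything in this sketch is an assumption drawn from the single header shown in the question: that the genus field always starts with "g:", that the species is always the last field, and that carriage returns come from Windows line endings in the source file. The function name eukribo_to_dada2 and the "unclassified" padding value are arbitrary choices.

```shell
# Hypothetical helper: reads EukRibo-style FASTA on stdin and writes
# FASTA with "rank1;rank2;...;genus;species;" headers on stdout.
# Optional first argument: how many leading ranks to keep (default 4).
eukribo_to_dada2() {
  awk -v nranks="${1:-4}" -v pad="unclassified" '
    { sub(/\r$/, "") }                     # strip a trailing carriage return from every line
    /^>/ {
      sub(/^>[^ \t]+[ \t]+/, "")           # drop ">" and the sequence ID
      n = split($0, f, "|")
      gi = 0                               # index of the "g:" field, 0 if absent
      for (i = 1; i <= n; i++) if (f[i] ~ /^g:/) gi = i
      last_high = (gi > 0 ? gi : n) - 1    # last field above genus (or above species)
      hdr = ">"
      for (i = 1; i <= nranks; i++) hdr = hdr (i <= last_high ? f[i] : pad) ";"
      genus = (gi > 0 ? substr(f[gi], 3) : pad)
      species = f[n]
      gsub(/\+/, "_", species)
      print hdr genus ";" species ";"
      next
    }
    { print }                              # sequence lines pass through unchanged
  '
}

# Possible usage (file names taken from the post):
# zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | eukribo_to_dada2 4 | gzip > 46346_EukRibo-02_full_seqs_2022-07-22_EDIT.fas.gz
```

This only guarantees that genus and species land in the last two columns. The leading fields are still whatever EukRibo lists first for each lineage, so they are not guaranteed to be the same rank across entries. assignTaxonomy has a taxLevels argument, so the output column names can be set to match however many levels are kept.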

So using this, you were able to assign taxonomy to sequences with the reformatted EukRibo database? I'm looking to do the same thing but have little experience in this area.


It did run, but the results showed mismatched taxonomic ranks (as shown above), and the species assigned did not match what we were expecting at all. So I personally gave up on EukRibo and now use PR2 for identifying eukaryotes.
