Question

Formatting issues using DADA2 with the EukRibo database

10 months ago

Begonia_pavonina ▴ 200

I have been using DADA2 with the Silva and PR2 databases. Both provide a training set that can be used to assign taxonomy with the assignTaxonomy command.

taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v132_train_set.fa.gz")

I wish to use the EukRibo database (https://zenodo.org/records/6896896) but unfortunately the formatting is not the same.

For example, in Silva the training file and the species assignment file are formatted as follows.

The training file shows the taxonomic ranks and the sequence:

zcat silva_nr_v132_train_set.fa.gz | head -3

>Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus_Regiella;
TTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCG

The species assignment file shows ID, genus and species:

zcat silva_species_assignment_v132.fa.gz | head -3

>AC201869.46386.47908 Regiella insecticola
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGCAGCGGGGAGTAGCTTGCTACTCTGC

However, in EukRibo the formatting of the full sequences file is:

zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | head -3

>AB000271 Eukaryota|Diaphoretickes|Sar|Alveolata|Myzozoa|AC-clade|Apicomplexa|CM-group|coccidiomorphea|hematozoans|HP-clade|Piroplasmorida|Theileridae|g:Theileria|Theileria+sergenti
AACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTTAAAGATTAAGCCATGCATGTCT

This seems to be an ID followed by the taxonomic ranks, in a different format.
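Before converting by fixed field position, it may help to check whether every header has the same number of "|"-separated fields. The following is only a minimal sketch: the function name count_rank_depths is made up for illustration, and it assumes the header layout shown above (one line per header, fields separated by "|").

```shell
# Hypothetical helper: tally how many "|"-separated fields each FASTA header has.
# Prints "<number of fields> <number of headers>", sorted by number of fields.
count_rank_depths() {
  awk -F'|' '
    /^>/ { depth[NF]++ }                          # count headers per field number
    END  { for (n in depth) print n, depth[n] }
  ' | sort -n
}

# Possible usage with the file from this post:
# zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | count_rank_depths
```

If more than one line is printed, the lineages have different depths, and a given field number will not always correspond to the same rank.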

Has anyone used EukRibo with DADA2? Is there any way to convert this database for DADA2?

DADA2 Silva fasta metabarcoding EukRibo

9 months ago by Begonia_pavonina ▴ 200

score 0 · Answer 1 · 2024-01-18

Not a definitive answer, but an update. It is indeed possible to reformat the EukRibo database with the following command:

zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | awk 'BEGIN { FS = "|" } /^>/ { print $1";"$5";"$9";"$12";"$13";"$15";"; next } 1' | tr "+" "_" | sed 's/[^[:space:]]*>\([^[:space:]]*\)[[:space:]]*/>/g' | gzip > 46346_EukRibo-02_full_seqs_2022-07-22_EDIT.fas.gz

But as mentioned in the EukRibo publication, the number of taxonomic ranks varies between entries in this database, so fields picked by fixed position do not line up with the ranks in the DADA2 output. In the following example, a genus ends up in the Order column and a binomial species name in the Family column.

     Kingdom     Phylum        Class            Order              Family                     Genus
[1,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[2,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[3,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[4,] "Eukaryota" "Nucletmycea" "Ascomycota"     "g:Tetrapisispora" "Tetrapisispora_blattae\r" NA   
[5,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA   
[6,] "Eukaryota" "Chlorophyta" "Sphaeropleales" ""                 ""                         NA
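One possible workaround, offered as an untested-on-the-real-database sketch rather than a definitive conversion: locate the genus by its "g:" prefix and the species as the last field, instead of relying on fixed positions, and strip the trailing carriage return that shows up as "\r" in the output above. The function name reformat_eukribo and the choice of keeping the first three fields as upper ranks are assumptions made for illustration; which upper ranks are meaningful depends on the EukRibo lineage, and lineages with fewer than three fields before the genus produce empty ranks here.

```shell
# Hypothetical helper: rewrite EukRibo headers as
#   >Field1;Field2;Field3;Genus;Species;
# Sequence lines are passed through unchanged (minus any carriage return).
reformat_eukribo() {
  awk '
    { sub(/\r$/, "") }                      # drop Windows carriage returns
    /^>/ {
      sub(/^>/, "")                         # remove the leading ">"
      sub(/^[^ |]+ /, "")                   # remove the sequence ID, if present
      n = split($0, f, "|")
      genus = ""; gi = 0
      for (i = 1; i <= n; i++)              # genus is the field marked with "g:"
        if (f[i] ~ /^g:/) { genus = substr(f[i], 3); gi = i }
      species = f[n]                        # species is assumed to be the last field
      gsub(/[+]/, "_", species)
      last_upper = (gi ? gi : n) - 1        # upper ranks end before genus/species
      out = ">"
      for (i = 1; i <= 3; i++)              # assumption: keep the first three fields
        out = out (i <= last_upper ? f[i] : "") ";"
      print out genus ";" species ";"
      next
    }
    { print }
  '
}

# Possible usage with the file from this post:
# zcat 46346_EukRibo-02_full_seqs_2022-07-22.fas.gz | reformat_eukribo | gzip > 46346_EukRibo-02_full_seqs_2022-07-22_EDIT.fas.gz
```

With this layout every header has five levels, so the taxLevels argument of assignTaxonomy would need five matching names. Whether DADA2 accepts the empty ranks produced for short lineages is not verified here.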