Question

extremely low classified reads from sequencing gut microbiome data, applying reads to another reference database?

1

Entering edit mode

3.6 years ago

jobie1 ▴ 30

I am new to bioinformatics and have a question about potentially using the Sequence Reach Archive for taxonomic classification to improve the amount of classified reads obtained from a sample using WGS. I am not sure if this is a good idea, or even possible since it is more of an archive after all, it seems most people download single sequences from the SRA to use in experiments.

Here is a general explanation of this class project I working on for an intro class to bioinformatics: I have some metagenomic data from one seabird gut microbiome sample that was sequenced using Illumina MiSeq (paired-end mode, length approx. 400-500 bp (300+300 bp with ~100-200 bp overlap), it was already analyzed using this workflow: https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-Standard-Operating-Procedure-v3. According to this workflow, they use Kraken2+Bracken and the NCBI Complete RefSeq database. The amount of classified reads that came back from the analysis was around 5%, extremely low!

So for the class project I want to try and increase the number of classified reads for this sample. At the advice of my instructor, I was going to try and use the Sequence Read Archive as an alternative database to classify the reads taxonomically. Since there is a larger amount of data in that archive and will probably give a higher level of classification, is this even possible to do?

Note: the Github linked above they recommended using their VM, VirtualBox (Microbiome Helper VBox), to run small-scale metagenomic analyses on your laptop, so I started my analysis by running through the code provided on their Github. Now that I've come to the step where I have to classify these reads to a new reference database, I'm stuck.

Seeing as SRA is such a large database, I was trying to learn how to download their databases through AWS, but I would have to do this through their virtual machine to run their code even if it is possible. Since I am so new to this I am having trouble understanding how I can proceed now, if anyone has any advice I would really appreciate it. Thank you!

SRA metagenomics WGS bracken kraken2 • 2.8k views

ADD COMMENT • link 3.6 years ago by jobie1 ▴ 30

score 2 · Answer 1 · 2021-11-14

Can you confirm that you are working with shotgun metagenomic data and not 16S rRNA amplicons? It seems strange that shotgun metagenomic data would be generated on an Illumina MiSeq instrument and have a 100-200 bp overlap -- this sounds like an amplicon library, which might require a different reference database (e.g., SILVA, GreenGenes, or RDP).

To answer your original question, building a database from the SRA is probably not feasible in this context, because you wouldn't have any information about the taxonomy of those reads, which Kraken2 requires for building a custom database. If you are using a MiniKraken database, then that might explain the low classification rate; if you are using the standard Kraken database, then this would be somewhat surprising.

However, it might be that birds have a majority of microbes that simply are not present in current reference database, hence the low classification rate (still surprising though). Have you checked the proportion of host (e.g., bird) vs. non-host (e.g., microbial) reads in your dataset? Maybe your sample is overwhelmed with bird DNA and has very little microbial DNA -- it seems unlikely for a gut sample, but might be worth checking out.

I would recommend revisiting the quality control, trimming and classification parameters before moving forward. I hope this helps!

GenoMax · Answer 2 · 2021-11-15

1

Entering edit mode

3.6 years ago

colindaven 7.5k

Chris's answer is very good. A few notes from my side:

The SRA is a READ archive, not a genome archive. Don't use it.
You could try a few blasts to see if you can get anything else other that what you have already. I wrote a cluster blast for this using nextflow, which is ok for ~10-100k reads mostly
try centrifuge/metaphlan3 as an alternative
check your reads aren't all duplicates / rubbish with fastqc/multiqc
include the host genome (even a bunch of combined eg cat'ted contigs if available) in any custom reference sequence you put in.

ADD COMMENT • link 3.6 years ago by colindaven 7.5k

0

Entering edit mode

Hi, thanks for your help. The fastqc did not yield terrible results, it didn't dip below a phred score of 20 at all.

When I stitch the reads of the sample using PEAR it gives me a warning to check the summary log because less than 90% of my reads were assembled. Do you know why this could be?

ID      Assembled   Discarded   Unassembled

Min     77.490      0.258       22.253

Mean    77.490      0.258       22.253

Max     77.490      0.258       22.253

COMU02-MN_S5    77.490      0.258       22.253

ADD REPLY • link updated 3.6 years ago by GenoMax 151k • written 3.6 years ago by jobie1 ▴ 30

1

Entering edit mode

johanna : Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

ADD REPLY • link 3.6 years ago by GenoMax 151k

0

Entering edit mode

When I stitch the reads of the sample using PEAR it gives me a warning to check the summary log because less than 90% of my reads were assembled. Do you know why this could be?

This means your inserts as short and the paired end reads are overlapping in the middle when they sample the library fragment.

ADD REPLY • link 3.6 years ago by GenoMax 151k

0

Entering edit mode

Sounds alright, I wouldn't be worried. But I think you need to answer key questions first

is this MiSeq data (2x300bp can only be MiSeq, as Nextseq can only do 2x150bp)
is this 16S amplicon data from the 16S gene only ..... or is it whole genome metagenomics (WGS) data ?

You need completely different pipelines for the two approaches.

For 16S I would recommend ampliseq https://nf-co.re/ampliseq/2.1.1/output

ADD REPLY • link 3.6 years ago by colindaven 7.5k

0

Entering edit mode

It is WGS (NextSeq)

The operating procedure I'm using now is for metagenomics analysis, SOPs linked on my original post.

ADD REPLY • link 3.6 years ago by jobie1 ▴ 30