Question

Ribosomal reads in shotgun metagenomics data

6.9 years ago

grp2009 ▴ 60

I am looking at some shotgun (not amplicon) metagenomics data, and have observed that among the reads that are classified as belonging to specific bacteria, most are from ribosomal genes (as determined later by BLAST). This is despite the fact that this is not targeted amplicon sequencing. My interpretation is that most of the bacteria in the sample are absent from the reference database used for classification, but that due to the high level of conservation of the ribosomal genes, these are still appearing in the classification results because those portions of the genomes are "close enough" to previously sequenced genomes.

My first question is: is this a plausible interpretation of what I'm observing? Follow-up: is it a common issue with shotgun metagenomics? Secondly (let me know if this should be a separate question): is there an efficient way to "fish out" previously unclassified reads based on their overlap with a particular set of ribosomal reads from the data? I suppose this would amount to doing genome assembly, but using certain selected reads as a target or seed for assembly.

Background: What we have is Illumina paired-end (2x150 bp) data from shotgun metagenomics, which I have run through Kraken (using the 8 GB MiniKraken database). The first thing I notice is that 99.9% of the reads are unclassified. That seems to hold true with other methods of classification (MetaPhlAn2 and a cursory BLAST of a few reads). A small fraction of reads are classified as belonging to certain bacteria. I mapped those reads to the corresponding genome using Bowtie2, hoping to validate the presence of that bug in the sample. After mapping, I see very clear peaks in coverage, rather than reads mapping throughout the genome. Furthermore, the mapped reads BLAST to ribosomal sequences.
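To put a number on the "peaks rather than even coverage" observation, here is a minimal sketch. It assumes per-position depth in the three-column, tab-separated layout (reference, 1-based position, depth) that `samtools depth` writes for a single BAM file. The function names, the 1 kb window size, and the summary keys are my own choices for illustration, not part of any tool.

```python
from typing import Dict, Iterable, List, Tuple


def parse_depth(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse tab-separated (reference, position, depth) lines into (pos, depth) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _ref, pos, depth = line.split("\t")
        records.append((int(pos), int(depth)))
    return records


def coverage_summary(
    depths: List[Tuple[int, int]], genome_length: int, window: int = 1000
) -> Dict[str, float]:
    """Summarise how concentrated the coverage is along one reference.

    breadth            fraction of reference positions with depth > 0
    top_window_share   share of all aligned bases that fall in the single
                       most covered window
    windows_with_reads number of windows that contain any coverage
    """
    covered = sum(1 for _, d in depths if d > 0)
    total = sum(d for _, d in depths)
    per_window: Dict[int, int] = {}
    for pos, d in depths:
        if d > 0:
            w = (pos - 1) // window  # positions are 1-based
            per_window[w] = per_window.get(w, 0) + d
    top = max(per_window.values()) if per_window else 0
    return {
        "breadth": covered / genome_length,
        "top_window_share": top / total if total else 0.0,
        "windows_with_reads": len(per_window),
    }
```

A very low breadth together with a top-window share close to 1 would be consistent with reads piling up on a few conserved loci (such as rRNA operons) rather than covering the whole genome.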

metagenomics sequencing microbiome • 1.9k views

updated 6.8 years ago by Biostar 20 • written 6.9 years ago by grp2009 ▴ 60

is there an efficient way to "fish out" previously unclassified reads based on their overlap with a particular set of ribosomal reads from the data?

You can bin/fish out reads from a dataset with bbsplit.sh from BBMap and a list of reference sequences you are interested in.
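A rough sketch of what such a call might look like, wrapped in Python so the command is assembled in one place. The file names are hypothetical, and the parameter names (`in1`, `in2`, `ref`, `basename`, `outu1`, `outu2`) are written as I understand them from the BBMap usage notes, so please check them against the usage text of your installed `bbsplit.sh` before running anything.

```python
import shlex
from typing import List, Sequence


def build_bbsplit_command(
    reads_1: str,
    reads_2: str,
    references: Sequence[str],
    out_prefix: str = "binned",
) -> List[str]:
    """Assemble a bbsplit.sh argument list for paired-end reads.

    Assumption: in 'basename', '%' is replaced by the reference name and
    '#' by the mate number; verify this against your BBMap version.
    """
    if not references:
        raise ValueError("at least one reference FASTA is required")
    return [
        "bbsplit.sh",
        f"in1={reads_1}",
        f"in2={reads_2}",
        "ref=" + ",".join(references),
        f"basename={out_prefix}_%_#.fq",
        f"outu1={out_prefix}_unmatched_1.fq",  # reads matching no reference
        f"outu2={out_prefix}_unmatched_2.fq",
    ]


if __name__ == "__main__":
    # Hypothetical input files, for illustration only.
    cmd = build_bbsplit_command(
        "sample_R1.fq", "sample_R2.fq", ["rrna_set_a.fa", "rrna_set_b.fa"]
    )
    print(shlex.join(cmd))  # inspect first, then run e.g. with subprocess.run(cmd)
```

Printing the command before running it makes it easy to compare against the tool's own help output.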

I suppose this would amount to doing genome assembly, but using certain selected reads as a target or seed for assembly.

How so?

6.9 years ago by GenoMax 147k

From what I understand, bbsplit would allow me to map reads to reference genomes. In contrast, I'm talking about reads that do not map to any known reference genome. I would like to connect these unmapped reads (by overlapping/assembling) to the reads that mapped (imperfectly) to known ribosomal genes. In essence, this amounts to doing assembly of the metagenomic reads, and then picking the contig that includes the ribosomal sequence of interest. But it would be nice to be able to do this without attempting a complete assembly of all the reads.

To put it another way: in my metagenomic data I have reads that are "close" to the ribosomal genes of bacterium B (close enough to map or BLAST). But I don't think that I have bacterium B in the sample, because the rest of its genome is entirely missing from my data. Instead, I think I have some other related bacterium, X, which is not represented in the current reference databases. I want to try to get as much as possible of X's genome, by seeing what reads overlap with the ribosomal reads (hence my use of the word "assembly").
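To make the idea concrete, here is a toy sketch of the kind of seed-based recruitment I have in mind. It is not an existing tool and not a real assembler: it uses shared k-mers as a crude stand-in for read overlap, and the function names, `k`, `min_shared`, and `max_rounds` are all placeholders I made up. Starting from the ribosomal reads, it repeatedly pulls in any read that shares a k-mer with what has been recruited so far.

```python
from typing import Dict, Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """All k-mers of seq, each stored as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if "N" in kmer:
            continue  # skip k-mers containing ambiguous bases
        kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def recruit_reads(
    seed_reads: Iterable[str],
    pool: Dict[str, str],
    k: int = 31,
    min_shared: int = 1,
    max_rounds: int = 10,
) -> Set[str]:
    """Iteratively recruit reads from pool (name -> sequence) that share k-mers with the seeds.

    Each round adds the k-mers of newly recruited reads to the bait set,
    so recruitment can walk outward from the seed region.
    """
    bait: Set[str] = set()
    for seq in seed_reads:
        bait |= canonical_kmers(seq, k)
    read_kmers = {name: canonical_kmers(seq, k) for name, seq in pool.items()}
    recruited: Set[str] = set()
    for _ in range(max_rounds):
        new = [
            name
            for name, kmers in read_kmers.items()
            if name not in recruited and len(kmers & bait) >= min_shared
        ]
        if not new:
            break  # nothing left that overlaps the growing bait set
        for name in new:
            recruited.add(name)
            bait |= read_kmers[name]
    return recruited
```

One obvious caveat: because ribosomal genes are so conserved, the seed reads may come from several organisms at once, so whatever gets recruited this way could be a mixture rather than a single genome X.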

6.9 years ago by grp2009 ▴ 60