Hi all,
A short background: I have 4 metagenomes. In all of them, two very closely related populations of the same bacterial cluster co-occur. I obtained 4 final bins, and according to the ANI values, I have two different genomes (ANI around 80).
The point is, when I extract the rRNA genes from the bins (assembled with SPAdes), the rRNA looks identical. That should not be possible - the only explanation I came up with is that SPAdes is not able to resolve the 16S, so it assembles just one instead of two. Roughly speaking, it 'mixes' the reads from the original rRNAs and ends up with just one sequence.
Is there any tool that would allow me to extract the two different rRNAs? I had a look into reago, but it looks like it is not updated anymore, and there are some parameters to set which are not clear (e.g., -l for the length of the input reads, which is supposed to be uniform).
Thanks all and have a nice weekend!
Stefano
CheckM has an SSU finder option. I do not know how it works under the hood, or how sensitive it is.
What do the other single-copy markers look like? Are they identical, too?
Thanks for your answer - I haven't used that CheckM function yet, but I guess it works like rnammer? Also, I didn't do any analysis on other marker genes and that would be a nice idea, but it still won't help explain the identical 16S content in two different subpopulations.
It is very feasible for the 16S to be identical between your bins. E. coli and Shigella - different species - apparently have nearly identical 16S. In the oceans, all Prochlorococcus strains - from the entire planet, all ocean depths, low and high light types - are within 97% similarity. I'm not saying your 16S should be very close or identical, but if they are indeed locally adapted populations of the same species, it's possible.
I would need more time to investigate CheckM. Maybe it does in fact use the exact same rnammer used by other pipelines, like PROKKA.
Thanks again for your answer - what you say is indeed right; but when the 16S shows >98-99% similarity, I expect the genomes to have ANI values above 94. I cannot figure out how two populations which have quasi-identical 16S (1511/1533 nucleotides matching, 99%) return ANI values of 80. That's why I initially thought that the 16S's were not correctly assembled in the first place. I also have OTUs obtained with tag sequencing supporting the hypothesis that the dominant (sub)populations are two.
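For reference, the arithmetic behind the identity quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name percent_identity is made up for illustration, and the two numbers are the ones from the post.

```python
def percent_identity(matches: int, aligned_length: int) -> float:
    """Percent identity of an alignment: matching positions over aligned length."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * matches / aligned_length


# Numbers quoted in the post: 1511 matching positions out of 1533.
identity = percent_identity(1511, 1533)
mismatches = 1533 - 1511
print(f"{identity:.2f}% identity, {mismatches} mismatches")
```

So the "99%" is rounded up from about 98.6%, i.e. 22 differing positions over the aligned length.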
Mysteries...
The repository you linked is unsupported, but there is a link to the newer version of the tool.
-l is the length of the reads, typically 100 or 125 for HiSeq, and 250 or 300 (good luck on this one) for MiSeq. But there seem to be problems associated with read length anyway. About your assembly, did you use SPAdes or metaSPAdes? metaSPAdes would probably be more appropriate.
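Since the read length is supposed to be uniform, it may be worth checking how uniform the reads really are, especially after any trimming. A minimal Python sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines); the function name and the toy records are made up for illustration:

```python
import io
from collections import Counter


def read_length_counts(fastq_handle):
    """Count reads per sequence length in a 4-line-per-record FASTQ stream."""
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        # In a 4-line record, the sequence is the second line (index 1).
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


# Toy input standing in for a real file (hypothetical reads).
toy_fastq = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
lengths = read_length_counts(toy_fastq)
print(dict(lengths))
```

For a real file you would pass an open file handle instead of the StringIO object. More than one key in the result means the reads are not of uniform length.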
You can map your reads back to the assembly, and manually inspect the 16S rRNA regions. You may get an idea if it is indeed possible to untangle the different rRNAs.
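To give an idea of what to look for when inspecting the mapped reads: positions in the 16S region where a second base is supported by a sizeable fraction of reads would hint that two variants were collapsed into one sequence. Below is a rough Python sketch of that idea. It assumes the reads over the region have already been extracted as equal-length aligned strings with '-' where a read has no coverage; the function name, thresholds, and toy reads are all made up for illustration.

```python
from collections import Counter


def mixed_positions(aligned_reads, min_minor_fraction=0.2, min_depth=4):
    """Return (position, base counts) for columns where a second base is frequent.

    Assumes all strings have the same length; '-' marks no coverage.
    """
    if not aligned_reads:
        return []
    width = len(aligned_reads[0])
    hits = []
    for pos in range(width):
        # Keep only real bases at this column.
        column = [read[pos] for read in aligned_reads if read[pos] in "ACGT"]
        if len(column) < min_depth:
            continue
        counts = Counter(column)
        ranked = counts.most_common(2)
        # Flag the column if the second most common base is frequent enough.
        if len(ranked) == 2 and ranked[1][1] / len(column) >= min_minor_fraction:
            hits.append((pos, counts))
    return hits


# Toy reads (hypothetical): position 4 is split between A and T.
toy_reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTTCGTAC",
    "ACGTTCGTAC",
    "ACGTACGTAC",
    "--GTTCGTAC",
]
hits = mixed_positions(toy_reads)
for pos, counts in hits:
    print(pos, dict(counts))
```

If the mixed positions are further apart than a read (or read pair) can span, the variants cannot be phased from short reads alone, which would explain a single consensus 16S.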
Thanks for your answer - I linked the wrong repo (fixed) but I looked into the right one :D Good to know about the length. I actually ended up in your 'good luck with this one' case, as I have 2x300 bp Illumina reads.
For the genome reconstruction, the pipeline was:
1. assembly with metaSPAdes
2. binning with Metawatt
3. extracting the bin of interest
4. mapping reads against the bin
5. reassembly with SPAdes (at this stage we are closer to a genome assembly)
6. rebinning with Metawatt
7. CheckM to check completeness
8. RefineM to refine the bin
I inspected the bin after the first assembly and the reassembly in 2 (actually 3) ways: quick annotation with prokka, rnammer, and manually blasting the rna output of Metawatt. Although I retrieve two genomes as per ANI, the rRNA is identical. I guess the problem comes from the assembly, which is not fine-tuned for assembling 16S rRNA genes.
I thought of using Phyloflash, but it uses SPAdes anyway.
My comment regarding MiSeq 2x300bp reads is due to their known poor quality - I hope you are among the lucky ones that can get good data out.