How to identify 16s sequences from binning data(contigs)?
4
8
Entering edit mode
9.1 years ago
liuyifan2014 ▴ 110

Hi everyone, I got some binning data and I want to retrieve 16s sequences from them. The binning data are supposed to be pure single genome, but each of them is composed of several contigs. So it may cause problem if I submit them to 16sr RNA identifier like RNAmmer which requires single genome sequence file. Do you know any program serves this purpose? Many thanks:)~

genome Assembly sequence blast • 15k views
ADD COMMENT
16
Entering edit mode
9.1 years ago

BBDuk works well for this purpose, if you have a large set of curated ribosomal sequences (such as Silva). You can run it on the reads or the assemblies.

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=silva.fasta

Though, I'm not quite clear why you want to remove the 16S.

ADD COMMENT
0
Entering edit mode

Thank you. I don't want to remove them, I just want to extract them for phylogenetic classification.

ADD REPLY
7
Entering edit mode

I put a link to a set of ribosomal kmers on Google drive:

https://drive.google.com/file/d/0B3llHR93L14wS2NqRXpXakhFaEk/view?usp=sharing

I made it mainly from Silva. It's small (9MB) and you can use it with BBDuk like this:

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=ribokmers.fa.gz

It has roughly 99.94% sensitivity against the full Silva database.

ADD REPLY
1
Entering edit mode

This works quite nice! I would like to build the ribokmer reference also for other genes. I tried it with kmercountexact from bbmap but I can not replicate the file provided on google drive. Can you shed some light on how to this reference set of ribosomal kmers? Thanks!

ADD REPLY
2
Entering edit mode

The process was a little involved. I started with the Silva ribosomal database, and followed this procedure:

  1. Deduplicated the sequences with dedupe.sh, since there are lots of redundant copies.
  2. Ran them through kcompress.sh, which produces a fasta file containing all the kmers of interest. I made several different versions; one containing all 31-mers, one containing only 31-mers that occurred at least 2 times, one for 3 times, etc. up to 50 times (there is a flag for kcompress which specifies this). That allows variable sensitivity/specificity; the 1-copy version is the biggest and most sensitive, while the 50-copy version is tiny and tends to only contain the most highly conserved ribosomal kmers (or the ones for the organisms that are most popular to sequence so they are in the database a lot).
  3. Then I generated synthetic data from Silva and tested the sensitivity of the different versions. The point of this was to salvage kmers that were important but missing. So, for example, I ran BBDuk with the 10-copy kmer set and kept the reads that did not match, and added in some of the most common kmers from those nonmatching reads; this keeps the file size small but increases sensitivity. This step is not really necessary, though - you can just do 1 and 2 in general, but this is what I did.

The file on my google drive is, I think, the version in which I kept only kmers present at least 3 times in the deduplicated Sliva database.

ADD REPLY
0
Entering edit mode

Do you know which release of silva this file corresponds with? https://www.arb-silva.de/no_cache/download/archive

Would you recommend a different approach today since Silva provide non-redundant files?:

SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz  65 MB
SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz  189 MB

Using those 2 files only take an extra 2 minute in comparison to the 9.5 MB you shared, and it removes the same amount of reads 4.7%

ADD REPLY
0
Entering edit mode

Cool,have you got documentation or publication for this tool?

ADD REPLY
0
Entering edit mode

There's no paper yet. Documentation is in the shell script (it is printed if you run it with no arguments). There's also a thread here explaining common uses.

Edit: There is also now a BBDuk usage guide in the /docs/ directory.

ADD REPLY
0
Entering edit mode

Sorry to resurrect a thread. Would you happen to have the information regarding organisms that each of those reads map to? @Brian Bushnell

ADD REPLY
1
Entering edit mode
9.1 years ago
5heikki 11k

ssu-align is pretty great for identifying 16S and 18S

ADD COMMENT
0
Entering edit mode

Thank you. But ssu-align is specific designed for PCR-based SSU sequences. My data is draft genome data , most of which I think are non-SSU sequences. What's more, the SSU sequences in my file are discrete which my be divided in to many contigs or may be just little piece of a big contige. Do you have any idea about these problems? I will try it anyway :)

ADD REPLY
1
Entering edit mode

It's hmm based general search tool. I have used it for screening 16S and 18S from metagenomic assemblies. At least in my assemblies, the 16S fragments are generally in the ends of contigs. I haven't looked into it further, but I think it's due to too high sequence similarity (same k-mers) in very conserved regions of the molecule.

ADD REPLY
0
Entering edit mode

It works, thank you!

ADD REPLY
1
Entering edit mode
9.1 years ago
h.mon 35k

hmm_RNA uses hidden Markov models to find SSU and LSU on metagenomic assemblies. It is sensitive, finding even small fragments on draft assemblies, but its identification is inaccurate, calling "Archeal" or "Eukariotic" several bacterial sequences.

ADD COMMENT
0
Entering edit mode

It works, thank you!

ADD REPLY
1
Entering edit mode
3.5 years ago
O.rka ▴ 740

I might be a little late to the game but since 2018, BARRNAP made by Torsten Seemann, the guy who made Prokka, https://github.com/tseemann/barrnap has been my GO-TO. He writes really great software.

ADD COMMENT

Login before adding your answer.

Traffic: 932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6