Question

How to identify 16s sequences from binning data(contigs)?

8

Entering edit mode

9.7 years ago

liuyifan2014 ▴ 110

Hi everyone, I got some binning data and I want to retrieve 16s sequences from them. The binning data are supposed to be pure single genome, but each of them is composed of several contigs. So it may cause problem if I submit them to 16sr RNA identifier like RNAmmer which requires single genome sequence file. Do you know any program serves this purpose? Many thanks:)~

genome Assembly sequence blast • 16k views

ADD COMMENT • link updated 8 weeks ago by Brian Bushnell 20k • written 9.7 years ago by liuyifan2014 ▴ 110

Ram · Answer 1 · 2015-09-30

16

Entering edit mode

9.7 years ago

Brian Bushnell 20k

BBDuk works well for this purpose, if you have a large set of curated ribosomal sequences (such as Silva). You can run it on the reads or the assemblies.

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=silva.fasta

Though, I'm not quite clear why you want to remove the 16S.

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you. I don't want to remove them, I just want to extract them for phylogenetic classification.

ADD REPLY • link 9.7 years ago by liuyifan2014 ▴ 110

7

Entering edit mode

I put a link to a set of ribosomal kmers on Google drive:

https://drive.google.com/file/d/0B3llHR93L14wS2NqRXpXakhFaEk/view?usp=sharing

I made it mainly from Silva. It's small (9MB) and you can use it with BBDuk like this:

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=ribokmers.fa.gz

It has roughly 99.94% sensitivity against the full Silva database.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.7 years ago by Brian Bushnell 20k

1

Entering edit mode

This works quite nice! I would like to build the ribokmer reference also for other genes. I tried it with kmercountexact from bbmap but I can not replicate the file provided on google drive. Can you shed some light on how to this reference set of ribosomal kmers? Thanks!

ADD REPLY • link 9.4 years ago by m.dehollander ▴ 50

2

Entering edit mode

The process was a little involved. I started with the Silva ribosomal database, and followed this procedure:

Deduplicated the sequences with dedupe.sh, since there are lots of redundant copies.
Ran them through kcompress.sh, which produces a fasta file containing all the kmers of interest. I made several different versions; one containing all 31-mers, one containing only 31-mers that occurred at least 2 times, one for 3 times, etc. up to 50 times (there is a flag for kcompress which specifies this). That allows variable sensitivity/specificity; the 1-copy version is the biggest and most sensitive, while the 50-copy version is tiny and tends to only contain the most highly conserved ribosomal kmers (or the ones for the organisms that are most popular to sequence so they are in the database a lot).
Then I generated synthetic data from Silva and tested the sensitivity of the different versions. The point of this was to salvage kmers that were important but missing. So, for example, I ran BBDuk with the 10-copy kmer set and kept the reads that did not match, and added in some of the most common kmers from those nonmatching reads; this keeps the file size small but increases sensitivity. This step is not really necessary, though - you can just do 1 and 2 in general, but this is what I did.

The file on my google drive is, I think, the version in which I kept only kmers present at least 3 times in the deduplicated Sliva database.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Brian Bushnell 20k

0

Entering edit mode

Do you know which release of silva this file corresponds with? https://www.arb-silva.de/no_cache/download/archive

Would you recommend a different approach today since Silva provide non-redundant files?:

SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz  65 MB
SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz  189 MB

Using those 2 files only take an extra 2 minute in comparison to the 9.5 MB you shared, and it removes the same amount of reads 4.7%

ADD REPLY • link 23 months ago by LayneSadler ▴ 90

0

Entering edit mode

Hi LayneSadler — were you able to resolve this issue? Also, if you ended up creating your own database using a more recent SILVA release, could you share how you did it? I built mine using kcompress.sh, loosely following as outlined above, but the resulting file seems too large, and Java runs out of memory regardless of the settings I try.

ADD REPLY • link 11 weeks ago by Laura • 0

0

Entering edit mode

Hi Laura, did you solve the problem, or was there something you need in addition to the version on my google drive? I do have a couple other smaller versions floating around IIRC.

ADD REPLY • link 8 weeks ago by Brian Bushnell 20k

0

Entering edit mode

Cool,have you got documentation or publication for this tool?

ADD REPLY • link 9.7 years ago by liuyifan2014 ▴ 110

0

Entering edit mode

There's no paper yet. Documentation is in the shell script (it is printed if you run it with no arguments). There's also a thread here explaining common uses.

Edit: There is also now a BBDuk usage guide in the /docs/ directory.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Sorry to resurrect a thread. Would you happen to have the information regarding organisms that each of those reads map to? @Brian Bushnell

ADD REPLY • link 5.6 years ago by lawrence.hsu ▴ 10

Ram · Answer 2 · 2015-10-01

1

Entering edit mode

9.7 years ago

5heikki 11k

ssu-align is pretty great for identifying 16S and 18S

ADD COMMENT • link 9.7 years ago by 5heikki 11k

0

Entering edit mode

Thank you. But ssu-align is specific designed for PCR-based SSU sequences. My data is draft genome data , most of which I think are non-SSU sequences. What's more, the SSU sequences in my file are discrete which my be divided in to many contigs or may be just little piece of a big contige. Do you have any idea about these problems? I will try it anyway :)

ADD REPLY • link 9.7 years ago by liuyifan2014 ▴ 110

1

Entering edit mode

It's hmm based general search tool. I have used it for screening 16S and 18S from metagenomic assemblies. At least in my assemblies, the 16S fragments are generally in the ends of contigs. I haven't looked into it further, but I think it's due to too high sequence similarity (same k-mers) in very conserved regions of the molecule.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.7 years ago by 5heikki 11k

0

Entering edit mode

It works, thank you!

ADD REPLY • link 9.7 years ago by liuyifan2014 ▴ 110

score 1 · Answer 3 · 2015-10-02

1

Entering edit mode

9.7 years ago

h.mon 35k

hmm_RNA uses hidden Markov models to find SSU and LSU on metagenomic assemblies. It is sensitive, finding even small fragments on draft assemblies, but its identification is inaccurate, calling "Archeal" or "Eukariotic" several bacterial sequences.

ADD COMMENT • link 9.7 years ago by h.mon 35k

0

Entering edit mode

It works, thank you!

ADD REPLY • link 9.7 years ago by liuyifan2014 ▴ 110

score 1 · Answer 4 · 2021-05-17

1

Entering edit mode

4.1 years ago

O.rka ▴ 740

I might be a little late to the game but since 2018, BARRNAP made by Torsten Seemann, the guy who made Prokka, https://github.com/tseemann/barrnap has been my GO-TO. He writes really great software.

ADD COMMENT • link 4.1 years ago by O.rka ▴ 740