Question

Functional annotation of metagenomic contigs

1

5.5 years ago

ARich ▴ 130

Dear Biostar users,

It's a very naive question, but I certainly lack some clarity here. I have contigs from MEGAHIT and would like to perform functional annotation. For this purpose I first used Prodigal for gene prediction and then Prokka for annotation.

My question is: if I would like to do functional annotation on my own instead of using Prokka, what do I have to do? I mean, which tools and references should I use?

Any suggestion will be of great help!

Cheers! AR

assembly • 3.5k views

ADD COMMENT • link updated 2.9 years ago by Mensur Dlakic ★ 28k • written 5.5 years ago by ARich ▴ 130

0

Prokka is an amazing piece of software. From convenience of use to results, it has never failed me. It is a collection of different programs, and the best place to start (if you don't want to use Prokka) would be Prokka itself, i.e. looking at which programs Prokka uses to produce its annotations. I hope I am making sense here.

ADD REPLY • link 5.5 years ago by microfuge ★ 2.0k

0

Thank you for the reply. Two questions:

1. Can Prokka take all kingdoms at once, like this: --kingdom 'Archaea|Bacteria|Mitochondria|Viruses'?
2. Can I also use Prokka on bins from MaxBin2 output? If so, should it be called on each bin the same way it is called on contigs?

ADD REPLY • link 5.5 years ago by ARich ▴ 130

score 2 · Answer 1 · 2019-07-01

2

5.5 years ago

Mensur Dlakic ★ 28k

It seems you have already binned your sequences. If so, the next step is to assess the bins for completeness and assign them to general taxonomic categories (if possible). A tool for that is CheckM. Its output looks something like this:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id              Marker lineage        # genomes   # markers   # marker sets    0     1     2    3   4   5+   Completeness   Contamination   Strain heterogeneity
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  group_000000     k__Archaea (UID146)          59         174           136         8    163    3    0   0   0       94.85            2.21               0.00
  group_000001      k__Archaea (UID2)          207         149           107         70    79    0    0   0   0       51.82            0.00               0.00
  group_000002    k__Bacteria (UID3060)        138         338           246        101   237    0    0   0   0       66.66            0.00               0.00
  group_000003   p__Euryarchaeota (UID3)       148         187           124         9    177    1    0   0   0       92.74            0.81               0.00
  group_000004      k__Archaea (UID2)          207         149           107         32   117    0    0   0   0       83.64            0.00               0.00
  group_000005   c__Thermoprotei (UID147)       54         217           168         4    212    1    0   0   0       98.21            0.60               0.00
  group_000006      k__Archaea (UID2)          207         149           107         5    105    38   1   0   0       95.79           19.52               0.00
  group_000007      k__Archaea (UID2)          207         149           107         1     8    140   0   0   0       99.07           92.21               0.00
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You should probably copy and paste the lines above into a wider screen so you can read them properly. Anyway, it shows that the first two bins in this sample are Archaea and the next one is Bacteria, so you can use that to specify the kingdom using Prokka. I don't think you can tell Prokka to look at all kingdoms.

My recommendation is to annotate using Prokka rather than manually. We are talking about 10 minutes vs. many hours or even days, and I am still not sure that manual annotation would be more successful. If you truly feel that your sequence annotation ability is much better than that of Prokka, you can always continue from the Prokka annotation and tackle the uncharacterized proteins.

ADD COMMENT • link 5.5 years ago by Mensur Dlakic ★ 28k

0

Thank you for this detailed reply!

You recommend Prokka as well! Question: would you recommend concatenating all the CheckM-assessed bins into one single file that can then be used as input for Prokka, or do you run Prokka on individual bins? And what is recommended for classifying bins taxonomically?

Cheers! AR

ADD REPLY • link 5.5 years ago by ARich ▴ 130

1

Prokka should run on individual bins. If you concatenate them, it would be the same as running it on the whole metagenomic assembly.
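A minimal sketch of what running Prokka per bin could look like, assuming each bin is stored as <name>.fasta in one directory and that the kingdom for each bin has already been decided. The function name annotate_bins, the "name kingdom" input format, and the PROKKA override variable are illustrative choices, not part of Prokka; --kingdom, --outdir and --prefix are standard Prokka options.

```shell
#!/usr/bin/env bash
# Sketch: annotate each bin separately with Prokka.
# Reads "bin_name kingdom" pairs from stdin (hypothetical format), e.g.
#   group_000000 Archaea
#   group_000002 Bacteria
# and expects the bin sequences in <bin_dir>/<bin_name>.fasta.
# Set PROKKA=echo to print the arguments instead of running Prokka (dry run).

annotate_bins() {
    local bin_dir=$1 out_root=$2
    local name kingdom
    while read -r name kingdom; do
        [ -n "$name" ] || continue            # skip empty lines
        # </dev/null keeps the tool from swallowing the rest of the list on stdin
        "${PROKKA:-prokka}" \
            --kingdom "$kingdom" \
            --outdir "$out_root/$name" \
            --prefix "$name" \
            "$bin_dir/$name.fasta" </dev/null
    done
}
```

For example, `printf 'group_000000 Archaea\n' | annotate_bins bins prokka_out` would write the annotation of that bin to prokka_out/group_000000. Each bin gets its own output directory so that results from different bins never overwrite each other.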

After binning, create .fasta files for each bin and put them in the same directory. In the example above they were named group_00000X.fasta. After you run CheckM according to their instructions, the second column of the output (see above) will be the taxonomic classification. That may be only at the level of kingdom, or go all the way down to genus. Either way, it will provide enough information so you can assign kingdom in Prokka.
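As a rough sketch, the second column of a CheckM table like the one above could be pulled out with awk. The function name checkm_kingdoms is made up for this example, and it assumes the table was saved to a text file in the layout shown in this thread. Only lineages reported at kingdom level (k__...) are translated directly; anything else (for example p__Euryarchaeota) is marked NA and has to be looked up by hand.

```shell
#!/usr/bin/env bash
# Sketch: list bin id, CheckM marker lineage, and kingdom (if stated directly).
# Assumes the plain-text table layout shown in this thread, saved to a file.

checkm_kingdoms() {
    awk '
        /^[[:space:]]*-+[[:space:]]*$/ { next }   # dashed separator lines
        $1 == "Bin" { next }                      # header row ("Bin Id ...")
        NF >= 2 {
            lineage = $2                          # e.g. k__Archaea
            kingdom = "NA"                        # unknown unless lineage is k__*
            if (lineage ~ /^k__/) kingdom = substr(lineage, 4)
            print $1 "\t" lineage "\t" kingdom
        }
    ' "$1"
}
```

The output is tab-separated, one bin per line, so rows with a known kingdom can be passed on to a per-bin Prokka run and the NA rows reviewed manually.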

For some bins there will be no annotation because CheckM only annotates prokaryotes. Those bins could be viruses, eukaryotes, or short contigs that can't be annotated conclusively.

ADD REPLY • link 5.5 years ago by Mensur Dlakic ★ 28k

0

This is really of great help. I am taking the opportunity to ask more :) CheckM actually provides a very low-resolution classification, and in order to get better resolution I was running blastn against the NT database on all bins combined (as shown below). Do you think it's a good idea to do so? Or can you recommend something better?

I was running BLAST like this:

  blastn \
  -task megablast \
  -num_threads 16 \
  -db nt \
  -outfmt '6 qseqid qstart qend qlen sseqid staxids sstart send bitscore evalue nident length' \
  -query metabat_all_contigs.fa > \
  metabat_megablast_out.raw.tab

But as you mentioned, combining the bins would be similar to running on the whole set of contigs. So should I run the above blastn on each bin separately?

I apologize for asking basic stuff, but I am quite new to binning and its workflows.

Cheers! AR

ADD REPLY • link 5.5 years ago by ARich ▴ 130

0

The degree of annotation granularity by CheckM will depend on your sample. For example, c__Thermoprotei (UID147) is a very clear-cut annotation, and k__Archaea (UID146) obviously is less so. Getting a better resolution depends on how much time you want to spend and for what purpose. For example, if you want to know just for internal use what the most likely annotation is, you could blast 5-10 largest contigs against the NT database and see if there is some kind of consensus regarding the best match. If top hits for all of them are the same and the identity is fairly high, that will probably do the trick.
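A quick sketch of that "largest contigs, then look for a consensus" check, under some assumptions: the helper names largest_contigs and top_hit_taxids are invented for this example, the BLAST table uses the column order of the blastn command earlier in this thread (staxids is column 6), and BLAST lists the hits for each query best-first.

```shell
#!/usr/bin/env bash
# Sketch: pick the N longest contigs of a bin, then tally best-hit taxids.

# Print the N longest sequences of a FASTA file (default 10), one line per sequence.
largest_contigs() {
    local fasta=$1 n=${2:-10}
    awk '
        /^>/ { if (hdr != "") print length(seq) "\t" hdr "\t" seq
               hdr = $0; seq = ""; next }
        { seq = seq $0 }                      # join multi-line sequences
        END  { if (hdr != "") print length(seq) "\t" hdr "\t" seq }
    ' "$fasta" |
        sort -k1,1nr |                        # longest first
        head -n "$n" |
        awk -F '\t' '{ print $2; print $3 }'  # back to FASTA
}

# Count how often each taxid is the first (assumed best) hit of a query.
# Input: tabular BLAST output with qseqid in column 1 and staxids in column 6.
top_hit_taxids() {
    awk -F '\t' '
        !seen[$1]++ { n[$6]++ }               # first line per query only
        END { for (t in n) print n[t] "\t" t }
    ' "$1" | sort -k1,1nr
}
```

Usage could look like `largest_contigs group_000000.fasta 10 > top10.fa`, followed by the blastn call with `-query top10.fa`, and then `top_hit_taxids` on the resulting table. If one taxid accounts for nearly all of the top hits, that is the kind of consensus described above; the identity of those hits still needs to be checked separately.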

To publish your findings, you will have to be more rigorous. There are many ways to do it, but you can start with 16S rRNA (if available), build a tree with a representative set of species and see where yours is slotted. The same can be done with a concatenated set of proteins. Pick a paper from a reputable journal describing an annotation of a new species from metagenomic data and most of these steps will be described in greater detail.

ADD REPLY • link 5.5 years ago by Mensur Dlakic ★ 28k

0

Thank you for the detailed explanation. It was really helpful. As you mentioned, publishing requires a more rigorous analysis. Can you suggest a good paper that shows this part in detail?

ADD REPLY • link 5.5 years ago by ARich ▴ 130

0

Hello! I am running some metagenomic data on a Galaxy server. I assembled the paired-end reads into contigs and scaffolds with metaSPAdes, and these were then clustered into bins with MaxBin2. I assessed the quality of these bins with CheckM. Now I want to annotate these bins using Prokka, but a fatal error keeps appearing:

  Argument "1.7.8" isn't numeric in numeric lt (<) at /usr/local/bin/prokka line 259.
  Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/corral4/main/jobs/040/757/40757985/_job_tmp -Xmx28g -Xms256m
  Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/corral4/main/jobs/040/7

I uploaded the bins in FASTA format, so I don't know what the problem could be. Thank you.

ADD REPLY • link 2.9 years ago by Maria • 0

0

Most likely you need to install the newest Prokka version, as this seems to be a problem with recognizing program versions. While you are at it, I suggest you make sure that all the other programs Prokka needs are up to date. Lastly, your question has nothing to do with the original post, and should be posted in a separate thread rather than as a comment here. Only people who are already in this almost 3-year-old thread are likely to see your question, so you are limiting your audience.

ADD REPLY • link 2.9 years ago by Mensur Dlakic ★ 28k