Question

Need Advice On The Proper Use Of Glimmer3 For Finding Genes In Microbial DNA

11.9 years ago

rosarylimyt ▴ 70

I'm an undergrad taking up a beginner's project on metagenomic tools for annotation and visualization of the metabolic pathways. I'm confused by the workings of GLIMMER3. I tried experimenting with the sample files that came with the download to see if I fully understand how to operate GLIMMER3 by reproducing the given results.

The thing is I'm unsure if the scripts given (namely g3-from-scratch, g3-from-training, g3-iterated) are to be used in a consecutive manner.

I'd assumed these scripts had to be used in order starting with g3-from-scratch, and I've generated a number of files. Thereafter I'm stuck at g3-from-training. I downloaded and read through the only tutorial I could find and it says to type the command in the form:

g3-from-training.csh [yourgenom.seq] train.coords run2

I can't find a .coords file generated from the previous g3-from-scratch run and so I can't proceed. Please advise me on what I should do. I wanna progress from here! =[

metagenomics clustering • 3.1k views

Updated 11.9 years ago by Josh Herr 5.8k • written 11.9 years ago by rosarylimyt ▴ 70

score 1 · Answer 1 · 2013-02-09

I've never used GLIMMER but am just now downloading it and will give it a test run including reading over the documentation, so I will try to help out. Please forgive me if I'm not helpful in this post, but no one else has posted anything yet so I thought I would try to contribute.

Typically with metagenomic read clustering or amplicon clustering you can cluster de novo based on sequence similarity (the g3-from-scratch script). For these clustering methods there is no a priori designation, just a measurement of sequence similarity. An advantage of this method is computational speed. One disadvantage is that the reads are not named or identified. You can BLAST or use phylogenetic methods to identify the clustered reads, but these methods are time consuming and, particularly when using BLAST against very large databases, may introduce errors.

Other types of clustering algorithms get around using BLAST by using "training sets", which are basically curated databases that can be specific to your research area (for example, the human microbiome or deep sea sediments) and will help to identify the reads based on where they were sampled or their presumed taxonomic composition. The train.coords file is this training set. It may be computed from a prior analysis, in which case I'm not sure where it might be located.

Other clustering methods, such as those from RDP or greengenes, as implemented through QIIME, use a matrix or FASTA-type file that you create yourself in a text editor as your training set. In other words, you might have to create your own train.coords file, but I would think one would come with the tutorial. Perhaps creating it is part of the tutorial?

score 1 · Answer 2 · 2013-02-10

So I spent some time going through the tutorial scripts with some of my own data. The g3-from-scratch script predicts genes with no prior knowledge about what genes the sequence may contain. The g3-from-training script requires that you provide your own training set, which is why you are having issues: you need to supply that text file yourself. Lastly, the g3-iterated script is a combination of the two: first it runs the "from scratch" prediction, then it generates a .coords file from those predictions as your "training set", then it uses this training set to predict genes again. And to answer your question: you don't have to run them in sequence.
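If you would rather build a training set by hand from an earlier run, the step where g3-iterated turns its first-pass predictions into a coordinates file can be sketched roughly as below. This is only a sketch under assumptions: it assumes the .predict file has a FASTA-style header line starting with ">" followed by one predicted gene per line (ID, start, end, then extra columns), and `predict_to_coords` is a hypothetical helper name, not something shipped with GLIMMER. Check the result against your own output files before relying on it.

```python
def predict_to_coords(predict_text: str) -> str:
    """Return the gene lines of a Glimmer3-style .predict file as coords text.

    Assumption: header lines start with '>' and every other non-blank line
    describes one predicted gene (ID, start, end, plus extra columns).
    """
    kept = []
    for line in predict_text.splitlines():
        stripped = line.strip()
        # Drop blank lines and the '>' header line(s); keep the gene lines as-is.
        if not stripped or stripped.startswith(">"):
            continue
        kept.append(stripped)
    # End with a newline only when at least one gene line survived.
    return "\n".join(kept) + ("\n" if kept else "")
```

You could then write the returned text to a file such as train.coords and pass it as the second argument in the g3-from-training command quoted in the question. Whether the extra columns are tolerated in a coordinates file is an assumption here, so it is worth testing on the sample data first.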

I was actually surprised to see that the last update to these scripts was in early 2006, and I'm always a little wary of investing time in something that hasn't been maintained in a while. I think there are other tools that have been developed in recent years with similar and more robust functions. That being said, I am going to give GLIMMER a few more runs to really make up my mind about it.