6.8 years ago
nattzy94
Sorry, this is a really basic question. I've downloaded Glimmer 3.02 and installed it according to the instructions in the notes, but I have no idea how to input my gene sequence into the software, or how to use it at all for that matter.
Hope someone can help.
This PDF should also be in your local software download.
Hi genomax, thanks for the reply. I've read the document and got as far as the installation bit, but I can't understand how to run it. For the first step, building the ICM, how do I get those sequences and how do I input them into the program?
Also, not sure if the issue is because I am using a mac and it might not be compatible.
See: Running Glimmer: Training On Closely Related Species Sequences
While the directions say the program is macOS compatible, it is more than 10 years old at this point and macOS has changed a lot since then. If you were able to get the program to compile and run, you are probably fine to proceed.
OK, thanks a lot. I will work through it. Just to check, am I supposed to be working in the terminal window that opens when I open build-icm? Because I can't seem to type anything in that terminal.
This is a command line program that you have to run through terminal. Did you get the source code and compile the program on macOS? A linux binary will not work on macOS as is.
I downloaded Glimmer from https://ccb.jhu.edu/software/glimmer/ onto my mac and followed the installation instructions. Is that correct?
That is correct. After compiling the program do you have an executable called glimmer that you are able to run and produce help output from?
Yes, I get a whole bunch of unix executable files in the 'bin' folder. However, when I open them, all I get is a terminal window in which I can't type anything.
screenshot of the terminal window after opening Glimmer: https://ibb.co/hLsbSx
Looks like you have managed to get the package compiled.
You can't double click these executables and run them (like you normally would other GUI based programs). Instead you have to open the terminal program and use them in the terminal window itself (apple key + space --> search for terminal and then open the terminal program).
Using glimmer3 via command line is going to require basic understanding of unix command line (I am going to guess that you are not familiar with this). If yes, then I am going to suggest that you spend a half-day going over basics of UNIX for biologists using this excellent tutorial. It would be the best time investment for future.
BTW: What are you planning to use glimmer for? There may be newer tools you could use instead.
Ok, thanks for the help so far. Will take a look at the tutorial.
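For anyone landing here later, here is a rough sketch of what the command lines look like for the simplest "from scratch" run. It is written as a small Python helper that only prints the commands, so nothing is executed. The flags follow the g3-from-scratch.csh script as I remember it from the Glimmer 3.02 scripts folder, so treat them as an assumption and check them against your copy of the notes. genome.fasta and run1 are placeholder names, and the Glimmer bin folder is assumed to be on your PATH.

```python
import shlex


def glimmer_from_scratch(genome, tag):
    """Return shell command lines for a basic Glimmer3 run.

    Assumption: this mirrors the g3-from-scratch.csh script shipped with
    Glimmer 3.02. Verify the flags against your own release notes.
    """
    g, t = shlex.quote(genome), shlex.quote(tag)
    return [
        # 1. Find long, non-overlapping ORFs to use as a training set.
        f"long-orfs -n -t 1.15 {g} {t}.longorfs",
        # 2. Extract the sequences of those ORFs from the genome.
        f"extract -t {g} {t}.longorfs > {t}.train",
        # 3. Build the interpolated context model (ICM) from the training set.
        f"build-icm -r {t}.icm < {t}.train",
        # 4. Predict genes using the genome, the ICM and an output prefix.
        f"glimmer3 -o50 -g110 -t30 {g} {t}.icm {t}",
    ]


if __name__ == "__main__":
    # Placeholder file name and tag; replace with your own.
    for line in glimmer_from_scratch("genome.fasta", "run1"):
        print(line)
```

Each printed line is something you would type into the terminal window yourself, one after the other, from the folder that contains your FASTA file.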
I am trying to replicate methods for gene prediction and functional annotation in this paper: http://aem.asm.org/content/82/24/7063.full
Could you suggest some of the newer tools? For bacterial gene finding and annotation, I tried Prokka but it doesn't seem to work well (predicts way too many CDS). So I'm thinking of going back to tried and trusted glimmer.
I've never had an issue with Prokka before. Are you sure you don't have some contaminant in your assemblies or something?
I aligned to my known reference (E. coli) and visually everything seemed okay. But then Prokka got ~10000 CDS whereas one should only expect 4000-5000. Would the fact that my assembly is only based on nanopore reads have something to do with it? I'm guessing the indel errors could cause a lot of frameshifts and produce spurious ORFs (about half of the predicted genes are 'hypothetical proteins' so subtracting those would bring me to a more realistic figure). But then again, would frameshifts cause a literal doubling of predicted genes? Possible, but unlikely.
Edit: the hypothetical proteins are mostly quite short (below 500 bp)
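A quick way to put numbers on this is to tally the CDS features in the Prokka GFF. Below is a minimal sketch, assuming the usual 9-column tab-separated GFF3 layout with a product= attribute on each CDS. The example records are made up for illustration and are not real Prokka output.

```python
def summarize_cds(gff_text, short_bp=500):
    """Return (total CDS, hypothetical CDS, hypothetical CDS shorter than short_bp)."""
    total = hypothetical = short_hypothetical = 0
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # the genome sequence follows this marker, no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        total += 1
        # GFF coordinates are 1-based and inclusive.
        length = int(cols[4]) - int(cols[3]) + 1
        if "product=hypothetical protein" in cols[8]:
            hypothetical += 1
            if length < short_bp:
                short_hypothetical += 1
    return total, hypothetical, short_hypothetical


# Made-up records purely for illustration.
example = "\n".join([
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t100\t1599\t.\t+\t0\tID=X_00001;product=DNA polymerase",
    "contig_1\tProdigal\tCDS\t1700\t1999\t.\t+\t0\tID=X_00002;product=hypothetical protein",
    "contig_1\tProdigal\tCDS\t2100\t3299\t.\t-\t0\tID=X_00003;product=hypothetical protein",
    "contig_1\tAragorn\ttRNA\t3400\t3475\t.\t+\t.\tID=X_00004;product=tRNA-Ala",
])

if __name__ == "__main__":
    print(summarize_cds(example))
```

If most of the excess CDS turn out to be short hypothetical proteins, that would fit the idea of genes being broken up by frameshifts rather than genuinely extra genes.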
That seems possible, but I'm not overly familiar with nanopore data. That number would seem on the high side to just be from frameshifts etc.
What is the assembly like (N50 etc)? Prokka normally errs on the conservative side when calling genes, so I strongly suspect it isn't prokka that's the issue here.
(This may be worth opening a new forum question for).
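In case N50 is unfamiliar: it is the contig length at which contigs of that size or longer contain at least half of the assembled bases. A minimal sketch of the calculation, using a hypothetical helper name:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    print(n50([8, 8, 4, 3, 3, 2, 2, 2]))
    print(n50([5_000_000]))  # a single-contig assembly: N50 is the contig length
```

For a single-contig assembly the N50 is simply the length of that contig, so the number says nothing about per-base accuracy.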
It's a single contig! I also thought Prokka would be more conservative. Maybe it's the parameters I'm not using? I'm using the most basic command:
Hmm, how much depth of coverage did you end up with from the ONT data?
Yes, I would suggest using a more elaborate command. I typically use some or all of the following options (some are optional, and proteins is only relevant if you have a database of trusted proteins, which may help you out if you do):
How long is the genome you've ended up with?
Coverage is more than 1000X, which is quite excessive. It ended up almost exactly as long as the reference, ~5Mb.
Thanks! Let me try playing around with the parameters.
That should be more than enough coverage for accuracy, but that could have caused its own issues.
I would try downsampling it to between 100-200X coverage and reassembling. I'm not sure if this is still an issue for long-read assemblers, but for short-read assemblers, having too much coverage can make them choke.
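The arithmetic for the downsampling is simple: keep roughly target/current of the reads, which corresponds to about genome size times target coverage in bases. A small sketch using the figures from this thread (about 5 Mb genome, about 1000X now, 200X target); the function name is just for illustration:

```python
def downsample_plan(genome_bp, current_cov, target_cov):
    """Return (fraction of reads to keep, approximate bases to keep)."""
    if genome_bp <= 0 or current_cov <= 0 or target_cov <= 0:
        raise ValueError("genome size and coverages must be positive")
    # Never ask for more than all of the reads.
    fraction = min(1.0, target_cov / current_cov)
    target_bases = genome_bp * target_cov
    return fraction, target_bases


if __name__ == "__main__":
    fraction, bases = downsample_plan(5_000_000, 1000, 200)
    print(f"keep about {fraction:.0%} of reads, roughly {bases:,} bases")
```

Keeping a random fraction preserves the read length distribution; whether to instead keep only the longest or best reads is a separate choice.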
Good point, although I'm not sure that would help. At a macro level, the assembly is correct. Substitution errors are also fairly low, considering these are nanopore reads. The problem is still the indel errors, which are systematic in nanopore reads and cause frameshifts. I'm just surprised that they would mess with gene prediction so significantly.
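A toy illustration of why a single indel can matter so much: dropping one base shifts the reading frame, and a stop codon can appear almost immediately. The sequence below is invented for the example, and real gene finders do far more than scan for the first stop.

```python
STOPS = {"TAA", "TAG", "TGA"}


def first_stop_codon(seq):
    """Index (counted in codons) of the first in-frame stop, reading from base 0."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i // 3
    return None


gene = "ATGCTGACTGACTGCTAA"     # toy ORF: ATG CTG ACT GAC TGC TAA
shifted = gene[:3] + gene[4:]   # delete one base, as a deletion error would

if __name__ == "__main__":
    print(first_stop_codon(gene))     # stop at the intended end of the ORF
    print(first_stop_codon(shifted))  # premature stop right after the start codon
```

The truncated piece and whatever open frame remains downstream can each end up called as a separate short CDS, which is one way a gene count gets inflated.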
Is there a reason you are depending on gene prediction? There are plenty of E. coli genomes available and aligning to closest one should give you an idea of where genes are (and errors in your assembly).
That's right, although I'm mainly interested in using E. coli as pipeline validation. If I can get decent results with E. coli then I could be more confident going forwards that novel genomes, for which I don't have a reference, would do similarly well.
It would be inappropriate to assume that if the procedure works for a well known genome, it will work for others especially unknown ones. Every dataset is going to be unique and will need individual attention (if you are interested in getting accurate assemblies).
Your results with Escherichia already illustrate this.
So I downsampled to 200X, reassembled with Flye and used more specific parameters in Prokka.
I still get ~10000 CDS in my Prokka output. I guess we can rule out coverage messing up the assembly.
I think it must just be the indel rate in your assembly in that case. I think you'll need to throw some short reads into the mix, but it depends on what the end goal is.
Yep, I think to get accurate gene annotations, I'll have to resort to short reads!
I have not been able to install GLIMMER on Windows 10 and need help with this. I am trying to use it to check for the presence of some known genes, such as ArsD, ArsC and others, in some bacterial genomes whose sequences I have downloaded in FASTA format. Please advise on how I can achieve this, either with Glimmer or with any other reliable protocol. Thank you.
Please open a new question with as much detail as you can. Answers are reserved for actual answers to the OP question/post.
I'll give you a hint right now though: abandon Windows 10 (install the Linux Subsystem or something equivalent.)