I know there are several approaches depending on your goals. But we can agree that all genome projects have at least a sequencing, an assembly, and an annotation phase. How should this work be done?
I am not going into the details of which type of sequencer is used and therefore which kind of processing follows; I want to know how these steps should work and which tools can do them.
UPDATE: As pointed out in one answer, the structure below goes from DNA-seq to annotation. Would it be the same for Exome-seq, and for RNA-seq to annotation?
The steps, in what I think is the processing order, are:
- Sequencing: each company's own software
- Assembly: Galaxy, Maker
Once we have a "finished" genome we can move on to the annotation part, much of which can be done in parallel:
- Genes (protein coding): Glimmer, Prodigal, Augustus, GeneMark
- Intron detection: RNAweasel
- RNA:
  - tRNA: tRNAscan-SE, ARAGORN
  - ncRNA: RNAspace, INFERNAL
  - rRNA: RNAmmer, RDP, SSU-ALIGN
  - Classification: RFAM
- Repeats: CRISPR
- Localization: SignalP, WoLF PSORT
- Protein classification: GO, PFAM, KEGG, PSI-BLAST
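To make the "in parallel" part concrete, here is a minimal sketch of how the independent annotation steps could be dispatched once the assembly exists. The tool names come from the list above; `run_step`, `annotate`, and the file name `assembly.fasta` are hypothetical placeholders, and no real command lines are invoked, since each tool has its own options.

```python
from concurrent.futures import ThreadPoolExecutor

# Annotation categories and candidate tools, taken from the list above.
ANNOTATION_STEPS = {
    "protein-coding genes": ["Glimmer", "Prodigal", "Augustus", "GeneMark"],
    "tRNA": ["tRNAscan-SE", "ARAGORN"],
    "rRNA": ["RNAmmer"],
    "ncRNA": ["INFERNAL"],
}


def run_step(step, tools, assembly_path):
    """Hypothetical placeholder: a real pipeline would call the chosen tool here."""
    return f"{step}: would run {tools[0]} on {assembly_path}"


def annotate(assembly_path, steps=ANNOTATION_STEPS, max_workers=4):
    """Run every annotation step concurrently; each one only needs the assembly."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            step: pool.submit(run_step, step, tools, assembly_path)
            for step, tools in steps.items()
        }
        # Collect results once all steps have finished.
        return {step: future.result() for step, future in futures.items()}


if __name__ == "__main__":
    for message in annotate("assembly.fasta").values():
        print(message)
```

The only real dependency here is that every step waits for the assembly; steps that consume another step's output (for example, classifying predicted proteins) would have to run after gene prediction instead.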
Are these steps and tools correct? What can be added?
Doing an assembly on Galaxy is not a good idea, unless you are working on a small genome. Also check out http://www.yandell-lab.org/software/maker.html
So, aside from this, this pipeline should work well "in general". What other tools might I be missing?
You could add RNAweasel there for intron detection. Also, depending on what you're sequencing, you will quite often have to take into account that different genetic codes exist.
Added! This only matters when looking for putative proteins, and not all gene finders translate the sequence to decide whether a region is protein coding. But good point.
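To illustrate why the genetic code matters for protein prediction, here is a small self-contained sketch. It uses the amino-acid strings of NCBI translation tables 1 (standard) and 4 (Mycoplasma/Spiroplasma and some mitochondria), which differ only in that TGA is a stop codon in table 1 but codes for tryptophan in table 4. The function name `translate` and the example sequence are made up for illustration; real tools handle start codons, ambiguity codes, and many more tables.

```python
from itertools import product

# Codons in NCBI order: each position cycles through T, C, A, G.
BASES = "TCAG"
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]

# Amino-acid strings for two NCBI translation tables ("*" marks a stop codon).
AA_BY_TABLE = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def translate(seq, table=1):
    """Translate a DNA sequence in frame 0; unknown codons become 'X'."""
    code = dict(zip(CODONS, AA_BY_TABLE[table]))
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(code.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    orf = "ATGTGATGGTAA"  # illustrative sequence, not real data
    print(translate(orf, table=1))  # M*W*
    print(translate(orf, table=4))  # MWW*
```

With the standard code the open reading frame ends at the second codon, while with table 4 it runs through to TAA, so a gene finder using the wrong table would truncate or miss the protein.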