Question

Which Is An Acceptable Pipeline From Sequencing To Deep Annotation?

10.9 years ago

Lluís R. ★ 1.2k

I know there are several approaches, depending on your goals. But we can agree that every genome project has at least a sequencing phase, an assembly phase, and an annotation phase. How should this work be done?
I am not going into the details of which type of sequencer is used and therefore what kind of processing follows; I want to know how these steps should work and which tools can do them.

UPDATE: As pointed out in one answer, the structure below goes from DNA-seq to annotation. Would it be the same for Exome-seq to annotation and for RNA-seq to annotation?

The steps, in what I think is the processing order, are:

Sequencing: each company's own software
Assembly: Galaxy, Maker

Once we have a "finished" genome we can move on to annotation, many parts of which can be done in parallel:

Genes (protein coding): Glimmer, Prodigal, AUGUSTUS, GeneMark
- Intron detection: RNAweasel
RNA:
- tRNA: tRNAscan-SE, ARAGORN
- ncRNA: RNAspace, INFERNAL
- rRNA: RNAmmer, RDP, SSU-ALIGN
- Classification: Rfam
Repeats: CRISPR
Localization: SignalP, WoLF PSORT
Protein classification: GO, Pfam, KEGG, PSI-BLAST

Are these steps and tools correct? What could be added?
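To make the "can be done in parallel" part concrete, here is a minimal sketch that treats the stages above as a dependency graph and groups them into waves that could run at the same time. The stage names and the dependencies are my own assumptions for illustration (for example, that localization and protein classification need predicted proteins first); they are not taken from any particular pipeline tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage -> set of stages it depends on (assumed for illustration only).
PIPELINE = {
    "sequencing": set(),
    "assembly": {"sequencing"},
    "protein_coding_genes": {"assembly"},
    "rna_genes": {"assembly"},
    "repeats": {"assembly"},
    "localization": {"protein_coding_genes"},
    "protein_classification": {"protein_coding_genes"},
}


def execution_waves(graph):
    """Group stages into waves; stages in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, ", ".join(wave))
```

With these assumed dependencies, gene finding, RNA annotation and repeat detection all land in the same wave right after assembly, which is the parallelism described above.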

pipeline assembly • 3.9k views

Doing an assembly on Galaxy is not a good idea, unless you are working on a small genome. Also check out http://www.yandell-lab.org/software/maker.html

10.9 years ago by Zev.Kronenberg 12k

So, aside from this, the pipeline should work well "in general". What other tools might I be missing?

10.9 years ago by Lluís R. ★ 1.2k

You could add RNAweasel there for intron detection. Also, depending on what you're sequencing, you may have to take into account that different genetic codes apply in quite a few cases.
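As a small illustration of why the genetic code matters, the sketch below translates the same made-up sequence with the standard code (NCBI translation table 1) and with the vertebrate mitochondrial code (table 2). The example sequence and the function name are hypothetical; the two amino-acid strings follow the usual NCBI ordering with bases in the order T, C, A, G.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Amino acids in NCBI order for each of the 64 codons ("*" marks a stop).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO_AA = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

STANDARD = dict(zip(CODONS, STANDARD_AA))
VERT_MITO = dict(zip(CODONS, VERT_MITO_AA))


def translate(dna, table):
    """Translate complete codons in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    orf = "ATGTGAATAAGA"  # made-up example: ATG TGA ATA AGA
    print(translate(orf, STANDARD))   # M*IR
    print(translate(orf, VERT_MITO))  # MWM*
```

TGA is a stop codon in the standard code but tryptophan in the vertebrate mitochondrial code, while AGA is arginine in the standard code and a stop in the mitochondrial one, so a gene finder using the wrong table would truncate or extend the predicted protein.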

10.9 years ago by 5heikki 11k

Added! This only matters when looking for putative proteins, and not all gene finders translate the sequence to decide whether a region is protein coding. But good point.

10.9 years ago by Lluís R. ★ 1.2k

Alex Paciorkowski · Answer 1 · 2014-01-29

10.9 years ago

Ying W ★ 4.3k

It's a bit ambiguous what you mean by 'deep annotation'. Do you mean going from Exome-seq data to annotation, from RNA-seq data to annotation, or from DNA-seq data to annotation? From the rest of your question, I am guessing you mean the last one.

For sequencing, there are a lot of QC steps involved in making sure that the data you get is the data you expect. You can find some of these tools in Galaxy.
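One very simple example of such a QC step is filtering reads by mean base quality. The sketch below assumes four-line FASTQ records with Phred+33 quality encoding; the reads, the function names and the threshold of 20 are made up for illustration, and real QC tools do far more than this.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def passing_reads(lines, min_mean=20):
    """Names of reads whose mean quality is at least min_mean."""
    return [name for name, _seq, qual in read_fastq(lines)
            if mean_phred(qual) >= min_mean]


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n"  # made-up reads
    print(passing_reads(example.splitlines()))
```

Here 'I' decodes to Phred 40 and '#' to Phred 2, so only the first made-up read passes the threshold.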

For assembly, you can find some information about the different methods here

For annotation, you already have a pretty good list of tools, but another thing you could do is use PSI-BLAST to compare predicted proteins with known proteins. This guide (direct link to 18 Mb pdf) may be of interest. It might also be useful to look at published de novo genome assemblies of organisms close to the one you are working on and at the tools they used for annotation.
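As a rough sketch of the PSI-BLAST step, the snippet below only builds a `psiblast` command line from BLAST+ and prints it; it does not run anything. The file names, the database name `swissprot`, and the iteration and E-value settings are placeholder assumptions, so adjust them to your own data and check them against your BLAST+ version.

```python
import shlex


def psiblast_command(query_fasta, database, out_path, iterations=3, evalue=1e-5):
    """Build an argument list for BLAST+ psiblast (tabular output, format 6)."""
    return [
        "psiblast",
        "-query", query_fasta,        # predicted proteins in FASTA format
        "-db", database,              # protein database made with makeblastdb
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-outfmt", "6",               # tab-separated hit table
        "-out", out_path,
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    cmd = psiblast_command("predicted_proteins.faa", "swissprot", "hits.tsv")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without worrying about shell quoting.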

updated 10.9 years ago by Alex Paciorkowski 3.5k • written 10.9 years ago by Ying W ★ 4.3k

Well, I am working from DNA-seq to annotation, but that's one of the things I was missing. So how would it work for Exome-seq and RNA-seq annotation? For now I have added PSI-BLAST to compare proteins. Thanks for the help.

10.9 years ago by Lluís R. ★ 1.2k

Exome-Seq generally refers to a form of targeted resequencing in which you apply a capture method to genomic DNA and sequence only the exonic regions. This means you are working with a reference sequence, not with an unsequenced genome.

10.9 years ago by DG 7.3k

So here, after sequencing the targeted exons, would they need to be joined together to get the translated region? That would mean selecting the right order of exons and checking whether the translated result is in a protein database?
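To show what I mean by joining exons, here is a minimal sketch that assumes the exon coordinates are already known from a reference annotation (0-based, end-exclusive). The reference string, the coordinates and the function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def splice_exons(reference, exons, strand="+"):
    """Join exon slices in genomic order; reverse-complement for the minus strand."""
    joined = "".join(reference[start:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


if __name__ == "__main__":
    # Made-up reference: flank, exon 1, intron, exon 2, flank.
    reference = "CC" + "ATGGCC" + "GTAAGTAG" + "TTTTGA" + "CC"
    exons = [(16, 22), (2, 8)]  # deliberately given out of order
    print(splice_exons(reference, exons))  # ATGGCCTTTTGA
```

The spliced sequence could then be translated and compared against a protein database.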

10.9 years ago by Lluís R. ★ 1.2k

What data are you working with? I was pointing out that if you are not working with an organism that already has a sequenced genome, including gene definitions, you can't do Exome-Seq really, because you don't know what you are targeting. You can of course do RNA-Seq.

10.9 years ago by DG 7.3k

I am working with a sequenced organism that has gene definitions. But this question is intended to be broader. Thanks anyway.

10.9 years ago by Lluís R. ★ 1.2k