Automate High-Volume Sequence Annotation And Genbank Submission
2
4
Entering edit mode
13.9 years ago

Hi,

I am in the process of submitting a few (200) cDNA sequences to Genbank and the process is not exactly painless. Since there was a 'relatively' small number of sequences, these were annotated by hand and the 5-column table required for the BankIt submission also made by hand. As you could probably guess, there has been a few mistakes introduced in both format and content. In the near future, I may have to submit up to 10000 annotated cDNA sequences, so of course the manual approach doesn't scale up very well.

I am now writing a pipeline that would take for input a fasta file and two blastx result files (against swissprot and nr databases). It then finds the best annotation, trims and reverse-complements the sequences as needed, keeps only the longest ORF per sequence (without a stop codon). It then creates the two files needed for the submission: the fasta file with the trimmed sequences and the 5-column file needed to submit using BankIt. Of course, some final manual revision is needed, for example adding 'mitochondrial' before the name of proteins that end up in the mitochondria or reblasting (blastn) sequences that gave no protein and could be ribosomal RNAs.

Now for the questions:

  1. Is this approach sound?
  2. What would you suggest instead as a way of automating the process?
  3. What should I be especially careful about?

Thanks in advance for your enlightening suggestions!

EDIT: Here is some more information about how I choose the 'best' annotation. First, I need to highlight that this is only a provisional approach with some arbitrary decisions.

  1. I take swissprot's best annotation
    (e-value based)
  2. I add nr's five best annotations
    in order, from left to right (e-value based)
  3. I give a penalty to the e-values when I find certain keywords.
  4. I take the leftmost highest e-value as the best annotation

Here is the list of penalties used:

penalty = dict([("unnamed", 200),
                ("unknown", 30),
                ("uncharacterized", 30),
                ("hypothetical", 20),
                ("predicted", 10),
                ("similar to", 5),
                ("novel protein", 5)])

The e-values are multiplied by 10 * exp(penalty), for each of the penalized keywords found in the gene name, replacing e-values of zero by 1e-200.

This approach helps me avoid some of problems suggested by Keith James. Mis-annotations and orthologs are rather difficults problems given the fact that I work with a lowly studied fish species distant from model species. 'Tarpits' and non-specific annotation are avoided as much as possible by using annotations that do not contain 'UNNAMED' or too many unwanted keywords.

Cheers

• 4.5k views
ADD COMMENT
0
Entering edit mode

Thanks for expanding on your method, Eric. It's interesting to see the details.

ADD REPLY
4
Entering edit mode
13.9 years ago

A lot depends on the step where you 'find the best annotation'. Perhaps you could expand on how you intend to do this? The method potentially needs to be quite sophisticated. For example, simply taking the top blast hit is unlikely to yield acceptable results. Even top reciprocal best Fasta matches should be treated with some caution.The best systems that I'm aware of have always used multiple lines of evidence; reciprocal pairwise alignments, blastx, Pfam etc.

I'm assuming that you haven't found anything suitable in the literature? I spent a long time doing computationally-assisted annotation of cosmid clones and whole genomes. This was several years ago now, so I'm too out of touch with current research in that field to recommend one. It might be worth looking at the Fantom papers (automated annotation of RIKEN mouse cDNAs) for inspiration, even though the Fantom servers are no longer active.

Some of the potential sources of error that I recall are

  • Transitive errors i.e. picking up mis-annotations, especially through nr, and using them. As an extreme example, you can find bacterial proteins annotated as 'mitochondrial' because there was a hit to a mitochondrial protein.

  • Error 'tarpits' i.e. when a genome in nr is very closely related to yours, but has shockingly bad automatic annotation, causing all your searches hit these proteins, at least until Swissprot clean up the translations

  • Confusion between orthologs and paralogs. It will probably require the sort of contextual information only present in a completely finished genome to distinguish these. Avoid transferring gene symbols if you only have cDNAs because you can't be sure that you have the ortholog (unless you have other supporting evidence).

  • Failing to detect when essential domains or parts of domains are missing

  • Transferring non-specfic annotation, leading to proteins annotated as 'hypothetical predicted', 'hypothetical hypothetical', 'predicted putative' etc.

I would consider finding some high quality annotation from related species to use as a preferred reference. As controls I would take known good annotation and known "problem" data e.g. partial genes and pass them through the system.

ADD COMMENT
0
Entering edit mode

Thanks Keith. Many important points and suggestions. I'll take time later today to add some info to my question to clarify how I choose the 'best' annotation.

ADD REPLY
2
Entering edit mode
13.9 years ago
Ketil 4.1k

I'll upvote Keith's answer, which I think covers most of the bases.

I'll just add that the longest ORF is not necessarily the right one, and that for annotation, you will often get matches to some conserved domain, but where the rest of the protein is completely different.¹

The bottom line is that unless you are doing more advanced annotation (ideally experimental verification), I think it's just as well to submit cDNA reads without annotating them. If somebody wants to know what the reads hit when blast'ed against a database, it's easy enough for them to do themselves - and likely, the databases will have improved in the mean time.

¹ I once wrote a program to derive ORFs based on combining BLAST hits using dynamic programming: http://hackage.haskell.org/package/korfu. This works pretty well, since even for distantly related proteins, BLAST will pick up the weakly conserved bits.

ADD COMMENT
0
Entering edit mode

Hi Ketil. From the answer I got from Genbank, I don't think I am allowed to submit unannotated reads. Plus, it is really annoying when someone blasts on one of those 'UNKNOWN' protein from a similar species. Since this is our our species of word and there is little information, we are the most likely users in the future :) I agree that I have to refine the ORF choice, maybe doing like your approach of including the information about a few among the best blastx hits. Cheers!

ADD REPLY
0
Entering edit mode

Hi Ketil. From the answer I got from Genbank, I don't think I am allowed to submit unannotated reads. Plus, it is really annoying when someone blasts on one of those 'UNKNOWN' protein from a similar species. Since this is our species of work and there is little information for it, we are the most likely users in the near future :) I agree that I have to refine the ORF choice, maybe doing like your approach of including the information about a few among the best blastx hits. Cheers!

ADD REPLY

Login before adding your answer.

Traffic: 2463 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6