Hi,
I am in the process of submitting a few (200) cDNA sequences to Genbank and the process is not exactly painless. Since there was a 'relatively' small number of sequences, these were annotated by hand and the 5-column table required for the BankIt submission also made by hand. As you could probably guess, there has been a few mistakes introduced in both format and content. In the near future, I may have to submit up to 10000 annotated cDNA sequences, so of course the manual approach doesn't scale up very well.
I am now writing a pipeline that would take for input a fasta file and two blastx result files (against swissprot and nr databases). It then finds the best annotation, trims and reverse-complements the sequences as needed, keeps only the longest ORF per sequence (without a stop codon). It then creates the two files needed for the submission: the fasta file with the trimmed sequences and the 5-column file needed to submit using BankIt. Of course, some final manual revision is needed, for example adding 'mitochondrial' before the name of proteins that end up in the mitochondria or reblasting (blastn) sequences that gave no protein and could be ribosomal RNAs.
Now for the questions:
- Is this approach sound?
- What would you suggest instead as a way of automating the process?
- What should I be especially careful about?
Thanks in advance for your enlightening suggestions!
EDIT: Here is some more information about how I choose the 'best' annotation. First, I need to highlight that this is only a provisional approach with some arbitrary decisions.
- I take swissprot's best annotation
(e-value based) - I add nr's five best annotations
in order, from left to right (e-value based) - I give a penalty to the e-values when I find certain keywords.
- I take the leftmost highest e-value as the best annotation
Here is the list of penalties used:
penalty = dict([("unnamed", 200),
("unknown", 30),
("uncharacterized", 30),
("hypothetical", 20),
("predicted", 10),
("similar to", 5),
("novel protein", 5)])
The e-values are multiplied by 10 * exp(penalty), for each of the penalized keywords found in the gene name, replacing e-values of zero by 1e-200.
This approach helps me avoid some of problems suggested by Keith James. Mis-annotations and orthologs are rather difficults problems given the fact that I work with a lowly studied fish species distant from model species. 'Tarpits' and non-specific annotation are avoided as much as possible by using annotations that do not contain 'UNNAMED' or too many unwanted keywords.
Cheers
Thanks for expanding on your method, Eric. It's interesting to see the details.