Hi everyone
I have all the genomic sequences of Hepatitis B virus listed in GenBank. I am trying to reannotate the four genes (X, polymerase, surface, precore-core) myself, as the GenBank records are often incorrectly annotated. I have extracted the type sequences of the four genes from all the type isolates using Biopython, and I am now trying to align them against all the genomic sequences so that I can extract the gene positions and reannotate my little database, again using Biopython.
My problem is that the polymerase and precore-core genes are split. These two mostly start near the middle or end of the genomic sequence and carry on from the beginning, because the genome is circular. So when I align the type sequences against the genomic sequence, I only get half the gene. The EMBOSS programs I've tried (needle, water, wordfinder) are obviously not geared towards circular genomes. The genes are also in different reading frames, and every gene overlaps with one or more of the others, so I am doing this one gene at a time.
I was hoping someone could suggest a program or an alternative method for reannotating a batch of sequences based on a type sequence. I have used the Genome Annotation Transfer Utility, but as far as I can see it only handles one sequence at a time, which is obviously not a good choice for 2500 genomic sequences.
Thanks,
Liam
Hmm, yes, good idea. It does seem to work, and there is not too much extra coding needed to filter out the extra sequence.
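For anyone finding this thread later: the suggestion being replied to is not quoted here, so the following is only a rough sketch of one common way to deal with a circular genome, not necessarily what was proposed. It assumes the trick is to append a copy of the genome to itself, so that a gene running over the origin becomes contiguous, then search the doubled sequence and fold the coordinates back onto the original length. The function names (`locate_on_circular`, `extract_circular`) are made up for illustration, and the exact-match `str.find` is just a stand-in for a real aligner such as needle or water.

```python
def locate_on_circular(genome, gene):
    """Find `gene` on a circular `genome`.

    Returns a list of (start, end) pairs, 0-based, end exclusive.
    When end <= start the gene runs over the origin.
    Exact matching stands in for a real alignment step here.
    """
    n = len(genome)
    doubled = genome + genome  # a gene spanning the origin is contiguous in this string
    hits = []
    pos = doubled.find(gene)
    # Hits that start in the second copy only repeat hits from the first copy,
    # so they are the "extra sequence" to filter out.
    while pos != -1 and pos < n:
        end = (pos + len(gene)) % n  # fold the end back onto the original genome
        hits.append((pos, end))
        pos = doubled.find(gene, pos + 1)
    return hits


def extract_circular(genome, start, end):
    """Return the sequence between start and end, wrapping over the origin if needed."""
    if start < end:
        return genome[start:end]
    return genome[start:] + genome[:end]


# Toy example (made-up sequence, not real HBV data)
genome = "GGGTTTAAACCC"
wrapped_gene = "CCCGGG"  # starts near the end and carries on from the beginning
print(locate_on_circular(genome, wrapped_gene))
print(locate_on_circular(genome, "TTTAAA"))
```

With a real aligner, the same idea would apply to the reported start and end on the doubled sequence: discard alignments starting at or beyond the original length, and take the end position modulo the original length.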
Neat workaround, Ketil. I love this sort of solution.