Question

Alignment Of Split ORFs

2

14.1 years ago

Liam Thompson ▴ 140

Hi everyone

I have all the genomic sequences of Hepatitis B Virus listed in GenBank. I am trying to reannotate the four genes (X, polymerase, surface, precore-core) myself, as the GenBank records are often incorrectly annotated. I have extracted the type sequences of the four genes from all the type isolates using Biopython, and am trying to align them against all the genomic sequences so that I can extract the positions and reannotate my little database, again using Biopython.

My problem lies with the polymerase and precore-core genes, which are split: they mostly start near the middle or end of the genomic sequence and carry on from the beginning. So when I try to align the type sequences against the genomic sequence, I only get half the gene. The programs I've tried from EMBOSS (needle, water, wordfinder) are not geared towards circular genomes. The genes are also in different reading frames, and every gene overlaps with one or more of the others, so I am doing this one gene at a time.

I was hoping someone could suggest a program or alternative method for reannotating a batch of sequences based on a type sequence. I have used the Genome Annotation Transfer Utility, but as far as I can see it only handles one sequence at a time, which is not practical for 2500 genomic sequences.

Thanks, Liam

multiple biopython split orf • 2.5k views

updated 13.7 years ago by Ketil 4.1k • written 14.1 years ago by Liam Thompson ▴ 140

score 3 · Answer 1 · 2010-10-22

14.1 years ago

Ketil 4.1k

Why can't you simply concatenate the genome with itself, and use that as the "genome" for aligning your genes? This will of course get you two copies of everything, but the genes spanning the breakpoint should show up in one piece.
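A minimal sketch of the doubling idea in plain Python, for illustration only. The function name `locate_on_circular` is hypothetical, and the exact substring search is just a stand-in for whatever aligner you actually use (needle, water, etc.), so it assumes the gene matches the genome exactly and that both strings use the same case. It shows how a hit on the doubled sequence can be mapped back to one or two segments on the original circular genome.

```python
def locate_on_circular(genome, gene):
    """Find `gene` on a circular `genome` by searching genome + genome.

    Returns a list of (start, end) segments, 0-based and end-exclusive,
    in original genome coordinates: one segment for an ordinary hit,
    two segments if the gene runs past the end and wraps to the start.
    Returns None if there is no exact match.
    """
    n = len(genome)
    if not gene or len(gene) > n:
        return None

    # Doubling the sequence makes a gene that spans the breakpoint contiguous.
    pos = (genome + genome).find(gene)
    if pos == -1:
        return None

    start = pos % n            # map the hit back onto the original genome
    stop = start + len(gene)   # may run past n if the gene wraps around
    if stop <= n:
        return [(start, stop)]
    return [(start, n), (0, stop - n)]


if __name__ == "__main__":
    toy_genome = "GGGTTTAAACCC"                        # made-up toy sequence
    print(locate_on_circular(toy_genome, "TTTAAA"))    # gene inside the record
    print(locate_on_circular(toy_genome, "CCCGGG"))    # gene spanning the breakpoint
```

With a real aligner you would feed it the doubled sequence instead of calling `find`, then apply the same modulo arithmetic to the reported start and end.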

0

Hmm, yes, good idea. It does seem to work, and it does not take much extra coding to filter out the extra sequence.
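For anyone wondering what that filtering might look like, here is a rough sketch. It is not the code actually used in this thread. It assumes hits on the doubled genome are given as `(start, end)` tuples (0-based, end-exclusive), and the name `dedupe_doubled_hits` is made up for the example. Two hits are treated as the same gene copy if they have the same length and their starts differ by exactly one genome length.

```python
def dedupe_doubled_hits(hits, genome_len):
    """Collapse duplicate hits produced by aligning against genome + genome.

    `hits` is an iterable of (start, end) pairs on the doubled sequence.
    Each kept hit is reported with its start inside the first genome copy;
    an end greater than `genome_len` means the gene wraps past the breakpoint.
    """
    seen = set()
    unique = []
    for start, end in sorted(hits):
        # Same start modulo the genome length and same length: same gene copy.
        key = (start % genome_len, end - start)
        if key in seen:
            continue
        seen.add(key)
        unique.append((key[0], key[0] + key[1]))
    return unique


if __name__ == "__main__":
    # Toy hits on a doubled 12-base genome: (3, 9) and (15, 21) are the same gene.
    print(dedupe_doubled_hits([(3, 9), (15, 21), (9, 15)], genome_len=12))
```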