Question

Merging Blastx Hits From Overlapping Bacterial Genome Segments

2

14.7 years ago

Darked89 4.7k

I ran blastx on a 1 Mbp bacterial genome fragment against the NCBI nr database. Beforehand I split it into 2000 bp fragments with 500 bp overlap, written to a single multi-FASTA file (using splitter from EMBOSS):

splitter -sequence my_contig.fa  -size 2000 -overlap 500
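To make the windowing concrete, here is a minimal Python sketch of how overlapping fragment coordinates can be computed. The function name `window_coords` is hypothetical, and the exact fragment boundaries produced by EMBOSS splitter may differ (for example in how the last, shorter fragment is handled), so treat this as an illustration of the idea rather than a reimplementation.

```python
def window_coords(seq_len, size=2000, overlap=500):
    """Return 1-based inclusive (start, end) pairs for overlapping windows.

    Hypothetical helper: consecutive windows share `overlap` bases,
    so each window starts `size - overlap` bases after the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if seq_len < 1:
        return []
    step = size - overlap
    coords = []
    start = 1
    while True:
        # The last window is truncated at the end of the sequence.
        end = min(start + size - 1, seq_len)
        coords.append((start, end))
        if end == seq_len:
            break
        start += step
    return coords


if __name__ == "__main__":
    for start, end in window_coords(5000):
        print(start, end)
```

With size 2000 and overlap 500 the step between window starts is 1500 bp, so a 1 Mbp contig yields several hundred fragments.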

For output I chose the tabular BLAST format (-m 9).

The next step was to convert the blastx output into GFF3. I have that working, with absolute positions (i.e. positions in the intact contig).
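The conversion step might look roughly like the sketch below. It assumes the standard 12 tab-separated columns of tabular BLAST output (query, subject, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, E-value, bit score) and, as a labeled assumption, that each fragment ID has the form `<contig>_<start>-<end>` so the fragment offset can be read from the name. The function name `hit_to_gff3` and the choice of `protein_match` as the feature type are illustrative, not taken from the original post.

```python
def hit_to_gff3(line, source="blastx"):
    """Convert one tabular blastx data line to a GFF3 line in contig coordinates.

    Assumption: the query ID looks like '<contig>_<start>-<end>', where
    <start> is the 1-based position of the fragment in the intact contig.
    Comment lines (starting with '#') must be skipped by the caller.
    """
    cols = line.rstrip("\n").split("\t")
    query, subject = cols[0], cols[1]
    q_start, q_end = int(cols[6]), int(cols[7])
    s_start, s_end = int(cols[8]), int(cols[9])
    evalue, bitscore = cols[10].strip(), cols[11].strip()

    # Recover the fragment offset from the (assumed) fragment ID format.
    contig, _, span = query.rpartition("_")
    frag_start = int(span.split("-")[0])

    # For blastx, query start > query end indicates a reverse-strand hit.
    strand = "+" if q_start <= q_end else "-"
    lo, hi = sorted((q_start, q_end))

    # Shift fragment-relative positions to positions in the intact contig.
    abs_start = frag_start + lo - 1
    abs_end = frag_start + hi - 1

    attrs = f"Target={subject} {s_start} {s_end};bitscore={bitscore}"
    return "\t".join([contig, source, "protein_match",
                      str(abs_start), str(abs_end),
                      evalue, strand, ".", attrs])


if __name__ == "__main__":
    example = ("my_contig_1501-3500\tgi|1|ref|NP_000001.1|\t85.00\t100\t15\t0"
               "\t301\t600\t1\t100\t1e-50\t200")
    print(hit_to_gff3(example))
```

The offset arithmetic is the important part: a hit at positions 301-600 of a fragment that starts at contig position 1501 lands at 1801-2100 in the intact contig.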

It seems that one ORF / predicted gene is often covered by 2-3 blast hits to the same protein. The hits may or may not overlap. Hence my questions:

What fragment sizes and overlaps are typically used for blastx in this situation?
Is there any advantage in refining the blast hits, say by merging overlapping segments (the E-values will then be invalid), or by using blast2 (blastx mode) to compare the DNA sequence from the region of overlapping or almost-touching hits against the already detected protein?
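For the merging option, a minimal sketch of the interval-merging idea follows. It groups hits by subject protein and strand, then merges hits whose contig coordinates overlap or lie within `max_gap` bases of each other. The function name `merge_hits`, the tuple layout, and the `max_gap` parameter are all hypothetical. The sketch does not check that the subject (protein) coordinates of merged hits are collinear, and the merged spans carry no valid E-value.

```python
from collections import defaultdict


def merge_hits(hits, max_gap=0):
    """Merge overlapping or nearly touching hits to the same protein.

    hits: iterable of (subject, strand, start, end) in contig coordinates,
          with start <= end (hypothetical layout).
    max_gap: largest number of uncovered bases allowed between two hits
             that should still be merged (0 merges overlapping or
             directly adjacent hits only).
    """
    groups = defaultdict(list)
    for subject, strand, start, end in hits:
        groups[(subject, strand)].append((start, end))

    merged = []
    for (subject, strand), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + max_gap + 1:
                # Overlapping or close enough: extend the current span.
                cur_end = max(cur_end, end)
            else:
                merged.append((subject, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((subject, strand, cur_start, cur_end))
    return merged


if __name__ == "__main__":
    hits = [
        ("P1", "+", 100, 500),
        ("P1", "+", 450, 900),
        ("P1", "+", 2000, 2300),
        ("P2", "+", 480, 700),
    ]
    for row in merge_hits(hits):
        print(row)
```

Keeping the original per-hit lines alongside the merged spans would preserve the E-values for anyone who needs them downstream.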

blast gff annotation genome bacteria • 4.1k views

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 14.7 years ago by Darked89 4.7k

score 3 · Answer 1 · 2010-03-05

3

14.7 years ago

Istvan Albert 102k

Isn't it the size of the protein that causes multiple hits? No matter what fragment size or overlap you choose, if two or more fragments cover different sections of the same protein, you'll get multiple hits.

If your fragment sizes are too large you'll miss regions; if they are too small you'll get multiple hits. The latter problem does not seem to preclude any downstream analysis, so it may not be worth trying to optimize it away.

0

It seems that I am missing hits to some fragments, so I will have to reduce the fragment size and increase the proportion of overlap. The average predicted gene size is 274 aa, so I will try 1 kb fragments with 500 bp overlaps next.
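As a rough back-of-the-envelope check on these numbers, the sketch below converts the average gene size to base pairs and tests whether a feature of that length is guaranteed to fall entirely inside a single fragment. It relies on a simple geometric fact about regularly spaced windows: a feature is certain to fit in one window only if its length is at most overlap + 1. The function name is hypothetical, and blastx can of course still report partial hits for genes that span fragments.

```python
def guaranteed_in_one_fragment(feature_len, size, overlap):
    """True if any feature of this length must lie fully inside some window.

    Windows start every (size - overlap) bases, so the worst case is a
    feature beginning just before the next window start; it still fits
    in the current window only if feature_len <= overlap + 1.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    return feature_len <= overlap + 1


# Average predicted gene from the post: 274 aa, 3 bases per codon.
gene_bp = 274 * 3

if __name__ == "__main__":
    print(gene_bp)
    print(guaranteed_in_one_fragment(gene_bp, size=2000, overlap=500))
    print(guaranteed_in_one_fragment(gene_bp, size=1000, overlap=500))
```

Under this criterion an average gene of 822 bp is not guaranteed to sit inside a single fragment with either setting, which is consistent with seeing 2-3 hits per gene.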

ADD REPLY • link 14.7 years ago by Darked89 4.7k