I ran blastx on a 1 Mbp bacterial genome fragment against the NCBI nr database. I first split it into 2000 bp fragments with a 500 bp overlap, written to a single multi-FASTA file, using splitter from EMBOSS:
splitter -sequence my_contig.fa -size 2000 -overlap 500
For output I chose the tabular BLAST format (-m 9).
The next step was to convert the blastx output into GFF3. That part is done, with absolute positions (positions in the intact contig).
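For anyone wanting to reproduce the conversion, here is a minimal sketch of how one tabular hit could be mapped to contig coordinates. It assumes the fragment IDs end in `_<start>-<end>` (the pattern I expect from splitter; check your own FASTA headers) and the standard 12 tabular columns. The function name `hit_to_gff3`, the source `blastx`, and the type `protein_match` are my own choices, not anything BLAST or EMBOSS produces.

```python
import re

# Assumed fragment ID layout: <contig>_<start>-<end>, e.g. my_contig_1501-3500
FRAGMENT_ID = re.compile(r"^(?P<contig>.+)_(?P<start>\d+)-(?P<end>\d+)$")


def hit_to_gff3(line):
    """Convert one tabular BLAST line into a GFF3 line in contig coordinates."""
    f = line.rstrip("\n").split("\t")
    query, subject = f[0], f[1]
    qstart, qend = int(f[6]), int(f[7])
    sstart, send = int(f[8]), int(f[9])
    evalue = f[10]

    m = FRAGMENT_ID.match(query)
    if m is None:
        raise ValueError(f"unexpected fragment ID: {query}")

    # Fragment coordinates are 1-based, so shift by (fragment start - 1).
    offset = int(m.group("start")) - 1
    # blastx reports q.start > q.end for hits on the reverse strand.
    strand = "+" if qstart <= qend else "-"
    start = min(qstart, qend) + offset
    end = max(qstart, qend) + offset

    attributes = f"Target={subject} {sstart} {send}"
    return "\t".join([m.group("contig"), "blastx", "protein_match",
                      str(start), str(end), evalue, strand, ".", attributes])


def convert(lines):
    """Yield GFF3 lines, skipping the '#' comment lines that -m 9 adds."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield hit_to_gff3(line)
```

The E-value goes into the score column here; the bit score (column 12) could be used instead if that suits downstream tools better.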
It seems that one ORF / predicted gene is often covered by 2-3 blast hits to the same protein. The hits may or may not overlap. Hence my questions:
- what fragment sizes / overlaps are typically used for blastx in such a situation?
- is there any advantage in improving the blast hits, say by merging overlapping segments (the E-values will no longer be valid), or by using blast2 (blastx mode) to compare the DNA sequence from the region of overlapping/almost-touching hits against the already detected protein?
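To make the merging idea concrete, here is a rough sketch of what I have in mind: hits to the same protein on the same strand are joined when they overlap or lie within `max_gap` bp of each other. The tuple layout, the function name `merge_hits`, and the `max_gap` parameter are all hypothetical, and reading frame is ignored. The merged spans carry no E-value, since the original per-hit values no longer apply.

```python
from collections import defaultdict


def merge_hits(hits, max_gap=0):
    """Merge overlapping or near-touching hits to the same protein.

    hits: iterable of (subject, strand, start, end) in contig coordinates,
          with start <= end.
    max_gap: largest distance in bp (next start minus current end) at which
             two hits are still joined.
    """
    groups = defaultdict(list)
    for subject, strand, start, end in hits:
        groups[(subject, strand)].append((start, end))

    merged = []
    for (subject, strand), spans in groups.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + max_gap:
                # Overlapping or close enough: extend the current span.
                cur_end = max(cur_end, end)
            else:
                merged.append((subject, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((subject, strand, cur_start, cur_end))

    # Report in contig order.
    return sorted(merged, key=lambda h: (h[2], h[3]))
```

With `max_gap=0` only overlapping hits are joined; a larger value also joins the almost-touching ones.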
It seems that I am missing hits to some fragments, so I will have to reduce the fragment size and increase the proportion of overlap. The average predicted gene size is 274 aa (about 822 bp), so I will try 1 kb fragments with 500 bp overlaps next.
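For comparing the two settings, the window arithmetic can be sketched as below. This is only my own illustration of how size and overlap translate into fragment coordinates, not a reimplementation of splitter; the function name `windows` is hypothetical and the last fragment is simply truncated at the contig end.

```python
def windows(length, size, overlap):
    """Yield 1-based (start, end) fragment coordinates along a contig."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # distance between consecutive fragment starts
    start = 1
    while True:
        end = min(start + size - 1, length)
        yield start, end
        if end == length:
            break
        start += step


if __name__ == "__main__":
    contig_length = 1_000_000   # 1 Mbp contig
    gene_bp = 274 * 3           # average predicted gene, in bp
    for size, overlap in [(2000, 500), (1000, 500)]:
        n = sum(1 for _ in windows(contig_length, size, overlap))
        print(f"size={size} overlap={overlap}: {n} fragments, "
              f"step={size - overlap} bp, average gene={gene_bp} bp")
```

Going from 2000/500 to 1000/500 cuts the step from 1500 bp to 500 bp, so the number of blastx queries goes up roughly threefold.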