How Can 3 Different Reading Frames Have Similarity For The Same Sequence Using Blastx?
13.9 years ago
Worldalive ▴ 20

Hi, I have a DNA sequence (about 388 bp) that I am comparing against GenBank sequences using Blastx. I understand that Blastx translates a DNA sequence in all 6 possible reading frames, but the result is puzzling me: 3 different reading frames show similarity to the same protein (in a conserved region of the Peptidase M1 superfamily). Also, when I look closely at the alignments, the similarities in the 3 frames occur within the same region. The maximum identity is approximately 76%, with an E-value of 2e-11.

Is this "similarity" of my sequence most likely due to chance?

There are 2 things that make me think this:

1) I am aware that my sequence is short compared to the >1000 bp M1 peptidase sequence in GenBank.

2) When I look at the reading frames of my translated sequence, there are stop codons spread throughout. Or could this be due to sequencing errors?

Thanks for any help!

blast alignment • 4.9k views

Repeating this comment regarding the use of BlastX with a frameshift penalty (the -w option): I've found an interesting discussion here. I wonder what frameshift penalty value(s) are typically used for BlastX.


I bet the 3 reading frames are in the same direction, right?

13.9 years ago
Ketil 4.1k

This is probably too obvious, but if it is a low complexity or repeat region, this could happen. Normally LCRs are masked by BLAST, but perhaps you were using -F F?
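
To illustrate one way a repeat can produce this, here is a minimal Python sketch. It assumes a made-up tandem repeat (`ACGT`, not taken from the poster's sequence) whose unit length is not a multiple of 3; the helper `translate_frame` and the variable names are hypothetical. In that situation each forward frame reads a rotation of the same peptide motif, so all three frames can align to the same stretch of a protein.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(dna, offset):
    """Translate dna starting at offset (0, 1 or 2); '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


# Hypothetical toy sequence: a 4-nt unit repeated 9 times (36 nt in total).
toy_repeat = "ACGT" * 9

# Translate the three forward frames and print them one above the other.
frames = [translate_frame(toy_repeat, k) for k in range(3)]
for k, peptide in enumerate(frames, start=1):
    print(f"frame +{k}: {peptide}")
```

Real low-complexity regions are rarely this regular, so treat this only as an intuition for why the masking filter matters.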

13.9 years ago
Marina Manrique ★ 1.3k

Sequencing errors can cause indels that change the reading frame. It is common for the same nucleotide sequence to have several Blast high-scoring segment pairs (HSPs) in different reading frames against the same reference protein. Does your sequence come from a 454 experiment? The typical 454 errors usually cause frameshifts, which could explain your situation. It would also be useful to see the Blast result you get.
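
The frameshift effect can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the 30-nt coding sequence, the inserted base, and the helper `translate_offset` are all hypothetical, and nothing here reproduces what Blastx does internally. It shows a single inserted base splitting one peptide across two reading frames, which is the pattern that yields several HSPs against the same protein.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_offset(dna, offset):
    """Translate dna starting at offset (0, 1 or 2); '*' marks a stop codon."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3))


# Hypothetical error-free read encoding the peptide MKTAYIAKQR.
clean_read = "ATGAAAACCGCGTATATTGCGAAACAGCGT"

# Simulate a sequencing error: one extra base inserted after the fifth codon.
noisy_read = clean_read[:15] + "G" + clean_read[15:]

original_peptide = translate_offset(clean_read, 0)
shifted = [translate_offset(noisy_read, k) for k in range(3)]

print("clean, frame +1:", original_peptide)
for k, peptide in enumerate(shifted, start=1):
    print(f"noisy, frame +{k}:", peptide)
```

In the noisy read, the first half of the peptide is still read in frame +1, while the second half is only recovered in frame +2, so a search would report two shorter matches to the same protein in different frames.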

13.9 years ago

It is not just a low-complexity region that will give the result you describe, but any repetitive sequence. This becomes a problem when the repeat sequence is falsely incorporated into a gene model, thereby taking what should be annotated as a genomic repeat/low-complexity region and putting it into the protein database.

Try it yourself - take a human Alu sequence and run it against a protein db. I'm sure many of those hits are from bad gene models.
