tblastn doesn't find the whole sequence
1
0
Entering edit mode
5.3 years ago
jsvedber ▴ 40

Hi all,

I'm having some problems with running TBLASTN locally that I was hoping someone here might have some ideas on what to do with.

I have a smallish fungal genome, and in that genome I have an open reading frame in a single exon forming a contigous 405 bp nucleotide sequence from start codon to stop codon. If I take this 405 bp nucleotide sequence and use blastn on the genome, I find the complete sequence. But if I convert the nucleotide sequence into an 134 amino acids long protein sequence (excluding the stop codon) and use tblastn on the same genome, the hit I get is only 113 amino acids long. I've tried several versions of BLAST from 2.2.31 to 2.7.1, and I've tweaked a number of parameters, for instance window_size, threshold and word_size, but nothing helps and I always get the same 113aa sequence.

Does anyone have any idea on what could cause this, and if is there anything I could try to improve the tblastn results? Alternatively, is there some other piece of software that I can try instead?

Cheers,

Jesper

alignment software error • 2.8k views
ADD COMMENT
1
Entering edit mode

apart from Joe 's insights and suggestions you could also play with the gap-open and gap-extensions cost parameters (set them both quite low and see what you get).

Out of curiosity , could you post the alignment of both blasts so we can inspect them?

ADD REPLY
0
Entering edit mode

Changing penalties for starting and opening gaps does not help either.

Here is both the nucleotide and the amino acid alignment:

ADD REPLY
0
Entering edit mode

Are you using an appropriate translation table? It’s possible that during the translation step, you’re getting a sequence which is less like the subject sequences due to incorrect amino acids?

It’s worth remembering too that BLAST (all variants) are local aligners, so there’s no guarantee you’d necessarily get back a full length sequence.

You could try relaxing your parameters for similarity, and then increase the culling limit to remove redundancy.

Alternatively, give something like DIAMOND or BLAT a go.

ADD REPLY
0
Entering edit mode

Hi, thanks for your suggestions.

I'm just using the standard amino acid translation table. I've now translated it again using a different program and I get the same sequence. Relaxing similarity parameters or changing culling does not help either.

I realize that BLAST is a local aligner and some edge effects are expected, but I think it's a bit disconcerting that it doesn't work here, in a situation that seem pretty close to a best case scenario. I can also mention that I've used TBLASTN to look for the same amino acid sequence in ~20 closely related strains and species, and while I get hits in most of them, in no case is the hit longer than 113aa.

I tried BLAT as well (DIAMOND is not set up to easily do protein-to-DNA searches), and BLAT actually finds the whole 134aa sequence with 100 identity. Maybe I will try using BLAT going forward with the project, but I still want to understand what is causing this issue. :)

ADD REPLY
0
Entering edit mode

Can you post the sequence of missing 23 AA?

ADD REPLY
0
Entering edit mode

There is not a unique reverse sequence translation from a protein. The same protein can be derived from a few (or many) dna sequences. Not sure how BLAST manages that... If it only converts it to one random sequence then it can happen that it doesn't find your original sequence. enter image description here

ADD REPLY
4
Entering edit mode

As far as I understand, tblastn first convert the target sequence into amino acids in all six reading frames, and then uses searches these sequences.

ADD REPLY
3
Entering edit mode
5.3 years ago
Mensur Dlakic ★ 28k

The stretch of prolines at the end of your protein sequence is compositionally biased. Have you tried the -filter none switch?

Here is your protein (translated from DNA):

>prot
MNNTTLPVPVPISCDGNPSVQWYVWLIASIAWCAVGCIWFWHRSETVWTW
LNGPIETFHTEQPMTRRSRITTACFVSVLVVALWPLEILISNIWSRVTHA
TNNQGTPKDQPRTQPAEEPQAAPPPPYDVAVAAA

Here is what the filtering program (SEG) says about your protein:

>prot

                                  1-113  MNNTTLPVPVPISCDGNPSVQWYVWLIASI
                                         AWCAVGCIWFWHRSETVWTWLNGPIETFHT
                                         EQPMTRRSRITTACFVSVLVVALWPLEILI
                                         SNIWSRVTHATNNQGTPKDQPRT
         qpaeepqaappppydvavaaa  114-134

In other words, residues 114-134 are of low complexity and will not be visible to TBLASTN unless you turn off the filtering.

ADD COMMENT
0
Entering edit mode

Yes, you are correct! When turning off filtering with the -seg no option, the whole 134aa sequence is returned. I was wondering if there was some low complexity filtering going on, but I didn't know blast would do that based on the amino acid sequence.

This was not super well documented, I have to say, and another thing to think about when using BLAST.

Anyway, thanks for the help! This was really bugging me. :)

ADD REPLY
1
Entering edit mode

jsvedber : Low complexity filtering is documented in BLAST FAQ.

ADD REPLY
0
Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted (green check mark on left). You can accept more than one if they work.

ADD REPLY
0
Entering edit mode

I don't seem to be able to either upvote Mensur's reply or mark it as accepted. Am I missing something?

ADD REPLY
0
Entering edit mode

Hmm. It is clickable links at the left of the answer. You can't click on either? You may need to have scripting turned on (if you use no script or some such extension).

Screen-Shot-2019-09-09-at-9-04-55-PM

ADD REPLY

Login before adding your answer.

Traffic: 1500 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6