Hi all,
I'm having some problems running TBLASTN locally and was hoping someone here might have ideas about what to try.
I have a smallish fungal genome, and in that genome I have an open reading frame in a single exon forming a contiguous 405 bp nucleotide sequence from start codon to stop codon. If I take this 405 bp nucleotide sequence and use blastn on the genome, I find the complete sequence. But if I convert the nucleotide sequence into a 134-amino-acid protein sequence (excluding the stop codon) and use tblastn on the same genome, the hit I get is only 113 amino acids long. I've tried several versions of BLAST from 2.2.31 to 2.7.1, and I've tweaked a number of parameters, for instance window_size, threshold and word_size, but nothing helps and I always get the same 113 aa sequence.
Does anyone have any idea what could cause this, and whether there is anything I could try to improve the tblastn results? Alternatively, is there some other piece of software that I can try instead?
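As a quick sanity check on the numbers above, here is a minimal Python sketch of the translation step. It assumes the standard genetic code (NCBI translation table 1); the function name `translate_orf` and the toy sequence are made up for illustration and are not the sequence from this post.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_orf(dna: str) -> str:
    """Translate an ORF in frame 1 and drop a single trailing stop codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
    return protein[:-1] if protein.endswith("*") else protein


# Hypothetical toy ORF (4 codons including the stop), not the real sequence.
toy_orf = "ATGGCTTGGTAA"
print(translate_orf(toy_orf))

# A 405 bp ORF has 405 / 3 = 135 codons; minus the stop codon that is 134 residues.
print(405 // 3 - 1)
```

So a 134 aa query from a 405 bp ORF is consistent, and the 113 aa hit leaves 21 residues unaccounted for.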
Cheers,
Jesper
Apart from Joe's insights and suggestions, you could also play with the gap-open and gap-extension cost parameters (set them both quite low and see what you get).
Out of curiosity, could you post the alignments from both BLAST runs so we can inspect them?
Changing the penalties for opening and extending gaps does not help either.
Here are both the nucleotide and the amino acid alignments:
Are you using an appropriate translation table? With the wrong table, the translated sequence could contain incorrect amino acids and so look less like the subject sequences.
It’s worth remembering too that all BLAST variants are local aligners, so there’s no guarantee you’d get back a full-length alignment.
You could try relaxing your parameters for similarity, and then increase the culling limit to remove redundancy.
Alternatively, give something like DIAMOND or BLAT a go.
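For anyone wanting to try the parameter suggestions in this thread systematically, here is a rough Python sketch that only assembles a tblastn command line (it does not run it). The file name `orf_protein.faa`, the database name `genome_db` and the helper `tblastn_command` are hypothetical; the option names are standard BLAST+ tblastn options, but the values are purely illustrative, and BLAST accepts only certain gap-open/gap-extend combinations for a given scoring matrix.

```python
import shlex


def tblastn_command(query, db, gapopen=None, gapextend=None, word_size=None,
                    threshold=None, culling_limit=None, db_gencode=None):
    """Build a tblastn argument list; options left as None keep BLAST's defaults."""
    cmd = ["tblastn", "-query", query, "-db", db]
    options = {
        "-gapopen": gapopen,              # cost to open a gap
        "-gapextend": gapextend,          # cost to extend a gap
        "-word_size": word_size,          # word size for the initial seed match
        "-threshold": threshold,          # minimum score for neighbouring words
        "-culling_limit": culling_limit,  # drop hits enveloped by better hits
        "-db_gencode": db_gencode,        # genetic code used to translate the subject
    }
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Illustrative values only; file and database names are hypothetical.
cmd = tblastn_command("orf_protein.faa", "genome_db",
                      gapopen=9, gapextend=2, db_gencode=1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the names point at real files, which makes it easy to loop over several parameter combinations and compare the alignment lengths.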
Hi, thanks for your suggestions.
I'm just using the standard amino acid translation table. I've now translated it again using a different program and I get the same sequence. Relaxing similarity parameters or changing culling does not help either.
I realize that BLAST is a local aligner and some edge effects are expected, but I think it's a bit disconcerting that it doesn't work here, in a situation that seems pretty close to a best-case scenario. I can also mention that I've used TBLASTN to look for the same amino acid sequence in ~20 closely related strains and species, and while I get hits in most of them, in no case is the hit longer than 113 aa.
I tried BLAT as well (DIAMOND is not set up to easily do protein-to-DNA searches), and BLAT actually finds the whole 134 aa sequence with 100% identity. Maybe I will use BLAT going forward with the project, but I still want to understand what is causing this issue. :)
Can you post the sequence of the missing 21 aa (134 minus 113)?
There is no unique reverse translation of a protein. The same protein can be encoded by a few (or many) different DNA sequences. I'm not sure how BLAST handles that... If it only converts the protein to one arbitrary DNA sequence, it could fail to find your original sequence.
As far as I understand, tblastn first translates the target (subject) sequence into amino acids in all six reading frames, and then searches these translated sequences with the protein query, so no reverse translation of the query is involved.
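To illustrate the six-frame idea, here is a minimal Python sketch. It is not how BLAST is implemented internally (BLAST uses seeded, scored local alignment rather than exact substring matching); it only shows that translating the DNA in six frames lets a protein be compared at the amino acid level. The standard genetic code is assumed, and the function names, frame labels and toy sequences are invented for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Return translations keyed by a simple label such as '+1' or '-3'.

    The label is strand plus 1-based offset on that strand; this is a
    simplification and need not match BLAST's own frame numbering.
    """
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(protein: str, dna: str) -> list:
    """List the frames whose translation contains the protein exactly."""
    return [name for name, aa in six_frame_translations(dna).items()
            if protein in aa]


# Hypothetical toy data: a tiny ORF placed on the reverse strand of a "genome".
orf = "ATGGCTTGGTAA"                        # translates to MAW plus a stop
genome = "CC" + reverse_complement(orf) + "G"
print(frames_containing("MAW", genome))
```

Because the comparison happens in protein space, synonymous codon choices in the genome do not matter, which is why the ambiguity of reverse translation is not the problem here.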