I am trying to BLAST proteins against a single genome. I built a nucleotide database from the genome and used tblastn to align the query proteins to it. I did not change any of the default parameters.
tblastn -query protein.fasta -db genome.fasta
The protein.fasta file looks like this:
>EDR98569.1 putative toxin-antitoxin system, toxin component [Anaerostipes caccae DSM 14662]
MDLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYS
EDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPR
MAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDERE
EFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRMLLQIEQQQLNNL
The alignment looks like this:
Query  1      MDLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYS  60
              +DLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYS
Sbjct  37493  VDLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYS  37672

Query  61     EDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPR  120
              EDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPR
Sbjct  37673  EDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPR  37852

Query  121    MAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDERE  180
              MAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDERE
Sbjct  37853  MAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDERE  38032

Query  181    EFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRM  227
              EFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRM
Sbjct  38033  EFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRM  38173
As you can see, the protein has an alternative start codon, and the last 12 amino acids of the query are missing from the alignment:
LLQIEQQQLNNL
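The coordinates in the alignment can be cross-checked with a little arithmetic. The following is only a sketch: the numbers are copied from the tblastn output and the FASTA record above, and the helper name hsp_summary is made up for illustration.

```python
# Coordinates copied from the tblastn alignment shown above.
QUERY_LEN = 239                 # residues in the EDR98569.1 FASTA record
Q_START, Q_END = 1, 227         # aligned query residues
S_START, S_END = 37493, 38173   # aligned subject nucleotides


def hsp_summary(q_start, q_end, s_start, s_end, query_len):
    """Summarise how much of the query one tblastn HSP covers."""
    aligned_residues = q_end - q_start + 1
    subject_nt = abs(s_end - s_start) + 1
    codons, leftover_nt = divmod(subject_nt, 3)
    return {
        "aligned_residues": aligned_residues,
        "subject_codons": codons,
        "leftover_nt": leftover_nt,
        "missing_at_c_terminus": query_len - q_end,
        "query_coverage_percent": 100.0 * aligned_residues / query_len,
    }


summary = hsp_summary(Q_START, Q_END, S_START, S_END, QUERY_LEN)
for key, value in summary.items():
    print(key, value)
```

The subject span is a whole number of codons and equals the number of aligned query residues, as expected for a gapless alignment, and the unaligned C-terminal stretch has the same length as the tail shown above.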
So far so good: this would be the expected behavior if no further matches could be made by extending the alignment. But if I download the genome (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/154/305/GCA_000154305.1_ASM15430v1) and look at the aligned region and what lies downstream of it, I get this sequence:
VDLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYSEDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPRMAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDEREEFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRMLLQIEQQQLNNL*EH
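A quick way to compare this sequence with the query is to walk through both residue by residue. The sketch below assumes the two sequences exactly as posted here and scores the unaligned tail with the diagonal of the standard BLOSUM62 matrix (the tblastn default matrix). It yields a raw score only and ignores any query filtering or score adjustment that tblastn applies, so treat it as a rough indication rather than what tblastn would report.

```python
# Query protein EDR98569.1, as posted above.
QUERY = (
    "MDLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYS"
    "EDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPR"
    "MAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDERE"
    "EFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRMLLQIEQQQLNNL"
)

# Translated genome region, as posted above ('*' marks a stop codon).
FRAME = (
    "VDLNYHDILVKILDVYRDCEVQSFPIDCYSILRHYGYRIFTYQNIRDINERLYQYCRNYS"
    "EDAFRYGAKRIIAYDENKSPFRIRFSIMHELGHIMLGHSRECAYNEQQANFFASNILAPR"
    "MAIHFAQCRNEDDVSSVFQISREAGSYAFQNYRLWKESAAREVSDVDEAMYRHFYHDERE"
    "EFIYSIKPCMICGETIYNSSEDLCLHCRMEHIRRQHTPLYTSRNDRMLLQIEQQQLNNL"
    "*EH"
)

ALIGNED_END = 227  # last query residue covered by the tblastn alignment

# Diagonal (self-match) scores of the standard BLOSUM62 matrix.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}


def mismatches(query, frame):
    """Return (1-based position, query residue, frame residue) for each difference."""
    return [
        (i + 1, q, f)
        for i, (q, f) in enumerate(zip(query, frame))
        if q != f
    ]


query_tail = QUERY[ALIGNED_END:]
frame_tail = FRAME[ALIGNED_END:len(QUERY)]

# Only identical pairs are scored; mismatches would need the full matrix.
raw_tail_score = sum(
    BLOSUM62_DIAGONAL[q] for q, f in zip(query_tail, frame_tail) if q == f
)

print("differences:", mismatches(QUERY, FRAME))
print("query tail: ", query_tail)
print("frame tail: ", frame_tail)
print("residue after the query end:", FRAME[len(QUERY)])
print("raw BLOSUM62 score of the tail:", raw_tail_score)
```

With the sequences as posted, the only difference is the first residue (M in the query, V in the translation), the tail is identical in both, and the translation hits a stop codon right after the last query residue.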
As you can see, BLAST could have extended the alignment to match all of the last 12 amino acids, which would have improved the score. So my questions are:
Why does tblastn show this behavior?
How can I force tblastn to align the full query? (-qcov_hsp_perc 100 is not an option, as it only filters the output.)
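To illustrate why a coverage threshold does not help here, this is a minimal sketch of what a query-coverage filter does to a list of HSPs. It is not BLAST's implementation; the Hsp type and both function names are hypothetical, and the single HSP uses the coordinates from the alignment above.

```python
from typing import List, NamedTuple


class Hsp(NamedTuple):
    """Hypothetical minimal HSP record: query start, query end, query length."""
    qstart: int
    qend: int
    qlen: int


def query_coverage(hsp: Hsp) -> float:
    """Percent of the query covered by this HSP."""
    return 100.0 * (hsp.qend - hsp.qstart + 1) / hsp.qlen


def filter_by_qcov(hsps: List[Hsp], min_percent: float) -> List[Hsp]:
    """Keep only HSPs that reach the coverage threshold; none are modified."""
    return [h for h in hsps if query_coverage(h) >= min_percent]


hsps = [Hsp(qstart=1, qend=227, qlen=239)]

print(filter_by_qcov(hsps, 90))   # the HSP is kept unchanged
print(filter_by_qcov(hsps, 100))  # the HSP is dropped, not extended
```

A filter of this kind can only keep or discard an HSP that has already been computed, so with a threshold of 100 the hit simply disappears from the output.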
If you want to try it yourself, you can use the web service at https://blast.ncbi.nlm.nih.gov/Blast.cgi: under Database, choose the wgs db and use the taxid 411490. The result is the same.
Oh, and this has nothing to do with a specific query length. For example, this protein is much longer, and its alignment is also cut short, by 6 amino acids.
>EDR97547.1 repeat protein [Anaerostipes caccae DSM 14662]
MGGGLSRSENTENITETFLGADETIVPLVNKVSEKSLKKTKEGQTMLDVSLGNIRINQTG
ASGGGLSEAETELNPNGYYITGKTKSYNVVVAKGVKTDLIFDSVEIESNSTTQSCVIVSH
ADVTITLKGVNIWGCNYGTSHDTNGGAVLAKNGMDGFLTVQCEYADQEGHLCDDNCSTLI
ATGNVVHAGAIGSTISNVTTASECGFCNFRVKGGNLEVSGGTHVAGIGSACNSQVYAGGY
TKNIYISGGNIKATGTERGPGIGGGYGSDFDGLYITGGKVEARGGASAPGIGTSSGQDGT
YKLKNVHISGGDTIVIAIGDKSTKMPGIGSAYGNANVSNVTAEPDPGYQGYIQDGTSLED
YSFMEGTPFHEKTDIRVGRFYTKVYFGPFRDVNGIEDDTKEQIGANHVISKSGGKPFTEE
LLHHLTKVTGKQENGTNFPPEQLTLADLSELETINAAKTKGEIGDFPLTYTTPNGTKATV
TVYLRNDGEDAGGFDKENIKEQIGANDFTKETGGSPFTEEDIKHLGEVKGKGKEGSNISL
DDFSVDQEQFKKINEHKTQGKAGEFELTYSDAKGNKVTVTVTLAGEYDAITENPDTGEMI
KGKHIISKTGGDGFTKEQLKGLSMVKAVDKDGTEIPTEDLSFAEEEQIAAINAAKTAGKT
GDFPLTFRTPDNTTVTITVFLRDEGTDAAKEGQDDPFSVIGANHTIQPTKGEPFTEEQII
DLCQAKGKDKNLDNAKILVDESQLAVINKAKKDGKTGVFDLTFSLSDGNKATVKVTLTGD
HRVSFDPDGGDYQPETQTVKGGDCAEAPRDPAKEGYVFEGWYYIDEDGNEVKWDFKTPVH
SDVKLKAKWKEADRTETTAVPTTAKKPKQKKTVPEWEYKKRVRRKRVSRTGDERQILCLI
VMFGAALTGLAAGIRKKRR