I'm running BLAT with this command:
blat genome.fasta query.fasta blat_90.csv -minIdentity=90 -noHead
And I'm surprised that I've counted the results and order by count.
import pandas as pd
import sys
df_blat = pd.read_csv('blat_90.csv', index_col=False, sep='\t')
s = df_blat.qName.value_counts()
count = 1
for k in s.head(100).index.tolist():
count += 1
print(s[k],k)
The interesting thing is that the max results per query is 672 and this number is repeated about 30 times. Then I have queries that were repeated 671 times again many times.
672 MITE_T_71978|chr3D|340455280|340455443|AT|175|F2975
672 MITE_T_97305|chr7B|194283260|194283371|AT|29|F4298
672 MITE_T_110543|chr5D|34445518|34445608|TA|94|F5023
672 MITE_T_72475|chr3D|503092467|503092630|AT|174|F2988
...
So this makes me wonder if there's a limit in the number of times a query is search. I've repeated this script with no different results.
UPDATE: I've notice that running only one query with each wheat chromosome in a separate thread, the max results per search is 32.
This is how the results look like when I search only one sequence agains each chromosome at the time:
wc -l b*
32 b1A.csv
32 b1B.csv
32 b1D.csv
32 b2A.csv
32 b2B.csv
...
Max 32 hits per search (accordingly 32 x 21 chromosomes is 672)
There must be a limit in the code but I'm still not able to find it
Considering Jim Kent is author of blat (and how long blat has been around) it seems very unlikely that you are running into some odd bug in
blat
. Do you get a different number if you increase the tile size (-tileSize=20
)?Have you tried taking a sequence from human genome and testing it against that? I will try it out later today.
default tileSize of 11 is what I need in this case, I'll give a try (I do not understand the reason, and max is 18). Why a sequence from human? I'm using wheat genome. Also I did not know who Jim Kent is till now, but I guess any programmer could make mistakes.
Human was a test. Just something other than wheat in case that genome has some strange replicative characteristic you were picking up in this search.
changing the tileSize to 17 gives 3.240.017 results vs the 32 in the original command.
So 672 is not a hard limit in blat. You are observing some characteristic in your data. I am not sure what kind of data this is.
Indeed. Oddly a larger tileSize gives more result that default. It is the wheat genome, a large and repetitive one.
I was just thinking that too. Increasing the tile size should have given fewer results. This result would indicate that the increased tile size is allowing more initial seed alignments?
Could be, sounds like an undocumented feature (or an issue) since 3M results vs just 32 does not makes sense.
I've realized that specifing even default value (11) gives results as excected. Thanks for the advise.
@Juan: Is there still a question that is outstanding? I assume it is "the max results per search is 32".
Are you using the latest
blat
available? Did you get the precompiled executables or compiled it yourself?I've used the latest and compile it myself
https://users.soe.ucsc.edu/~kent/src/blatSrc35.zip
Indeed I've realized that specifying the tileSize does not solve the issue
Also tried with
http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.36.zip
and also got:
I am going to suggest that you either directly write to Jim Kent (you should be able to find his email at UCSC and as an aside you should know who he is) or post to the UCSC genome browser list about this problem.