I have a collection of gene sequences, for each I would like to identify any highly similar 150bp regions within a specific assembled genome.
I have initially tried megablast, where I blasted each query gene against the genome and looked for regions with a sequence identity > 85%. However, it became apparent that even if this worked for most genes, it may sometimes prove that an aligned region of 200bp may have an identity of 84% to my query gene, whereas in reality had the alignment been shortened by 40bp, the identity would rise above 85%.
Tweaking the blast parameters such as gap penalties and the like is not a perfect solution, and searching biostars for this particular problem proved difficult.
Do you have any suggestions on how to proceed?
Thank you...
I want to make sure I understand the question: So you would like to take a query sequence longer than 150bp, and then try to find high percent identity matches to any subsequence 150bp long? So, then, there would be 51 possible subsequences of length 150 in a query that is 200 bp long.
I suppose I should go one step further. If that is the case, would querying against all possible subsequences solve the problem? Two short PERL or Python scripts would do the trick. One to make the new queries with informative query names, and then one to summarize all the results to find matches meeting the criteria. Sometimes, we ignore the ends of shotgun reads. You could trim off 25 bases from the front and back. That's sloppy, but fast.
You've summarized it accurately.
I did consider writing a python script for this comparison, but I fear that it will run incredibly slow. I might just time a subset and estimate how long it will run for.
Thank you for your input
In my opinion blastn is an ideal choice for this purpose. Have you considered bit score? It's gives fair idea of alignment length and sequence identity.
What I fear might happen is as I expressed in the question; that the settings can not be tweaked enough to ensure that all 150bp regions of all genes are checked against the query's 150bp regions put in.
'Fair idea' is unfortunately not good enough in this case.
Thank you for your input