Hi all,
Recently I have been designing primers for a set of genes.
However, it is still hard for me to obtain homology information for the primers.
For example, I need to design a primer pair for one exon of a gene.
I first enumerate all possible primers of a predefined length, e.g. 18-30 bp, and then run blastall -p blastn (or megablast) with -e 1 -W 8 to determine whether the primers have homologous sequences elsewhere. However, for more than 10000 primers the BLAST output file is larger than 200 MB, which takes a long time to parse with the Bio::SearchIO module and sometimes runs out of memory. Moreover, blasting primer sequences of only 18-30 bp is risky, because short queries sometimes fail for unknown reasons.
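One way around the memory problem, as a rough sketch only: ask blastall for tabular output (-m 8, or -m 9 with comment lines) and read it line by line instead of loading the whole report. The sketch below assumes the standard 12-column tabular layout (query id, subject id, % identity, alignment length, ...). The function name and the min_align_len threshold are made up for illustration and are not part of BLAST or BioPerl.

```python
from collections import Counter
from typing import Iterable


def count_hits_per_query(lines: Iterable[str], min_align_len: int = 0) -> Counter:
    """Count tabular BLAST hits per query while holding one line at a time.

    Assumes the standard 12-column tabular layout, where column 1 is the
    query id and column 4 is the alignment length.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the '#' comment lines written by -m 9.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        query_id = fields[0]
        align_len = int(fields[3])
        # min_align_len is a hypothetical filter: ignore very short matches.
        if align_len >= min_align_len:
            counts[query_id] += 1
    return counts
```

An open file handle can be passed directly as lines, since it is consumed one line at a time. Primers with more than one counted hit (the first being the primer's own locus) would then be the ones to look at more closely.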
Another method is to blast the whole exon region with -e 0.1 -W 11. However, this generates a huge output that takes a long time to parse, and I still have to determine whether each primer region falls within a homologous part.
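For the whole-exon approach, the check of whether a primer falls into a homologous part can be reduced to interval arithmetic on the query start and end of each hit. This is only a sketch under assumptions: coordinates are 1-based and inclusive, the exon's hit against its own locus has already been dropped, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, inclusive


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent query intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous interval instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_length(primer: Interval, merged: List[Interval]) -> int:
    """Number of primer bases that lie inside any merged hit interval."""
    p_start, p_end = primer
    total = 0
    for h_start, h_end in merged:
        lo = max(p_start, h_start)
        hi = min(p_end, h_end)
        if lo <= hi:
            total += hi - lo + 1
    return total
```

A primer with an overlap of 0 lies completely outside the regions that had hits; how much overlap is still acceptable is a design decision that this sketch does not make.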
So far I have not found a good way to solve this problem.
If anyone has run into this issue, could you please tell me how you handled it?
Thanks.
2014.9.2
Although we could first identify the nucleotides that belong to repeat regions using RepeatMasker and then use the -F parameter in BLAST to ignore those regions, the repeat regions sometimes do not share much homologous sequence.
This is the only method I have found so far, but it is not perfect.
I hope someone can suggest how to improve the results.
Thanks,
The -outfmt parameter can produce the output file in tabular format; however, I cannot use BioPerl to parse that file to find out which nucleotides are identical to the input sequence. The tabular format only gives how many nucleotides are identical and how many gaps were inserted. Although we generally know the query start~end and the hit start~end, they cannot be used to show exactly the identity of part of the input sequence, e.g. positions 20-38.
The tabular output has many options, and you can customize it to output other fields. For example, qlen produces the query length, length produces the alignment length, qseq shows the aligned part of the query sequence, etc. That may or may not be sufficient for what you are looking for. But if you need all details of the alignment, then it won't work.
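If the aligned strings themselves are enough, the identical query positions can be recovered from a custom tabular output that includes qstart, qseq and sseq (for example -outfmt "6 qseqid sseqid qstart qend qseq sseq" in BLAST+). The following is a minimal sketch, assuming the query coordinates run in ascending order and that gaps appear as '-' in the aligned strings; the function names are illustrative and not part of any library.

```python
from typing import List


def identical_query_positions(qstart: int, qseq: str, sseq: str) -> List[int]:
    """Return 1-based query positions where query and subject bases match.

    qseq and sseq are the aligned strings from the tabular output and must
    have the same length; '-' marks a gap.
    """
    if len(qseq) != len(sseq):
        raise ValueError("aligned strings must have equal length")
    positions: List[int] = []
    qpos = qstart
    for q_base, s_base in zip(qseq, sseq):
        if q_base == "-":
            continue  # gap in the query consumes no query position
        if q_base.upper() == s_base.upper():
            positions.append(qpos)
        qpos += 1
    return positions


def identical_in_region(positions: List[int], start: int, end: int) -> int:
    """Count identical positions inside the inclusive query window start..end."""
    return sum(1 for p in positions if start <= p <= end)
```

For a window such as positions 20-38, identical_in_region(positions, 20, 38) divided by the window length (19) gives the identity of that part of the query for one hit.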
Yes, all I want to know is which nucleotides within the query sequence are exactly the same as in the hit sequences, which is the same as what Bio::SearchIO reports for the identical positions,