Hello!
Does anyone have input on how to find a set of variants that represents the total variation observed for a particular gene across species?
One idea I have is to run BLAST searches with a few sequences of interest as queries, and then filter the output with some basic rules, for example an e-value cutoff and a maximum length difference between query and hit. I hope this would yield a reasonably large set of the variants available in the database, on which clustering could then be performed. The problem is that I do not know which database to use, either for protein sequences or for DNA sequences.
Are the refseq_ databases the way to go, or perhaps the non-redundant protein (nr) / nucleotide collection (nt) databases?
I have a vague idea of how to further restrict my searches by taxonomy, but I have not got that far yet, since the choice of database is still unclear.
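To make the filtering idea concrete, here is a rough Python sketch of the e-value and length-difference step. It is only an illustration under some assumptions: the BLAST results are taken to be in tabular form with the columns qseqid, sseqid, pident, length, evalue, bitscore, qlen, slen (for example from BLAST+ with a custom `-outfmt "6 ..."` string), the function name `filter_hits` is made up, and the two thresholds are arbitrary placeholders rather than recommended values.

```python
import csv
import io

# Assumed column layout of the tabular BLAST output (an assumption, not a default):
#   -outfmt "6 qseqid sseqid pident length evalue bitscore qlen slen"
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "evalue", "bitscore", "qlen", "slen"]


def filter_hits(handle, max_evalue=1e-10, max_len_diff=0.2):
    """Return {subject id: best e-value} for hits passing both cutoffs.

    max_len_diff is the allowed difference between query and subject
    length, expressed as a fraction of the query length.
    Both default thresholds are placeholders.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        qlen = int(row["qlen"])
        slen = int(row["slen"])
        len_diff = abs(qlen - slen) / qlen
        if evalue > max_evalue or len_diff > max_len_diff:
            continue
        # A subject can appear several times (multiple HSPs); keep the best one.
        sid = row["sseqid"]
        if sid not in best or evalue < best[sid]:
            best[sid] = evalue
    return best


# Made-up example rows, only to show the expected input shape.
example = io.StringIO(
    "q1\tsA\t98.5\t300\t1e-50\t500\t300\t310\n"
    "q1\tsB\t40.0\t100\t1e-3\t50\t300\t305\n"
    "q1\tsC\t90.0\t150\t1e-40\t300\t300\t600\n"
    "q1\tsA\t95.0\t280\t1e-30\t450\t300\t310\n"
)
kept = filter_hits(example)
```

The subject IDs that survive could then be used to fetch the sequences for clustering. Whether a relative length cutoff is the right rule probably depends on the gene family.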
I would deeply appreciate your input!
Thank you!
I appreciate your input, but sadly I'm interested in bacterial sequences, which makes things trickier because horizontal gene transfer is possible.
You can always do your own analysis, but take a look at this as well: http://mbgd.genome.ad.jp/