I am working on a follow-up to a study done in 2006. In it, the authors attempted to identify "all homologs" of a gene (using blastp and tblastn) and looked at its likely evolution. (The gene is very highly conserved across species.) There were a number of interesting questions that the authors could not answer due to a lack of available sequences, which is the gap I am now trying to fill.
I started out with the goal of identifying as many homologs as I could find, which proved to be trivial with blastp. So far, using blastp alone, I have found 30% of the homologs the original paper found (as well as many more that they did not find in 2006). However, I am having worse luck with tblastn:
When I use tblastn against the representative genome database, I often hit a CPU limit error. When I do get results, they are problematic too. For example, the only human hit is the entire chromosome the gene is located on rather than the gene itself. This is a problem for alignment and tree construction, since I need to:
- somehow reduce the full chromosome sequence to just the sequence of the gene, and
- deal with splice sites, which I have no idea how to do. For human the gene structure is known, but this is not necessarily the case for other homologs. Any ideas on how to solve this?
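One possible way to cut a chromosome-sized hit down to the gene region is to use the subject coordinates of the individual tblastn HSPs, which tend to fall on the exons. The sketch below is only an illustration under stated assumptions: it assumes BLAST+ was run locally with a tabular layout such as `-outfmt "6 qseqid sseqid sstart send evalue bitscore"`, and the function names (`read_hsps`, `locus_span`, `extract`) are made up for this example. It does not resolve exact splice boundaries; a splice-aware protein-to-genome aligner would still be needed for that.

```python
from collections import defaultdict

# Translation table used for reverse-complementing minus-strand regions.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_hsps(tabular_text):
    """Group HSP subject coordinates by subject sequence.

    Assumed column order (hypothetical, set via -outfmt):
    qseqid, sseqid, sstart, send, evalue, bitscore
    """
    hsps = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        hsps[fields[1]].append((int(fields[2]), int(fields[3])))
    return hsps


def locus_span(hsp_list, flank=0):
    """Return (start, end, strand) covering all HSPs, 1-based inclusive.

    tblastn reports minus-strand hits with sstart > send, so the
    strand is taken from the majority orientation of the HSPs.
    """
    minus = sum(1 for start, end in hsp_list if start > end)
    strand = "-" if minus > len(hsp_list) / 2 else "+"
    coords = [c for pair in hsp_list for c in pair]
    return max(1, min(coords) - flank), max(coords) + flank, strand


def extract(chromosome, start, end, strand):
    """Slice the region and reverse-complement it for minus-strand loci."""
    region = chromosome[start - 1:end]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region
```

The span simply covers every HSP on a subject sequence, so hits to a distant paralog on the same chromosome would inflate it; filtering HSPs by e-value or by distance from the best hit first is probably wise.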
In the original paper, the authors searched just about every database they could get their hands on, one of them being an EST database. I have tried using tblastn against the EST database at NCBI, which comes with a different set of difficulties:
- I get a lot of hits from the same organism (many of these are from different tissues). How can I make sure that only one hit per organism is returned, or exclude duplicates?
- One of my fears is that some of these results might not be completely assembled. Should I attempt to assemble them by hand (the original authors' approach)? Are there any commonly used software tools for this?
- Finally, the organism a hit belongs to is not consistently annotated (unlike in the protein nr database, where the organism name is in square brackets). Is there a way to download the sequences (ideally into Excel) with a separate column just for the organism name?
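For the one-hit-per-organism and organism-column points above, one option is to post-process tabular BLAST output. The following is a minimal sketch, not a definitive recipe: it assumes BLAST+ run locally with `-outfmt "6 sacc sscinames evalue bitscore"` (the `sscinames` field needs the BLAST taxonomy database installed, otherwise it comes back as N/A), and the helper names `best_hit_per_organism` and `to_csv` are invented for this example. The resulting CSV opens directly in Excel.

```python
import csv
import io


def best_hit_per_organism(tabular_text):
    """Keep the highest-bitscore hit for each organism.

    Assumed column order (hypothetical, set via -outfmt):
    sacc, sscinames, evalue, bitscore
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        sacc, organism, evalue, bitscore = line.split("\t")[:4]
        score = float(bitscore)
        if organism not in best or score > best[organism][2]:
            best[organism] = (sacc, evalue, score)
    return best


def to_csv(best):
    """Render the per-organism hits as CSV text with an organism column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["organism", "accession", "evalue", "bitscore"])
    for organism in sorted(best):
        sacc, evalue, score = best[organism]
        writer.writerow([organism, sacc, evalue, score])
    return buffer.getvalue()
```

Keeping only the top-scoring hit per organism will also discard genuine paralogs, so it may be worth inspecting the dropped hits before relying on the reduced table.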
Finally, I am not clear on how people usually deal with multiple sequence alignments of nucleotide and amino acid sequences. Is it standard practice to translate nucleotides into AA before doing an alignment? This would require mRNA data, otherwise, how do people deal with splice sites?
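As an illustration of the translate-then-align idea, the sketch below translates a coding sequence with the standard genetic code and then threads the original codons back onto an already aligned protein, which is roughly the idea behind codon-alignment tools such as PAL2NAL. It assumes an intron-free CDS (for example from mRNA) in frame 1; the function names are hypothetical and the protein alignment itself would come from whatever aligner is used.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split a frame-1 coding sequence into complete codons."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds):
    """Translate a CDS; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(cds))


def thread_codons(aligned_protein, cds):
    """Map an aligned protein (gaps as '-') back onto its codons."""
    codons = split_codons(cds)
    residues = aligned_protein.replace("-", "").upper()
    # The CDS must encode the protein it is being threaded onto.
    if translate(cds)[:len(residues)] != residues:
        raise ValueError("CDS does not translate to the aligned protein")
    out, pos = [], 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")  # one protein gap spans three nucleotides
        else:
            out.append(codons[pos])
            pos += 1
    return "".join(out)
```

For instance, `thread_codons("M-K", "ATGAAA")` gives `"ATG---AAA"`, so gaps stay in multiples of three and the reading frame is preserved in the nucleotide alignment.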
I'm looking for some ideas on how people usually deal with these types of situations, as well as any software tools that may help (e.g. I know that BLAST+ may solve my CPU limit problem; I just have not had the time to learn it yet, and I'm not sure it is necessary at this point).
Thank you for your help!
I’d maybe try using profile HMMs for searching if you’re interested in finding the more remote homologs too.