First post here, so apologies in advance for any formatting issues and other mistakes.
I am currently working on a project based on a plausible hypothesis, but I feel that my knowledge in this area is missing some fundamentals.
My goal is to determine (and later compare) the number of gene duplicates in the genomes of several mammals, for genes encoding proteins homologous to certain human proteins.
I have designed the following pipeline:
- Download human protein FASTA files from UniProt.
- Download genomes from NCBI and build BLAST databases from them using makeblastdb.
- Run tblastn (E-value threshold = 0.0001) with my set of proteins against every genome.
- Analyze the BLAST output for hits which meet certain criteria:
  - Query coverage > 70%
  - Distance between consecutive HSPs is at least 0 and below 50000 bp (I have accounted for the sign of the reading frame)
  - Same domains as in the original protein (optional, not yet implemented)
- Build a table with the number of gene duplicates per genome.
- Compare the numbers and prove/disprove the original hypothesis.
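The filtering and counting steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: it assumes tblastn was run with a custom tabular format, `-outfmt "6 qseqid sseqid qstart qend sstart send evalue qlen sframe"`, and the function names (`parse_hsps`, `query_coverage`, `count_loci`) are made up for this example. HSPs are chained only by their position on the subject sequence; the order of the HSPs along the query is not checked, and the domain criterion is left out.

```python
from collections import defaultdict

# Assumed column order (hypothetical), matching
# -outfmt "6 qseqid sseqid qstart qend sstart send evalue qlen sframe"
FIELDS = ("qseqid", "sseqid", "qstart", "qend",
          "sstart", "send", "evalue", "qlen", "sframe")


def parse_hsps(lines):
    """Yield one dict per HSP from tab-separated tblastn output lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rec = dict(zip(FIELDS, line.split("\t")))
        sstart, send = int(rec["sstart"]), int(rec["send"])
        yield {
            "query": rec["qseqid"],
            "subject": rec["sseqid"],
            "qstart": int(rec["qstart"]),
            "qend": int(rec["qend"]),
            # Minus-strand hits have sstart > send, so normalise the order.
            "s_lo": min(sstart, send),
            "s_hi": max(sstart, send),
            "qlen": int(rec["qlen"]),
            "strand": "+" if int(rec["sframe"]) > 0 else "-",
        }


def query_coverage(hsps):
    """Fraction of query residues covered by the union of the HSPs."""
    covered, last_end = 0, 0
    for start, end in sorted((h["qstart"], h["qend"]) for h in hsps):
        start = max(start, last_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / hsps[0]["qlen"]


def count_loci(hsps, min_cov=0.70, max_gap=50_000):
    """Count candidate gene copies per query protein.

    HSPs of the same query on the same subject sequence and strand are
    chained while the gap to the previous HSP is in [0, max_gap); each
    chain whose query coverage exceeds min_cov counts as one locus.
    """
    groups = defaultdict(list)
    for h in hsps:
        groups[(h["query"], h["subject"], h["strand"])].append(h)

    counts = defaultdict(int)
    for (query, _subject, _strand), group in groups.items():
        group.sort(key=lambda h: h["s_lo"])
        chains = [[group[0]]]
        for h in group[1:]:
            gap = h["s_lo"] - chains[-1][-1]["s_hi"]
            if 0 <= gap < max_gap:
                chains[-1].append(h)
            else:
                chains.append([h])
        counts[query] += sum(query_coverage(c) > min_cov for c in chains)
    return dict(counts)
```

With a real output file this would be called as `count_loci(parse_hsps(open("hits.tsv")))`, where `hits.tsv` is a placeholder name. Running it once per genome gives one column of the final table.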
However, I am not completely sure about some of these steps, so here are my questions:
- Is it correct to count gene duplication events using this method (tblastn...)?
- If so, are my selected criteria correct?
- How can I test my criteria? Are there databases of gene duplication counts, at least for human?
- (I should have asked this at the start, but well) Are there any standard methods for doing this in a more efficient and scientifically sound way?
Any other information on the general theory behind the subject is appreciated, as is criticism.
Hi,
This is a bit outside my daily routine, but I came across the UCSC RetroGenes track some time ago.
In summary, they align all known mRNAs against the genome and take a closer look at those which have at least two distinct alignments.
I hope this might help a bit.
Best,
Michael
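To illustrate the idea of "at least two distinct alignments" mentioned in this reply, here is a minimal sketch. It is not the UCSC RetroGenes pipeline; it only assumes that the mRNA-to-genome alignments are already available as `(mrna_id, chrom, start, end)` tuples, and the function name `multi_locus_mrnas` and the IDs used below are hypothetical.

```python
from collections import defaultdict


def multi_locus_mrnas(alignments):
    """Return {mrna_id: n_loci} for mRNAs aligning to two or more loci.

    `alignments` is an iterable of (mrna_id, chrom, start, end) tuples.
    Overlapping alignments on the same chromosome are merged into a
    single locus before counting.
    """
    by_mrna = defaultdict(list)
    for mrna_id, chrom, start, end in alignments:
        by_mrna[mrna_id].append((chrom, start, end))

    result = {}
    for mrna_id, hits in by_mrna.items():
        hits.sort()  # by chromosome, then start position
        n_loci, prev_chrom, prev_end = 0, None, -1
        for chrom, start, end in hits:
            if chrom != prev_chrom or start > prev_end:
                n_loci += 1  # a new, non-overlapping locus
                prev_chrom, prev_end = chrom, end
            else:
                prev_end = max(prev_end, end)  # extend the current locus
        if n_loci >= 2:
            result[mrna_id] = n_loci
    return result
```

For example, an mRNA with two overlapping alignments on one chromosome and a third alignment on another chromosome would be reported with two loci, while an mRNA with a single alignment would be left out.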
Is there any particular reason you want to work with the genomes? Why not use the (predicted) genes instead, or are they not available?