Hi,
I'm trying to create a phylogenetic tree to describe the relationship between 8 different yeast species (listed below).
My preferred method for creating the tree is to align the same 6 gene sequences from each species using a programme like MEGA. I'm not able to use protein sequences, as I do not have this information for all of the species.
The issue I have is that some of the published annotations for these species aren't very thorough (i.e. most annotated genes are called "hypothetical protein"). Therefore, I can't find the sequences for the 6 genes I'd like to align.
What methods can I use to isolate the sequences of my preferred genes?
I tried using blastn to query the COX1 sequence from C. lusitaniae against the M. pulcherrima genome, but my concern is that the sequence I pulled out may be truncated or incomplete.
My other worry is that by using a known gene from a species that's part of the analysis, I'm introducing a bias towards that species (i.e. by searching for C. lusitaniae's COX1, I'm going to find the closest thing to C. lusitaniae's COX1 and not necessarily the real COX1 for that species).
Yeast Species:
- C. lusitaniae
- C. auris
- C. albicans
- M. pulcherrima
- M. persimmonesis
- M. bicuspidata
- M. borealis
- M. orientalis
You could take whatever sequence data is available for these species and see if you have any of the OrthoDB/BUSCO conserved sequences available in all data sets. BUSCO should actually come in pretty handy here, as it can also score sequences on the basis of completeness. You could then just take whatever BUSCO gene(s) is/are present in all species and use them for the phylogenetic analysis.
Your best bet is to query NCBI Gene database with the name of the gene and species you are looking for. Here is one example.
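For the BUSCO suggestion, a minimal sketch of picking the genes shared by every species might look like the following. It assumes BUSCO has already been run on each genome and that each run produced the usual tab-separated full_table.tsv (BUSCO ID in the first column, status in the second). The function names, species keys and BUSCO IDs below are made up for illustration.

```python
from typing import Dict, Iterable, Set


def complete_buscos(table_lines: Iterable[str]) -> Set[str]:
    """Return the BUSCO IDs marked 'Complete' (single-copy) in one full table."""
    ids = set()
    for line in table_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' header comments
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the BUSCO ID, column 2 the status
        # (Complete, Duplicated, Fragmented or Missing).
        if len(fields) >= 2 and fields[1] == "Complete":
            ids.add(fields[0])
    return ids


def shared_complete_buscos(tables: Dict[str, Iterable[str]]) -> Set[str]:
    """Intersect the complete BUSCO IDs across all species."""
    per_species = [complete_buscos(lines) for lines in tables.values()]
    return set.intersection(*per_species) if per_species else set()


# Toy tables standing in for real BUSCO output (IDs are invented).
example = {
    "C_albicans": [
        "# Busco id\tStatus\tSequence",
        "1A\tComplete\tchr1",
        "2B\tComplete\tchr2",
        "3C\tMissing",
    ],
    "M_pulcherrima": [
        "# Busco id\tStatus\tSequence",
        "1A\tComplete\tscaf3",
        "2B\tFragmented\tscaf9",
        "3C\tComplete\tscaf1",
    ],
}

print(sorted(shared_complete_buscos(example)))  # ['1A']
```

With real data, each value in the dictionary could simply be an open file handle on that species' full_table.tsv. Only keeping "Complete" entries is a deliberately strict choice, since fragmented or duplicated hits are the ones most likely to cause trouble in an alignment.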
Thanks for your response. That's what I have tried, but most of the species I'm working with (barring C. lusitaniae, C. auris and C. albicans) haven't had these genes annotated, and the BLAST results come back with only partial matches.
If the annotations do not already exist in GenBank, then this is not going to be straightforward. You may actually need to BLAST against the individual genomes (which may themselves be incomplete), identify the sequence of interest, extract it and then create the alignments.
It just depends on how much work/time you are willing to invest and on the quality of the public data.
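As a rough sketch of the "BLAST, identify, extract" route: the code below assumes blastn was run with the standard 12-column tabular output (-outfmt 6) and that the genome has already been loaded into a dictionary of contig name to sequence. It reports how much of the query a hit covers, which is one way to spot the truncated matches mentioned above, and pulls out the matching region, reverse-complementing it for minus-strand hits. The names Hit, parse_outfmt6, query_coverage and extract_hit are hypothetical, and the sequences are toy data.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    sseqid: str
    pident: float
    qstart: int
    qend: int
    sstart: int
    send: int
    bitscore: float


def parse_outfmt6(lines: Iterable[str]) -> List[Hit]:
    """Parse standard BLAST tabular output: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[1], float(f[2]), int(f[6]), int(f[7]),
                        int(f[8]), int(f[9]), float(f[11])))
    return hits


def query_coverage(hit: Hit, query_len: int) -> float:
    """Fraction of the query spanned by this single hit (HSP)."""
    return (hit.qend - hit.qstart + 1) / query_len


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_hit(genome: Dict[str, str], hit: Hit) -> str:
    """Return the subject region of a hit, in the orientation of the query."""
    lo, hi = sorted((hit.sstart, hit.send))
    region = genome[hit.sseqid][lo - 1:hi]  # BLAST coordinates are 1-based, inclusive
    if hit.sstart > hit.send:  # minus-strand hit: reverse-complement it
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Toy genome and a single minus-strand hit (values are invented).
genome = {"scaf1": "AAACCGTGGTTT"}
blast_lines = ["COX1\tscaf1\t95.0\t6\t0\t0\t1\t6\t9\t4\t1e-10\t50.0"]

best = max(parse_outfmt6(blast_lines), key=lambda h: h.bitscore)
print(query_coverage(best, query_len=12))  # 0.5
print(extract_hit(genome, best))           # CCACGG
```

A coverage well below 1 for the best hit suggests that the match is partial, or that the gene is split across several hits (for example by an intron or a contig boundary), so the remaining hits on the same contig are worth checking before extracting anything.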