How easy this is depends mostly on your skills, but it is not trivial to do right. Rule of thumb: the more conserved the protein sequence, the easier it becomes. For high sequence identities, e.g. over 50%, any approach is likely to work fine.
Easy: find orthologous sequences in fish species by using web-blast and extract the sequences of blast hits (given defined cut-offs) directly from GenBank. These will often be from predicted protein-coding sequences, which is fine if the annotation is at least half-reliable. This is all possible via the web-interface.
Easy: Extract orthologues via Ensembl's Compara interface (gene tree) for the gene of interest. That will give you some orthologues, but only those in Ensembl.
Intermediate: get the CDS instead of protein sequences, or do the same on the command line.
Harder: if you want to include badly annotated or unannotated genomes, you can use exonerate in est2genome or protein2genome mode. That may work well for well-conserved proteins. You can use your blast results to restrict which contigs to search, because an exonerate alignment will be much slower than a blast run.
The following command will get you the CDS sequence:
exonerate --model protein2genome -q query.faa -t target.fna --refine full --ryo ">%qi vs. %ti %td %m aln.length: %qal score: %s %%-ident: %pi %%-sim: %ps strand: %tS CDS sequence \n%tcs"
Hardest: Annotate the unannotated genomes from scratch using MAKER2, BRAKER, etc. but I do not recommend it. The result for any single protein may not be better than the previous alternative.
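The suggestion above to restrict exonerate to the contigs that had blast hits could be sketched roughly like this. It is only a minimal sketch: the function name subset_fasta and the file names hit_ids.txt and target_genome.fna are placeholders I made up, and it assumes that the first word of each FASTA header is the contig ID, which is what blast reports in column 2 of its tabular output (-outfmt 6).

```shell
# Hypothetical helper: keep only the FASTA records whose ID (first word
# after ">") is listed in an ID file, one ID per line.
subset_fasta() {
    ids_file=$1
    fasta_file=$2
    awk '
        FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }  # read the ID list
        /^>/                { keep = (substr($1, 2) in wanted) }    # new record: wanted?
        keep                { print }                               # header + sequence lines
    ' "$ids_file" "$fasta_file"
}

# Assumed workflow (file names are placeholders, cut-offs are up to you):
#   tblastn -query query.faa -subject target_genome.fna -outfmt 6 -evalue 1e-5 \
#       | cut -f2 | sort -u > hit_ids.txt
#   subset_fasta hit_ids.txt target_genome.fna > target.fna
```

The resulting target.fna can then be passed to exonerate with -t, so that the slow alignment step only sees contigs that already had a hit.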
I may need to learn the exonerate method. Thank you very much for your suggestion, I have benefited a lot. This is very important to my subject :) I am going to learn it in the next few days. Can I contact you on this forum in the future?
Always :)
I have used the exonerate method a lot, I can post some examples for extraction of protein sequences.
Hey Michael,
I'm getting close to it with:
exonerate --model protein2genome query_aa.fa reference.fa
But I can't analyse the output file. Can you give me some advice or a worked example?
My goal is simple: using the CDS of a protein-coding gene that I have and the (unannotated) genome of the species of interest, I want to find the gene in that species and analyse its copy number and amino acid mutations.
I know the idea is simple and the operation may be complicated. As a beginner I have to learn, and I would appreciate it if you could help me.
I am using the following command:
To speed things up, you could leave out the --refine parameter. Also, only put the sequences for which you got tblastn hits into target.fna.
When using your script, I got this error:
1 ERROR was encountered in argument processing
[ 1 ] : Too many unflagged arguments
Can you tell me where the error is?
Possibly it is a version problem? I am using exonerate version 2.4.0 on linux. I will try with a later version to reproduce.
It was an illegal indent because I had typed an extra space. Now it is running, thanks a lot. Looking forward to the outcome of the program.
I just noticed that the output will be a bit bloated because it will also output the full alignment in text format. I like to have this for manual inspection of alignments and selecting the best one. But if you only want a FASTA style output, you might want to add
--showalignment FALSE --showvulgar FALSE
to the parameters.
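For analysing the output, here is a small sketch of how the FASTA-style records could be pulled out of the exonerate output and how the hits per target sequence could be counted. It assumes the --ryo format from my command above (header ">query vs. target ...", followed by the CDS on one or more lines) and that exonerate's output was redirected to a file. The function names are made up for illustration, and the "letters only" filter for sequence lines is a heuristic, not something exonerate guarantees.

```shell
# Hypothetical helper: print only the ">" header lines and the sequence
# lines that directly follow them, dropping "Command line:", "Hostname:",
# alignment text and the final "-- completed exonerate analysis" line.
extract_ryo_fasta() {
    awk '
        /^>/                    { in_rec = 1; print; next }  # record header
        in_rec && /^[A-Za-z]+$/ { print; next }              # sequence line
                                { in_rec = 0 }               # anything else ends the record
    ' "$1"
}

# Hypothetical helper: count reported alignments per target ID.
# With the header layout ">%qi vs. %ti ...", the target ID is field 3.
count_hits_per_target() {
    grep '^>' "$1" | awk '{ n[$3]++ } END { for (t in n) print t "\t" n[t] }' | sort
}

# Assumed usage (file names are placeholders):
#   exonerate --model protein2genome -q query.faa -t target.fna ... > exonerate.out
#   extract_ryo_fasta exonerate.out > hits_cds.fna
#   count_hits_per_target exonerate.out
```

Note that the number of alignments is only a rough starting point for copy number: fragmentary or overlapping hits and pseudogenes will also show up, so the alignments still need manual inspection.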