Time to summarize :)
Eric Lim showed what the Ensembl ID tells us. Denise - Open Targets found the correct link to the list of ID prefixes (for some reason the link on this help site is wrong; I've contacted Emily_Ensembl about this). And finally, genomax gave us a link to an example file one can use.
We need to create a file containing the species prefixes.
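The awk command further down reads this file with a tab as field separator, with the prefix in column 1 and the species name in column 2. A minimal sketch of writing such a file by hand, with only two entries (the exact wording of the mouse display name is my assumption; copy the names from the Ensembl prefix list for real use):

```shell
# Write a small tab-separated prefix table: <prefix><TAB><species name>.
# Only two entries are shown; the full list comes from the Ensembl prefix page.
printf 'ENSTNI\tTetraodon nigroviridis (Tetraodon)\n' >  prefixes.txt
printf 'ENSMUS\tMus musculus (Mouse)\n'               >> prefixes.txt

# Sanity check: every line must have exactly two tab-separated columns.
awk -F '\t' 'NF != 2 { bad++ } END { exit (bad > 0) }' prefixes.txt \
    && echo "prefixes.txt looks fine"
```

Note that the prefix is stored without the feature letter, i.e. ENSTNI rather than ENSTNIP.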
In your downloaded orthologues fasta file, have a look at the IDs to see which feature type they have. In the linked example an ID looks like this:
>ENSTNIP00000017949
The last character before the digits is always a P, for protein.
Now we can modify the headers: first read in our prefixes.txt, then iterate over the fasta file, extract the species prefix from every header line (everything between > and P), look up the prefix in our list and append the name to the line. Note that the three-argument form of match() used here needs GNU awk (gawk):
$ awk -F "\t" -v OFS="\t" 'FNR==NR {species[$1]=$2; next} {match($0, />(.+)P/, id); if (id[1] in species) {print $0, species[id[1]]} else {print}}' prefixes.txt ortho.fa > output.fa
In the output the header line now looks like this:
>ENSTNIP00000017949 Tetraodon nigroviridis (Tetraodon)
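The three-argument match() in the command above is a GNU awk extension. If only a POSIX awk (for example mawk or BSD awk) is available, something like the following sketch should give the same result. It assumes that each header starts with the ID and that the ID is capital letters, then P, then digits; the two demo files are toy inputs, and ENSXXXP00000000001 is a made-up ID with no entry in the table.

```shell
# Toy inputs (replace with your real prefixes.txt and ortho.fa).
printf 'ENSTNI\tTetraodon nigroviridis (Tetraodon)\n' > prefixes.txt
printf '>ENSTNIP00000017949\nMSTAV\n>ENSXXXP00000000001\nMKLV\n' > ortho.fa

awk -F '\t' -v OFS='\t' '
    FNR == NR { species[$1] = $2; next }        # first file: load the prefix table
    /^>/ {
        # The two-argument match() is POSIX; it sets RSTART and RLENGTH.
        if (match($0, /^>[A-Z]+P[0-9]+/)) {
            id = substr($0, 2, RLENGTH - 1)     # the ID without the leading ">"
            prefix = id
            sub(/P[0-9]+$/, "", prefix)         # strip "P" plus the digits
            if (prefix in species) { print $0, species[prefix]; next }
        }
    }
    { print }                                   # all other lines stay unchanged
' prefixes.txt ortho.fa > output.fa

cat output.fa
```

Headers whose prefix is not in the table are passed through untouched, as in the original command.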
Good team play!
fin swimmer
You can tell a lot from an Ensembl gene ID, as @Erin Lim pointed out with the link to the Ensembl help page.
The ID contains a three-letter species code, e.g. MUS for mouse (Latin name Mus musculus), as in the mouse BRAF orthologue: ENSMUSG00000002413.
So if you know the three-letter code, you know the species name in your FASTA file. It's that easy. If you don't know what the three-letter code means, check the Ensembl stable ID prefixes.
https://useast.ensembl.org/Help/Faq?id=488
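To make the ID structure concrete, here is a small sketch that splits an ID into species prefix, feature letter and number. The function name split_ensembl_id is made up for illustration, and the pattern only handles single-letter feature codes (such as G for gene, T for transcript, P for protein), not two-letter ones.

```shell
# Split an Ensembl stable ID into "<prefix> <feature letter> <number>".
# split_ensembl_id is a hypothetical helper, not part of any Ensembl tool.
# An optional version suffix such as ".1" is dropped.
split_ensembl_id() {
    printf '%s\n' "$1" |
        sed -E 's/^([A-Z]+)([A-Z])([0-9]+)(\.[0-9]+)?$/\1 \2 \3/'
}

split_ensembl_id ENSMUSG00000002413   # prints: ENSMUS G 00000002413
split_ensembl_id ENSTNIP00000017949   # prints: ENSTNI P 00000017949
```

The first field is what you look up in the prefix list.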
By species associated name, I assume you meant gene symbol? You can parse the GTF yourself or use services like BioMart.
I think what dtejadamartinez wants is to get the names in the orthologs file that one can download from the Ensembl comparative genomics page. Here is one example. Click on the Download orthologues button and then select fasta format.
dtejadamartinez: If you use one of the other formats you should be able to get species names.
An example CDS and the expected output would help.
I am reasonably sure that this is a pre-formatted file, pre-computed by Ensembl. I have an example posted in my comment above.
Thanks,
If I select another format in Ensembl (not FASTA), it doesn't offer the option to download the CDS.
In that case you will need to get the Ensembl IDs out of your fasta file. Use the biomaRt package in R (or use BioMart on the web) to get the species names. They will then need to be added back to the fasta file.
What are you planning to do with this file, BTW?
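For the first step, pulling the IDs out of the fasta headers, a rough sketch could look like this. It assumes the ID is the first word on each header line; ortho.fa here is a toy file built from the two IDs mentioned in this thread, and ids.txt is just a name I picked.

```shell
# Toy input; use your downloaded orthologue file instead.
printf '>ENSTNIP00000017949\nMSTAV\n>ENSMUSG00000002413.1 some description\nMKLV\n' > ortho.fa

# Keep header lines, drop the ">", keep only the first word,
# strip a trailing version suffix such as ".1", and deduplicate.
grep '^>' ortho.fa |
    sed -e 's/^>//' -e 's/[[:space:]].*$//' -e 's/\.[0-9]*$//' |
    sort -u > ids.txt

cat ids.txt
```

The resulting ids.txt (one ID per line) can then be pasted into the BioMart web form or read into R as input for biomaRt.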
Thanks, then I will use biomart in R as you suggest.
I'm going to do positive selection analysis (dN/dS)