How can I add the associated species name to the Ensembl IDs in a FASTA file (CDS)?

6.8 years ago

dtejadamartinez ▴ 20

Hi,

I downloaded the orthologue FASTA file (CDS) for one gene from Ensembl, but the headers contain only the Ensembl IDs, not the associated species names.

How can I add the associated species name to each Ensembl ID in a FASTA file?

(I need to do this for hundreds of genes.)

Thanks,

Ensembl • 2.3k views

ADD COMMENT • link updated 6.8 years ago by finswimmer 16k • written 6.8 years ago by dtejadamartinez ▴ 20

You can tell a lot from an Ensembl gene ID, as @Eric Lim showed by pointing you to the Ensembl help page.

The ID contains a species code, usually three letters, e.g. MUS for mouse (Latin name Mus musculus). The BRAF orthologue in mouse is ENSMUSG00000002413.

So if you know the code, you know the species for each sequence in your FASTA file. It's that easy. If you don't know what a code means, check the Ensembl stable ID prefixes.
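
To make the prefix rule concrete, here is a minimal Python sketch (not from the thread) that splits an ENS-style stable ID into its species prefix and feature type. The function name `split_stable_id` and the regular expression are illustrative assumptions. IDs that do not follow the ENS pattern, such as the MGP_ mouse strain IDs or FlyBase IDs, are simply reported as unrecognized.

```python
import re

# Feature-type codes used in Ensembl stable IDs.
FEATURE_TYPES = {
    "E": "exon",
    "FM": "protein family",
    "G": "gene",
    "GT": "gene tree",
    "P": "protein",
    "R": "regulatory feature",
    "T": "transcript",
}

# Prefix (ENS plus optional species letters), feature code, 11 digits,
# and an optional ".version" suffix. The lazy [A-Z]*? lets the feature
# code claim the letters that sit directly before the digits.
STABLE_ID = re.compile(r"^(ENS[A-Z]*?)(FM|GT|E|G|P|R|T)(\d{11})(?:\.\d+)?$")


def split_stable_id(stable_id):
    """Return (species_prefix, feature_type), or None if the ID is not ENS-style."""
    m = STABLE_ID.match(stable_id)
    if m is None:
        return None
    prefix, code, _digits = m.groups()
    return prefix, FEATURE_TYPES[code]


print(split_stable_id("ENSMUSG00000002413"))  # ('ENSMUS', 'gene')
print(split_stable_id("ENSTNIP00000017949"))  # ('ENSTNI', 'protein')
```

The returned prefix (for example `ENSMUS`) is what you would look up in the list of stable ID prefixes. Human IDs have no species letters, so their prefix is just `ENS`.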

ADD REPLY • link 6.8 years ago by Denise CS ★ 5.2k

https://useast.ensembl.org/Help/Faq?id=488

By species associated name, I assume you meant gene symbol? You can parse the GTF yourself or use services like BioMart.

ADD REPLY • link 6.8 years ago by Eric Lim ★ 2.2k

I think what dtejadamartinez wants is to get the species names into the orthologue file that one can download from the Ensembl comparative genomics page. Here is one example. Click on the Download orthologues button and then select FASTA format.

dtejadamartinez: If you use one of the other formats you should be able to get the species names.

ADD REPLY • link 6.8 years ago by GenoMax 151k

An example CDS and the expected output would help.

ADD REPLY • link 6.8 years ago by cpad0112 21k

I am reasonably sure that this is a pre-formatted file precomputed by Ensembl. I have an example posted in my comment above.

ADD REPLY • link 6.8 years ago by GenoMax 151k

Thanks,

If I select another format (not FASTA) in Ensembl, it doesn't offer the option to download the CDS.

ADD REPLY • link 6.8 years ago by dtejadamartinez ▴ 20

In that case you will need to get the Ensembl IDs out of your FASTA file. Use the biomaRt package in R (or BioMart on the web) to get the species names. They will then need to be added back to the FASTA file.
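
The first step, pulling the IDs out of the FASTA headers, could look like the Python sketch below. It is only an illustration: the function name `fasta_ids` and the file name `ortho.fa` in the comment are placeholders, and it assumes the ID is the first whitespace-delimited token after the `>`.

```python
def fasta_ids(lines):
    """Yield the first token of every FASTA header line (the text after '>')."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" with no ID
                yield fields[0]


# Small in-memory example; for a real file you would pass the open handle:
#     with open("ortho.fa") as handle:
#         ids = list(fasta_ids(handle))
example = [">ENSTNIP00000017949\n", "ATGGCC\n", ">ENSMUSP00000002487 extra text\n", "ATGGCA\n"]
print(list(fasta_ids(example)))  # ['ENSTNIP00000017949', 'ENSMUSP00000002487']
```

The resulting list can be written one ID per line and used as input for whichever lookup you choose.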

What are you planning to do with this file BTW?

ADD REPLY • link 6.8 years ago by GenoMax 151k

Thanks, then I will use biomaRt in R as you suggest.

I'm going to do a positive selection analysis (dN/dS).

ADD REPLY • link 6.8 years ago by dtejadamartinez ▴ 20

6.8 years ago

finswimmer 16k

Time to summarize :)

Eric Lim showed what the Ensembl ID tells us. Denise - Open Targets found the correct link to the list of ID prefixes (for some reason the link on the help page is wrong; I've contacted Emily_Ensembl about this). And finally genomax gave us a link to an example file one can use.

We need to create a file (prefixes.txt) containing the species prefixes, with a tab between each prefix and its species name:

	ENSPFO Poecilia formosa (Amazon molly)
	ENSJJA Jaculus jaculus (Lesser Egyptian jerboa)
	ENSPCO Propithecus coquereli (Coquerel's sifaka)
	ENSNGA Nannospalax galili (Upper Galilee mountains blind mole rat)
	ENSMFA Macaca fascicularis (Crab-eating macaque)
	ENSMIC Microcebus murinus (Mouse Lemur)
	MGP_CAROLIEiJ_ Mus caroli (Ryukyu mouse)
	ENSFAL Ficedula albicollis (Flycatcher)
	ENSCLA Chinchilla lanigera (Long-tailed chinchilla)
	ENSPEM Peromyscus maniculatus bairdii (Northern American deer mouse)
	ENSTNI Tetraodon nigroviridis (Tetraodon)
	ENSMLU Myotis lucifugus (Microbat)
	ENSPPY Pongo abelii (Orangutan)
	ENS Homo sapiens (Human)
	ENSRBI Rhinopithecus bieti (Black snub-nosed monkey)
	ENSCAF Canis lupus familiaris (Dog)
	ENSTRU Takifugu rubripes (Fugu)
	ENSCAP Cavia aperea (Brazilian guinea pig)
	ENSGMO Gadus morhua (Cod)
	ENSPSI Pelodiscus sinensis (Chinese softshell turtle)
	ENSTBE Tupaia belangeri (Tree Shrew)
	ENSMAU Mesocricetus auratus (Golden Hamster)
	ENSCEL Caenorhabditis elegans (Caenorhabditis elegans)
	MGP_DBA2J_ Mus musculus (Mouse DBA/2J)
	ENSVPA Vicugna pacos (Alpaca)
	ENSSBO Saimiri boliviensis boliviensis (Bolivian squirrel monkey)
	ENSCGR Cricetulus griseus (Chinese hamster CHOK1GS)
	ENSTTR Tursiops truncatus (Dolphin)
	ENSLAF Loxodonta africana (Elephant)
	MGP_LPJ_ Mus musculus (Mouse LP/J)
	MGP_SPRETEiJ_ Mus spretus (Algerian mouse)
	ENSSTO Ictidomys tridecemlineatus (Squirrel)
	MGP_PahariEiJ_ Mus pahari (Shrew mouse)
	ENSCPO Cavia porcellus (Guinea Pig)
	ENSCJA Callithrix jacchus (Marmoset)
	ENSAPL Anas platyrhynchos (Duck)
	ENSHGLM Heterocephalus glaber (Naked mole-rat male)
	ENSPVA Pteropus vampyrus (Megabat)
	ENSTSY Carlito syrichta (Tarsier)
	ENSCSA Chlorocebus sabaeus (Vervet-AGM)
	ENSFCA Felis catus (Cat)
	ENSBTA Bos taurus (Cow)
	ENSSCE Saccharomyces cerevisiae (Saccharomyces cerevisiae)
	ENSMNE Macaca nemestrina (Pig-tailed macaque)
	ENSACA Anolis carolinensis (Anole lizard)
	MGP_AKRJ_ Mus musculus (Mouse AKR/J)
	ENSPAN Papio anubis (Olive baboon)
	ENSMPU Mustela putorius furo (Ferret)
	ENSHGLF Heterocephalus glaber (Naked mole-rat female)
	ENSMLE Mandrillus leucophaeus (Drill)
	ENSCHO Choloepus hoffmanni (Sloth)
	ENSRNO Rattus norvegicus (Rat)
	MGP_CASTEiJ_ Mus musculus castaneus (Mouse CAST/EiJ)
	ENSOGA Otolemur garnettii (Bushbaby)
	ENSOAN Ornithorhynchus anatinus (Platypus)
	ENSSSC Sus scrofa (Pig)
	ENSCAT Cercocebus atys (Sooty mangabey)
	ENSOPR Ochotona princeps (Pika)
	ENSORL Oryzias latipes (Medaka)
	ENSCAN Colobus angolensis palliatus (Angola colobus)
	ENSPCA Procavia capensis (Hyrax)
	ENSMGA Meleagris gallopavo (Turkey)
	ENSNLE Nomascus leucogenys (Gibbon)
	ENSAME Ailuropoda melanoleuca (Panda)
	MGP_CBAJ_ Mus musculus (Mouse CBA/J)
	ENSCSAV Ciona savignyi
	MGP_NZOHlLtJ_ Mus musculus (Mouse NZO/HlLtJ)
	ENSPPA Pan paniscus (Bonobo)
	ENSTGU Taeniopygia guttata (Zebra Finch)
	ENSAMX Astyanax mexicanus (Cave fish)
	MGP_C3HHeJ_ Mus musculus (Mouse C3H/HeJ)
	ENSGAL Gallus gallus (Chicken)
	ENSEEU Erinaceus europaeus (Hedgehog)
	ENSGAC Gasterosteus aculeatus (Stickleback)
	ENSDAR Danio rerio (Zebrafish)
	ENSDOR Dipodomys ordii (Kangaroo rat)
	ENSMEU Notamacropus eugenii (Wallaby)
	ENSPTR Pan troglodytes (Chimpanzee)
	MGP_FVBNJ_ Mus musculus (Mouse FVB/NJ)
	ENSMMU Macaca mulatta (Macaque)
	ENSECA Equus caballus (Horse)
	ENSOAR Ovis aries (Sheep)
	FB Drosophila melanogaster (Fruitfly)
	ENSONI Oreochromis niloticus (Tilapia)
	ENSGGO Gorilla gorilla gorilla (Gorilla)
	MGP_NODShiLtJ_ Mus musculus (Mouse NOD/ShiLtJ)
	ENSLOC Lepisosteus oculatus (Spotted gar)
	ENSFDA Fukomys damarensis (Damara mole rat)
	ENSOCU Oryctolagus cuniculus (Rabbit)
	ENSMOC Microtus ochrogaster (Prairie vole)
	ENSLAC Latimeria chalumnae (Coelacanth)
	ENSCCA Cebus capucinus imitator (Capuchin)
	ENSODE Octodon degus (Degu)
	ENSANA Aotus nancymaae (Ma's night monkey)
	MGP_WSBEiJ_ Mus musculus domesticus (Mouse WSB/EiJ)
	ENSMOD Monodelphis domestica (Opossum)
	ENSCIN Ciona intestinalis
	ENSDNO Dasypus novemcinctus (Armadillo)
	ENSSAR Sorex araneus (Shrew)
	MGP_BALBcJ_ Mus musculus (Mouse BALB/cJ)
	MGP_129S1SvImJ_ Mus musculus (Mouse 129S1/SvImJ)
	MGP_PWKPhJ_ Mus musculus musculus (Mouse PWK/PhJ)
	ENSCHI Capra hircus (Goat)
	MGP_AJ_ Mus musculus (Mouse A/J)
	MGP_C57BL6NJ_ Mus musculus (Mouse C57BL/6NJ)
	ENSXET Xenopus tropicalis (Xenopus)
	ENSXMA Xiphophorus maculatus (Platyfish)
	ENSMUS Mus musculus (Mouse)
	ENSRRO Rhinopithecus roxellana (Golden snub-nosed monkey)
	ENSSHA Sarcophilus harrisii (Tasmanian devil)
	ENSETE Echinops telfairi (Lesser hedgehog tenrec)
	ENSPMA Petromyzon marinus (Lamprey)
	ENSCGR Cricetulus griseus (Chinese hamster CriGri)
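
If the list above is copied from the rendered page, the separator between prefix and name may arrive as a space rather than a tab, while the awk command further down splits on tabs. The following Python sketch (my own illustration, with hypothetical function names `load_prefixes` and `to_tsv`) reads such lines into a dictionary and writes them back out tab-separated. It assumes the prefix itself never contains whitespace.

```python
def load_prefixes(lines):
    """Map each prefix to its species name; split on the first run of whitespace."""
    species = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:  # ignore blank or incomplete lines
            prefix, name = parts
            # A prefix listed twice (ENSCGR in the list above) keeps its last entry.
            species[prefix] = name
    return species


def to_tsv(species):
    """Render the mapping as 'prefix<TAB>name' lines."""
    return "".join(f"{prefix}\t{name}\n" for prefix, name in species.items())


raw = ["\tENSTNI Tetraodon nigroviridis (Tetraodon)\n", "ENS\tHomo sapiens (Human)\n", "\n"]
print(to_tsv(load_prefixes(raw)), end="")
```

Writing the output of `to_tsv` to prefixes.txt gives a file in the layout the awk command expects.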

In the downloaded orthologue FASTA file, look at the IDs to see which feature type they have. In the linked example an ID looks like this:

>ENSTNIP00000017949

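In this example the last character before the digits is always a P (for protein). Other Ensembl files may use a different feature code there, such as G for gene or T for transcript, so check your own headers.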

Now we can modify the headers: first read in prefixes.txt, then iterate over the FASTA file, extract the species prefix from every header line (everything between > and the P that precedes the digits), look up the prefix in our list and append the species name to the line. Note that the three-argument form of match() is a GNU awk (gawk) extension:

$ gawk -F "\t" -v OFS="\t" 'FNR==NR {species[$1]=$2; next} /^>/ && match($0, /^>(.+)P[0-9]+/, id) && (id[1] in species) {print $0, species[id[1]]; next} {print}' prefixes.txt ortho.fa > output.fa

In the output the header line now looks like this:

>ENSTNIP00000017949     Tetraodon nigroviridis (Tetraodon)
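
For anyone without GNU awk, the same idea can be sketched in Python. This is an illustration rather than a drop-in replacement: the function name `annotate` and the `feature_code` parameter are my own additions, the latter so the sketch can also be tried on headers that carry T (transcript) IDs instead of P.

```python
import re


def annotate(fasta_lines, species, feature_code="P"):
    """Append the species name to header lines whose prefix is in `species`.

    The prefix is everything between '>' and the feature code that
    directly precedes the digits, mirroring the awk command above.
    """
    header = re.compile(r"^>(.+)" + re.escape(feature_code) + r"\d+")
    for line in fasta_lines:
        line = line.rstrip("\n")
        m = header.match(line)
        if m and m.group(1) in species:
            yield f"{line}\t{species[m.group(1)]}"
        else:
            yield line  # sequence lines and unknown prefixes pass through


species = {"ENSTNI": "Tetraodon nigroviridis (Tetraodon)"}
for out in annotate([">ENSTNIP00000017949\n", "ATGGCC\n"], species):
    print(out)
```

With real files, `species` would be built from prefixes.txt and `fasta_lines` would be the open handle of ortho.fa, with each yielded line written to output.fa.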

Good team play!

fin swimmer

ADD COMMENT • link 6.8 years ago by finswimmer 16k