A lab mate just mentioned to me that he downloaded the Ensembl gene list for Homo sapiens and was surprised to see that it contains over 50000 different IDs! This surprised me as well (even though I have been using Ensembl for a long time, hehe).
Both of us had in mind a number of around 23000-25000 genes contained in the human genome. So where do the rest of the IDs come from? What genes are they? OK, these roughly 25000 are assumed to be protein coding, so are all the rest RNAs of different flavours (tRNA, rRNA, snRNA etc.)?
So what do you know about this topic? Do you have good references? Please post here.
The Ensembl BioMart has all the answers: Just choose "Ensembl Gene ID" and "Gene Biotype" as attributes, and you get a list of genes and their nature. A simple count on the second column of the resulting file gives this list:
Thanks! I should have thought of this solution :D The processed_transcript annotation seems weird to me. Why annotate them as genes if they are transcripts? Are those observed transcripts for which we don't know the genes? Cheers, michi
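The "simple count on the second column" of the BioMart export described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the export is assumed to be tab-separated with a header row, the function name `count_biotypes` is hypothetical, and the IDs and counts in the sample are made up, not real Ensembl data.

```python
import csv
import io
from collections import Counter


def count_biotypes(handle, column=1, has_header=True):
    """Count how often each value occurs in one column of a TSV export.

    `column` is zero-based, so the default of 1 is the second column,
    assumed here to hold the gene biotype.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if there is one
    counts = Counter()
    for row in reader:
        # Ignore blank or truncated rows instead of failing on them.
        if len(row) > column and row[column]:
            counts[row[column]] += 1
    return counts


# Made-up sample standing in for the downloaded file (not real Ensembl IDs).
sample = (
    "Gene ID\tGene Biotype\n"
    "GENE_A\tprotein_coding\n"
    "GENE_B\tprotein_coding\n"
    "GENE_C\tprocessed_transcript\n"
    "GENE_D\tsnRNA\n"
)

counts = count_biotypes(io.StringIO(sample))
for biotype, n in counts.most_common():
    print(f"{biotype}\t{n}")
```

For a real download, the `io.StringIO(sample)` argument would be replaced by an open file handle, for example `open("mart_export.txt", newline="")`, where the file name is whatever you saved the export as.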
If you decide to use the EnsEMBL Perl API to retrieve any data, you can identify these gene categories using the gene object biotype method! E.g. $gene->biotype(); http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html#ae3e9096786ae5f6f59b276f313e7d471
Sorry to jump in late. Michael's answer is perfect, and I voted it up one. As for the non-coding transcripts, those are imported into Ensembl from a manual annotation group (VEGA/Havana). They have a help page:
There is a whole section on processed transcripts. Some are non-coding, others appear to be based on limited EST evidence, so would be thought to be protein-coding.
I think your question is more related to the different alternative splicing isoforms. There are around 21,000 protein-coding genes in Ensembl, but if you also consider all possible isoforms, not only the longest one, you would easily get around 60,000-70,000 IDs.
Would be useful to know exactly which file was downloaded so we can check the contents.
For the number in the title, I just made a query for the attribute "Ensembl Gene ID" only and hit count.
Yes, it would be useful as the numbers I get directly from Ensembl are different than what Michael sees at BioMart.