Question

53893 Genes In Ensembl?

4

13.4 years ago

Michi ▴ 990

Hello Biostars

A lab mate just mentioned to me that he downloaded the Ensembl gene list for Homo sapiens and was surprised to see that it contains over 50,000 different IDs! This surprised me as well (even though I have been using Ensembl for a long time, hehe).

Both of us had in mind a number of around 23,000-25,000 genes in the human genome. So where do the rest of the IDs come from? What genes are they? OK, these roughly 25,000 are assumed to be protein coding, so are all the rest RNAs of different flavours (tRNA, rRNA, snRNA etc.)?

So what do you know about this topic? Do you have good references? Please post here.

Cheers Michi

human genome ensembl • 5.3k views

ADD COMMENT • link updated 13.4 years ago by Giulietta - Ensembl Helpdesk ★ 1.2k • written 13.4 years ago by Michi ▴ 990

1

Would be useful to know exactly which file was downloaded so we can check the contents.

ADD REPLY • link 13.4 years ago by Neilfws 49k

1

For the number in the title, I just made a BioMart query with only the attribute "Ensembl Gene ID" and hit "Count".

ADD REPLY • link 13.4 years ago by Michi ▴ 990

0

Yes, it would be useful as the numbers I get directly from Ensembl are different than what Michael sees at BioMart.

ADD REPLY • link 13.4 years ago by Larry_Parnell 16k

score 19 · Answer 1 · 2011-07-06

19

13.4 years ago

Michael Kuhn 5.0k

The Ensembl BioMart has all the answers: Just choose "Ensembl Gene ID" and "Gene Biotype" as attributes, and you get a list of genes and their nature. A simple count on the second column of the resulting file gives this list:

21494 protein_coding
11966 pseudogene
9274 processed_transcript
1951 snRNA
1809 miRNA
1531 lincRNA
1523 snoRNA
1190 misc_RNA
 787 scRNA_pseudogene
 580 Mt_tRNA_pseudogene
 535 rRNA
 179 rRNA_pseudogene
 176 LRG_gene
 163 IG_V_gene
 151 IG_V_pseudogene
 128 tRNA_pseudogene
  83 IG_J_gene
  73 snoRNA_pseudogene
  73 snRNA_pseudogene
  66 TR_V_gene
  30 IG_D_gene
  26 polymorphic_pseudogene
  22 Mt_tRNA
  21 TR_V_pseudogene
  16 IG_C_gene
  15 miRNA_pseudogene
  13 TR_J_gene
   7 IG_C_pseudogene
   3 misc_RNA_pseudogene
   3 TR_C_gene
   3 IG_J_pseudogene
   2 Mt_rRNA
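The "simple count on the second column" can be reproduced in a few lines of Python. This is only a sketch: the function name `count_biotypes`, the column header names, and the placeholder gene IDs in the sample are assumptions for illustration, and a real BioMart export may use different headers.

```python
import csv
import io
from collections import Counter


def count_biotypes(handle, column="Gene Biotype"):
    """Count how many rows (genes) carry each value in the biotype column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


# Tiny made-up sample standing in for a BioMart TSV export.
# The IDs are placeholders, not real Ensembl identifiers.
SAMPLE = (
    "Ensembl Gene ID\tGene Biotype\n"
    "GENE_0001\tprotein_coding\n"
    "GENE_0002\tpseudogene\n"
    "GENE_0003\tprotein_coding\n"
    "GENE_0004\tmiRNA\n"
)

counts = count_biotypes(io.StringIO(SAMPLE))

# Print in the same "count biotype" layout as the list above, most frequent first.
for biotype, n in counts.most_common():
    print(f"{n:5d} {biotype}")
```

For a real export, pass an open file handle instead of the `io.StringIO` sample. Summing the 32 counts listed above gives 53,893, the number in the question title.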

ADD COMMENT • link 13.4 years ago by Michael Kuhn 5.0k

2

You're right, this is sort of curious. We do know the gene, but we don't know the function of the product. The only definition I could find: "A transcript for which no open reading frame has been identified and for which no other function has been determined." http://sequenceontology.org/wiki/index.php/Category:SO:0001503_!_processed_transcript

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.4 years ago by Michael Kuhn 5.0k

0

Thanks! I should have thought of this solution :D The processed_transcript annotation seems weird to me. Why annotate as genes if they are transcripts? cheers, michi

ADD REPLY • link 13.4 years ago by Michi ▴ 990

0

Thanks! I should have thought of this solution :D The processed_transcript annotation seems weird to me. Are those observed transcripts for which we don't know the genes? cheers, michi

ADD REPLY • link 13.4 years ago by Michi ▴ 990

0

If you decide to use the EnsEMBL Perl API to retrieve any data, you can identify these gene categories using the gene object's biotype method! E.g. $gene->biotype(); http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html#ae3e9096786ae5f6f59b276f313e7d471

ADD REPLY • link 13.0 years ago by Steve Moss 2.3k

score 8 · Answer 2 · 2011-07-06

8

13.4 years ago

Larry_Parnell 16k

From http://useast.ensembl.org/Homo_sapiens/Info/StatsTable?db=core

Gene counts:

Known protein-coding genes: 20,599

Novel protein-coding genes: 895

Pseudogenes: 14,012

RNA genes: 8,563

Immunoglobulin/T-cell receptor gene segments: 556

Gene exons: 631,122

Gene transcripts: 174,416
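Known plus novel protein-coding genes (20,599 + 895) equal the 21,494 protein_coding entries from the BioMart answer. The other rows can be matched to the BioMart biotypes with a little arithmetic, sketched below. The grouping rules (`is_pseudogene`, `is_ig_tr`, `is_rna_gene`) are a guess at how the StatsTable categories were built, not something documented in this thread; under those assumptions the numbers line up exactly.

```python
# Biotype counts as listed in the BioMart answer in this thread.
BIOMART = {
    "protein_coding": 21494, "pseudogene": 11966, "processed_transcript": 9274,
    "snRNA": 1951, "miRNA": 1809, "lincRNA": 1531, "snoRNA": 1523,
    "misc_RNA": 1190, "scRNA_pseudogene": 787, "Mt_tRNA_pseudogene": 580,
    "rRNA": 535, "rRNA_pseudogene": 179, "LRG_gene": 176, "IG_V_gene": 163,
    "IG_V_pseudogene": 151, "tRNA_pseudogene": 128, "IG_J_gene": 83,
    "snoRNA_pseudogene": 73, "snRNA_pseudogene": 73, "TR_V_gene": 66,
    "IG_D_gene": 30, "polymorphic_pseudogene": 26, "Mt_tRNA": 22,
    "TR_V_pseudogene": 21, "IG_C_gene": 16, "miRNA_pseudogene": 15,
    "TR_J_gene": 13, "IG_C_pseudogene": 7, "misc_RNA_pseudogene": 3,
    "TR_C_gene": 3, "IG_J_pseudogene": 3, "Mt_rRNA": 2,
}


# Assumed grouping rules (hypothetical, inferred from the biotype names only).
def is_pseudogene(biotype):
    return biotype.endswith("pseudogene")


def is_ig_tr(biotype):
    return biotype.startswith(("IG_", "TR_"))


def is_rna_gene(biotype):
    return "RNA" in biotype and not is_pseudogene(biotype)


def total(predicate):
    """Sum the counts of all biotypes that satisfy the predicate."""
    return sum(n for biotype, n in BIOMART.items() if predicate(biotype))


protein_coding = BIOMART["protein_coding"]
pseudogenes = total(is_pseudogene)
rna_genes = total(is_rna_gene)
ig_tr_segments = total(is_ig_tr)
# IG/TR pseudogenes satisfy both the pseudogene and the IG/TR rule.
ig_tr_pseudogenes = total(lambda b: is_ig_tr(b) and is_pseudogene(b))
# Biotypes that match none of the StatsTable rows under these rules.
unlisted = BIOMART["processed_transcript"] + BIOMART["LRG_gene"]

stats_sum = protein_coding + pseudogenes + rna_genes + ig_tr_segments
grand_total = stats_sum - ig_tr_pseudogenes + unlisted

print(pseudogenes, rna_genes, ig_tr_segments, stats_sum, grand_total)
```

If this grouping is right, the 9,450 IDs that BioMart reports beyond the StatsTable rows are the processed_transcript and LRG_gene entries, and the 182 IG/TR pseudogenes are counted under both "Pseudogenes" and "gene segments" on the stats page.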

ADD COMMENT • link 13.4 years ago by Larry_Parnell 16k

0

Thanks! Unfortunately, it doesn't let me mark it as a second correct answer.

ADD REPLY • link 13.4 years ago by Michi ▴ 990

score 6 · Answer 3 · 2011-07-13

Sorry to jump in late. Michael's answer is perfect, and I voted it up one. As for the non-coding transcripts, those are imported into Ensembl from a manual annotation group (VEGA/Havana). They have a help page:

http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

There is a whole section on processed transcripts. Some are non-coding, others appear to be based on limited EST evidence, so would be thought to be protein-coding.

Hope this helps.

score 0 · Answer 4 · 2011-07-06

0

13.4 years ago

scapella ▴ 390

I think your question is more related to alternative splicing isoforms. There are around 21,000 protein-coding genes in Ensembl, but if you also consider all possible isoforms, not only the longest one, you'd easily get around 60,000-70,000 IDs.