Forum: Why is the number of protein-coding genes dropping?
8.5 years ago
jimmy_zeng ▴ 90

I just looked through the GENCODE statistics: http://www.gencodegenes.org/archive_stats.html

I noticed that the number of protein-coding genes is on the decrease.

This is very interesting and it really puzzles me.

Version 21 (June 2014 freeze, GRCh38) - Ensembl 77 --> 19881

Version 20 (April 2014 freeze, GRCh38) - Ensembl 76 --> 19942

Version 19 (July 2013 freeze, GRCh37) - Ensembl 74 --> 20345

Version 18 (April 2013 freeze, GRCh37) - Ensembl 73 --> 20318
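To make the trend concrete, here is a minimal sketch that computes the change between consecutive releases, using only the four counts quoted above. The function name `release_deltas` is just an illustrative choice.

```python
# Protein-coding gene counts quoted above (GENCODE version -> count).
counts = {18: 20318, 19: 20345, 20: 19942, 21: 19881}


def release_deltas(counts):
    """Return {(older, newer): change in count} for consecutive versions."""
    releases = sorted(counts)
    return {
        (old, new): counts[new] - counts[old]
        for old, new in zip(releases, releases[1:])
    }


deltas = release_deltas(counts)
for (old, new), delta in deltas.items():
    print(f"v{old} -> v{new}: {delta:+d}")
```

The count rises slightly from version 18 to 19, and the largest drop is between versions 19 and 20, which is also where the assembly changes from GRCh37 to GRCh38.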

So, is it not clear how to judge whether a gene is protein coding or not?

Could there be small differences among individuals?

BTW, according to HGNC there are just 19003 protein-coding genes: http://www.genenames.org/cgi-bin/statistics

protein ensembl gene gencode • 2.1k views
8.5 years ago
Denise CS ★ 5.2k

That shows how genome annotation is a fluid and dynamic field that needs constant re-assessment. We annotate our genes based on biological evidence. Perhaps the evidence available for release 74 differed from that for release 73, so we were able to annotate additional protein-coding genes in e74 compared with e73? There could be other explanations too. Releases 76 and 77 above are on a different assembly from 73 and 74: GRCh38 versus GRCh37. Different assemblies can lead to some discrepancies (gaps no longer present, sequencing errors removed, etc.), so something that was called protein coding in GRCh37 may no longer be called that in GRCh38. I'm not surprised by the different numbers (whether they increase or decrease). There is a wide range of papers on this theme that give some biological context and explanation. I'm not sure how HGNC got their numbers; maybe they list just the protein-coding genes with official gene symbols? Will ask them right now :)
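If you want to check the numbers for a given release yourself, one rough way is to count gene records by type in the annotation file. The sketch below is only an illustration, not the method used to produce the official statistics. It assumes a GENCODE-style GTF in which each `gene` feature carries a `gene_type "..."` attribute (Ensembl's own GTF files, as far as I know, call this attribute `gene_biotype`, so the pattern would need adjusting there). The function names are hypothetical.

```python
import gzip
import re
from collections import Counter

# Assumption: GENCODE-style attribute, e.g. gene_type "protein_coding";
GENE_TYPE = re.compile(r'gene_type "([^"]+)"')


def count_gene_types(lines):
    """Count 'gene' features per gene_type from an iterable of GTF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_TYPE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def count_gene_types_in_file(path):
    """Same as count_gene_types, reading a plain or gzipped GTF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_gene_types(handle)
```

For example, `count_gene_types_in_file("annotation.gtf.gz")["protein_coding"]` would give the count for whatever file sits at that (hypothetical) local path. Counts obtained this way can differ from the published statistics depending on which regions and gene types are included.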


As Denise wrote, there are many reasons for the changing number of genes. It largely comes down to different assemblies and to the different amounts and types of information used in the annotation process. Another important source of variation is that different resources define a gene differently.


I have heard from my HGNC colleagues. They list only the genes that have approved symbols. This could explain some of the discrepancy. I was told it's not uncommon to have discrepancies between different resources.


Thank you very much. I see.
