GENCODE has a wonderful breakdown of the number of coding/non-coding genes and transcripts here...
http://www.gencodegenes.org/stats.html
Does a similar breakdown exist for RefSeq anywhere? Thanks!
If you take a look at this hg19/GRCh37 GFF file from NCBI, here is what is in there:
495245 CDS
1 D_loop
33314 Genomic
1 RNase_MRP_RNA
1 RNase_P_RNA
2 SRP_RNA
4 Y_RNA
22 antisense_RNA
19086 cDNA_match
618452 exon
30661 gene
6405 lnc_RNA
49861 mRNA
22621 match
3097 miRNA
31 ncRNA
2046 primary_transcript
23 rRNA
297 region
1 sequence_feature
72 snRNA
393 snoRNA
698 tRNA
1 telomerase_RNA
5732 transcript
3 vault_RNA
For GRCh38, a similar file can be found here, with the following statistics:
1413848 CDS
1 D_loop
33893 Genomic
26 Genomic%2CXM/XP/XR
1 RNase_MRP_RNA
1 RNase_P_RNA
2 SRP_RNA
15 V_gene_segment
4 Y_RNA
22 antisense_RNA
13572 cDNA_match
24 centromere
1856502 exon
43504 gene
28117 lnc_RNA
114575 mRNA
22834 match
3038 miRNA
31 ncRNA
2025 primary_transcript
23 rRNA
558 region
304 sequence_feature
62 snRNA
389 snoRNA
629 tRNA
1 telomerase_RNA
16011 transcript
3 vault_RNA
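Counts like the ones above can be produced by tallying one column of the GFF3 file. The following is only a minimal sketch, assuming the labels are taken from the feature-type column (column 3); a few of the labels listed above may come from a different field. The function names and the tiny example records are hypothetical and are not taken from the NCBI files.

```python
import gzip
from collections import Counter


def count_feature_types(lines):
    """Tally the feature type (column 3) of every record in GFF3 lines."""
    counts = Counter()
    for line in lines:
        # Skip directives/comments (start with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip anything that is not a tab-separated feature record.
        if len(fields) < 3:
            continue
        counts[fields[2]] += 1
    return counts


def count_gff_file(path):
    """Open a plain or gzip-compressed GFF3 file and tally its feature types."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_feature_types(handle)


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3\n",
    "chr1\tRefSeq\tgene\t1\t100\t.\t+\t.\tID=gene0\n",
    "chr1\tRefSeq\tmRNA\t1\t100\t.\t+\t.\tID=rna0;Parent=gene0\n",
    "chr1\tRefSeq\texon\t1\t50\t.\t+\t.\tID=exon0;Parent=rna0\n",
    "chr1\tRefSeq\texon\t60\t100\t.\t+\t.\tID=exon1;Parent=rna0\n",
]

for feature, n in sorted(count_feature_types(example).items()):
    print(f"{n:>8} {feature}")
```

For a real annotation file, `count_gff_file` would be called with the path of the downloaded file (plain or `.gz`), and the same loop would print one line per feature type.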
I don't believe RefSeq provides as good a breakdown of transcript types as GENCODE; however, you may be interested in the following resources:
Generally, I think you'll find that GENCODE is more comprehensive for non-coding genes; however, for the majority of these, the exact function is entirely unknown. Most people filter them out of, for example, RNA-seq analyses, partly so that fewer tests are performed and the false discovery rate correction is less stringent. On the other hand, RefSeq has the feel of a well-curated resource.
Kevin
The first link above is to a published manuscript in which GENCODE transcripts were compared to those of RefSeq, using GRCh37 / hg19 transcripts. Additional File 3 contains a table comparing GENCODE to RefSeq NR, which according to RefSeq are non-protein-coding transcripts (or transcripts unlikely to have protein-coding potential).
Additional files can be found here: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2#MOESM6
No problem, b10hazard - that's the nature of the game here. It's not a competition to see who can have the most accepted answers. I was actually about to say that you should accept the answer of GenoMax because it was a greater fit for your question. GenoMax is also much more experienced than I.
Thanks for the diplomacy GenoMax :)
Good answer - I was not aware that RefSeq had been compiling this info
This is not RefSeq but the files are from NCBI's human genome resource. Should be close enough.
How did you parse that information from the GFF3 file?
That works! Thanks!