Where can I get a breakdown of RefSeq annotation statistics?
6.8 years ago
b10hazard ▴ 30

GENCODE has a wonderful breakdown of the number of coding/non-coding genes and transcripts here:

http://www.gencodegenes.org/stats.html

Does a similar breakdown exist for RefSeq anywhere? Thanks!

refseq gencode annotation • 2.1k views
6.8 years ago
GenoMax 147k

If you take a look at this hg19/GRCh37 GFF file from NCBI, here is the count of each feature type (column 3) in it:

495245 CDS
     1 D_loop
 33314 Genomic
     1 RNase_MRP_RNA
     1 RNase_P_RNA
     2 SRP_RNA
     4 Y_RNA
    22 antisense_RNA
 19086 cDNA_match
618452 exon
 30661 gene
  6405 lnc_RNA
 49861 mRNA
 22621 match
  3097 miRNA
    31 ncRNA
  2046 primary_transcript
    23 rRNA
   297 region
     1 sequence_feature
    72 snRNA
   393 snoRNA
   698 tRNA
     1 telomerase_RNA
  5732 transcript
     3 vault_RNA

For GRCh38, a similar file can be found here, with the following statistics:

1413848 CDS
      1 D_loop
  33893 Genomic
     26 Genomic%2CXM/XP/XR
      1 RNase_MRP_RNA
      1 RNase_P_RNA
      2 SRP_RNA
     15 V_gene_segment
      4 Y_RNA
     22 antisense_RNA
  13572 cDNA_match
     24 centromere
1856502 exon
  43504 gene
  28117 lnc_RNA
 114575 mRNA
  22834 match
   3038 miRNA
     31 ncRNA
   2025 primary_transcript
     23 rRNA
    558 region
    304 sequence_feature
     62 snRNA
    389 snoRNA
    629 tRNA
      1 telomerase_RNA
  16011 transcript
      3 vault_RNA

Good answer - I was not aware that RefSeq had been compiling this info

This is not RefSeq but the files are from NCBI's human genome resource. Should be close enough.

How did you parse that information from the GFF3 file?

awk -F'\t' '!/^#/ {print $3}' interim_GRCh38.p10_top_level_2017-01-13.gff3 | sort | uniq -c
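
If you would rather do the same tally in Python, here is a minimal sketch of what the one-liner above does: read the GFF3 line by line, skip comment lines, and count the values in the third (feature type) column. The function names `count_feature_types` and `print_counts` are just illustrative, not part of any library, and the sketch assumes an uncompressed, tab-delimited GFF3 file.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally column 3 (feature type) of GFF3 lines, skipping comments."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and '#'-prefixed directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only count rows that actually have a third column.
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def print_counts(path):
    """Print 'count feature_type' pairs, sorted by feature type."""
    with open(path) as handle:
        counts = count_feature_types(handle)
    for feature, n in sorted(counts.items()):
        print(f"{n:>8} {feature}")
```

Calling `print_counts("interim_GRCh38.p10_top_level_2017-01-13.gff3")` should give the same kind of table as above, although the sort order may differ slightly from `sort | uniq -c` depending on your locale.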

That works! Thanks!

6.8 years ago

I don't believe RefSeq provides as good a breakdown of transcript types as GENCODE; however, you may be interested in the following resources:

Generally, I think you'll find that GENCODE is more comprehensive for non-coding genes; however, for the majority of these, the exact function is entirely unknown. Most people filter them out of, for example, RNA-seq experiments, in part to reduce the number of tests and so make the false discovery rate correction less stringent. On the other hand, RefSeq has the feel of a well-curated resource.

Kevin

What I was really hoping to get was the number of full-length non-coding transcripts that RefSeq has for hg19. Is there any way to get this information?

The first link above is to a published manuscript in which GENCODE transcripts were compared to those of RefSeq, using GRCh37 / hg19 transcripts. Additional File 3 of that paper contains a table comparing GENCODE to RefSeq NR, which according to RefSeq are non-protein-coding transcripts (or transcripts unlikely to have protein-coding potential).

Additional files can be found here: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2#MOESM6
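
If you want a rough count straight from an NCBI GFF3 file instead, one option is to group unique transcript accessions by their RefSeq prefix: NR_ (curated) and XR_ (predicted) are the non-coding RNA prefixes, while NM_ and XM_ are mRNAs. The sketch below is only that, a sketch. It assumes the accession is stored under a `transcript_id` key in column 9, which you should check against your own file (pass a different `key` if not), and the function names are made up for illustration. It also says nothing about whether a transcript is full length.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn a GFF3 attribute string 'k1=v1;k2=v2' into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def count_transcripts_by_prefix(lines, key="transcript_id"):
    """Count unique transcript accessions per prefix (NM, NR, XM, XR, ...).

    Assumption: the accession lives under `key` in column 9. Exon and CDS
    rows may repeat it, so accessions are de-duplicated before counting.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        accession = parse_attributes(fields[8]).get(key)
        if accession:
            seen.add(accession)
    # 'NR_046018.2' -> 'NR'
    return Counter(acc.split("_", 1)[0] for acc in seen)
```

With the result in `counts`, `counts["NR"] + counts["XR"]` would be the number of distinct non-coding transcript accessions, under the assumptions above.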

Thanks for your help with this, Kevin. Those papers were excellent resources and they covered a lot of ground. I accepted GenoMax's answer mostly because that output file provided a very flexible way for me to extract the metrics I was looking for.

You are able to "accept" more than one answer so feel free to accept @Kevin's too.

No problem, b10hazard - that's the nature of the game here. It's not a competition to see who can have the most accepted answers. I was actually about to say that you should accept GenoMax's answer because it was a better fit for your question. GenoMax is also much more experienced than I.

Thanks for the diplomacy GenoMax :)
