Question

Where can I get a breakdown of RefSeq annotation statistics?

1

7.3 years ago

b10hazard ▴ 30

GENCODE has a wonderful breakdown of the number of coding/non-coding genes and transcripts here:

http://www.gencodegenes.org/stats.html

Does a similar breakdown exist for RefSeq anywhere? Thanks!

refseq gencode annotation • 2.4k views

ADD COMMENT • link updated 7.3 years ago by GenoMax 151k • written 7.3 years ago by b10hazard ▴ 30

score 3 · Accepted Answer · 2018-02-21

7.3 years ago

GenoMax 151k

If you take a look at this hg19/GRCh37 GFF file from NCBI, here is a count of each feature type (third field) it contains:

 495245 CDS
      1 D_loop
  33314 Genomic
      1 RNase_MRP_RNA
      1 RNase_P_RNA
      2 SRP_RNA
      4 Y_RNA
     22 antisense_RNA
  19086 cDNA_match
 618452 exon
  30661 gene
   6405 lnc_RNA
  49861 mRNA
  22621 match
   3097 miRNA
     31 ncRNA
   2046 primary_transcript
     23 rRNA
    297 region
      1 sequence_feature
     72 snRNA
    393 snoRNA
    698 tRNA
      1 telomerase_RNA
   5732 transcript
      3 vault_RNA

For GRCh38, a similar file can be found here, with the following statistics:

1413848 CDS
      1 D_loop
  33893 Genomic
     26 Genomic%2CXM/XP/XR
      1 RNase_MRP_RNA
      1 RNase_P_RNA
      2 SRP_RNA
     15 V_gene_segment
      4 Y_RNA
     22 antisense_RNA
  13572 cDNA_match
     24 centromere
1856502 exon
  43504 gene
  28117 lnc_RNA
 114575 mRNA
  22834 match
   3038 miRNA
     31 ncRNA
   2025 primary_transcript
     23 rRNA
    558 region
    304 sequence_feature
     62 snRNA
    389 snoRNA
    629 tRNA
      1 telomerase_RNA
  16011 transcript
      3 vault_RNA

0

Good answer - I was not aware that RefSeq had been compiling this info

ADD REPLY • link 7.3 years ago by Kevin Blighe 89k

0

Strictly speaking, these files are not from RefSeq itself; they come from NCBI's human genome resource, which should be close enough.

ADD REPLY • link 7.3 years ago by GenoMax 151k

0

How did you parse that information from the GFF3 file?

ADD REPLY • link 7.3 years ago by b10hazard ▴ 30

1

cat interim_GRCh38.p10_top_level_2017-01-13.gff3 | awk '{print $3}' | sort | uniq -c
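
A tab-aware variant of the one-liner above, as a minimal sketch. GFF3 columns are tab-separated, so splitting on any whitespace can shift the fields whenever the source column (column 2) itself contains a space. Entries such as `Genomic` in the counts above may be a symptom of that, although this is an assumption and has not been checked against these particular files. The function name `count_gff_features` is made up for illustration.

```shell
# Count GFF3 feature types (column 3), splitting on tabs only and
# skipping comment/directive lines that start with '#'.
# count_gff_features is a hypothetical helper name, not an NCBI tool.
count_gff_features() {
    awk -F'\t' '!/^#/ && NF >= 9 { print $3 }' "$1" | sort | uniq -c
}

# Usage:
#   count_gff_features interim_GRCh38.p10_top_level_2017-01-13.gff3
```

The `NF >= 9` guard simply ignores lines that do not have the nine columns a GFF3 record is expected to have.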

ADD REPLY • link 7.3 years ago by GenoMax 151k

0

That works! Thanks!

ADD REPLY • link 7.3 years ago by b10hazard ▴ 30

score 2 · Accepted Answer · 2018-02-21

7.3 years ago

Kevin Blighe 89k

I don't believe RefSeq provides as good a breakdown of transcript types as GENCODE; however, you may be interested in the following resources:

Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction
NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy
A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification

Generally, I think you'll find that GENCODE is more comprehensive for non-coding genes; however, for the majority of these, the exact function is entirely unknown. Most people filter them out of, for example, RNA-seq experiments, partly so that fewer tests are performed and the false discovery rate threshold is less stringent. On the other hand, RefSeq has the feel of a well-curated resource.

Kevin

0

What I was really hoping to get was the number of full-length non-coding transcripts that RefSeq has for hg19. Is there any way to get this information?

ADD REPLY • link 7.3 years ago by b10hazard ▴ 30

0

The first link above is to a published manuscript in which GENCODE transcripts were compared to those of RefSeq, using GRCh37/hg19 transcripts. Additional File 3 of that paper contains a table comparing GENCODE to RefSeq NR, which according to RefSeq are non-protein-coding transcripts (or transcripts unlikely to have protein-coding potential).

Additional files can be found here: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2#MOESM6
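
If you would rather count from an annotation file directly, a rough sketch follows. It assumes the GFF3 records carry RefSeq accessions in the attributes column (column 9) and treats each distinct NR_ accession as one non-coding transcript. The function name `count_nr_accessions` is made up for illustration. It does not check whether a transcript is full length, and predicted XR_ models are left out.

```shell
# Count distinct RefSeq NR_ accessions mentioned in the attributes
# column (column 9) of a GFF3 file.
# Assumptions: the file uses RefSeq accessions such as NR_046018.2 in
# its attributes; count_nr_accessions is a hypothetical helper name.
count_nr_accessions() {
    awk -F'\t' '!/^#/ && NF >= 9 { print $9 }' "$1" \
        | grep -oE 'NR_[0-9]+\.[0-9]+' \
        | sort -u \
        | wc -l
}

# Usage (the file name is a placeholder):
#   count_nr_accessions your_GRCh37_annotation.gff3
```

Because the accessions are de-duplicated with `sort -u`, a transcript placed on more than one sequence in the file is only counted once.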

ADD REPLY • link 7.3 years ago by Kevin Blighe 89k

0

Thanks for your help with this, Kevin. Those papers were excellent resources and they covered a lot of ground. I accepted GenoMax's answer mostly because that output file provided a very flexible way for me to extract the metrics I was looking for.

ADD REPLY • link 7.3 years ago by b10hazard ▴ 30

0

You are able to "accept" more than one answer so feel free to accept @Kevin's too.

ADD REPLY • link 7.3 years ago by GenoMax 151k

0

No problem, b10hazard - that's the nature of the game here. It's not a competition to see who can have the most accepted answers. I was actually about to say that you should accept GenoMax's answer because it was a better fit for your question. GenoMax is also much more experienced than I.

Thanks for the diplomacy GenoMax :)

ADD REPLY • link 7.3 years ago by Kevin Blighe 89k