Hello,
I have a question about RefSeq annotation reports, particularly the one for the purple urchin. The RefSeq Annotation Report indicates that there are 258,355 exons. However, the GFF file for this assembly has 442,528 lines in which column three has the value "exon". What would explain this discrepancy?
For example, the following command returns 442528:
wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7668/102/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | awk '$3 == "exon"' | wc -l
I see that there is a note on the annotation report indicating that the counts do not include pseudogenes. There is also an additional note next to the exons row that states:
"Exons in mRNAs, misc_RNAs and ncRNAs of class lncRNA. Does not include tRNAs, rRNAs or ncRNAs of class other than lncRNA. Exons shared by multiple transcripts are counted once."
I'm not sure that this could account for the 184,173 exon difference. I am hoping to compute sequencing coverage statistics across exons.
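One quick way to test the "counted once" part of that note is to count distinct exon coordinates instead of exon lines. The following is only a sketch: the function name `count_unique_exons` is made up here, and it assumes that two exons are "the same" when their sequence ID, start, end, and strand all match, which may not be exactly how the report defines a shared exon.

```shell
# Count distinct exon intervals rather than exon lines.
# Reads GFF3 on stdin; the key is seqid + start + end + strand (an assumption).
count_unique_exons() {
  awk -F'\t' '$3 == "exon" && !seen[$1, $4, $5, $7]++ { n++ } END { print n + 0 }'
}

# Made-up example: two transcripts share the exon at 100-200, so three
# exon lines collapse to two distinct exons.
printf 'chr1\tGnomon\texon\t100\t200\t.\t+\t.\tParent=rna-A\nchr1\tGnomon\texon\t100\t200\t.\t+\t.\tParent=rna-B\nchr1\tGnomon\texon\t300\t400\t.\t+\t.\tParent=rna-B\n' | count_unique_exons
# prints 2
```

To try it on the real annotation, pipe the `gunzip --stdout` output from the command above into `count_unique_exons` in place of the `awk ... | wc -l` step.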
The RefSeq version of the genome is in a different location but produces identical results:
RefSeq annotation report says
Excluding just tRNA and rRNA gets you to the closest number.
Thanks. This was helpful. I realized that what I probably want is just the exons in coding transcripts. The closest I was able to get to 245,575 was 246,202.
I'm not sure whether grep -v rRNA is appropriate, because it matches the string anywhere on the line and would exclude too many records. For example, an exon of a pre-rRNA-processing protein would be excluded.
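A narrower alternative to a substring match, offered as a sketch rather than a tested recipe: match the `gbkey` attribute in column nine exactly. This assumes NCBI-style GFF3, where each exon line carries a `gbkey` attribute naming its transcript type (`gbkey=mRNA`, `gbkey=tRNA`, `gbkey=rRNA`, and so on). The function name `drop_trna_rrna_exons` is made up for illustration.

```shell
# Keep exon lines unless their gbkey attribute is exactly tRNA or rRNA.
# Anchoring on ";" or the ends of column 9 avoids matching product names
# that merely contain the text "rRNA".
drop_trna_rrna_exons() {
  awk -F'\t' '$3 == "exon" && $9 !~ /(^|;)gbkey=(tRNA|rRNA)(;|$)/'
}

# Usage (file name taken from the command in the question):
# gunzip --stdout GCF_000002235.5_Spur_5.0_genomic.gff.gz | drop_trna_rrna_exons | wc -l
```

With this filter an mRNA exon whose product is a pre-rRNA-processing protein is kept, because only the `gbkey` value is tested.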
I believe the tRNAs and rRNAs can be filtered with:
I'm not sure how to distinguish between ncRNAs of class lncRNA and ncRNAs of class other than lncRNA.
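One possible way to separate lncRNAs from other ncRNAs is to look at the feature type of the parent transcript rather than at the exon line itself. The sketch below rests on several assumptions about NCBI-style GFF3 that should be checked against the actual file: lncRNA transcripts have `lnc_RNA` in column three while other ncRNA classes use types such as `snoRNA` or `snRNA`, misc_RNAs appear as `transcript`, pseudogene transcripts carry `pseudo=true`, each exon has a single `Parent`, and every transcript line appears before its exon lines. The function name `count_report_like_exons` is made up, and the result is not guaranteed to reproduce the report's number.

```shell
# Count distinct exons whose parent transcript is an mRNA, lnc_RNA, or
# transcript (misc_RNA) feature that is not flagged as a pseudogene.
# Reads GFF3 on stdin in a single pass (parents must precede their exons).
count_report_like_exons() {
  awk -F'\t' '
    # Return the value of one key from the attribute column (column 9).
    function attr(key,    cnt, i, parts) {
      cnt = split($9, parts, ";")
      for (i = 1; i <= cnt; i++) {
        if (index(parts[i], key "=") == 1) return substr(parts[i], length(key) + 2)
      }
      return ""
    }
    # Remember the IDs of the transcript types named in the report note.
    $3 == "mRNA" || $3 == "lnc_RNA" || $3 == "transcript" {
      if ($9 !~ /(^|;)pseudo=true(;|$)/) keep[attr("ID")] = 1
    }
    # Count each exon interval once if its parent was remembered.
    $3 == "exon" {
      p = attr("Parent")
      if ((p in keep) && !seen[$1, $4, $5, $7]++) n++
    }
    END { print n + 0 }
  '
}

# Usage (file name taken from the command in the question):
# gunzip --stdout GCF_000002235.5_Spur_5.0_genomic.gff.gz | count_report_like_exons
```

Restricting the first pattern to `$3 == "mRNA"` alone would give exons in coding transcripts only, under the same assumptions.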