Hello all,
I have a question regarding the reference transcriptome corresponding to the hg19 reference genome.
I am performing an RNA-seq analysis on the Galaxy platform. I aligned all samples to the hg19 reference genome using the reference provided on the UCSC download page. For the annotation I used the iGenomes UCSC hg19 gene annotation (https://usegalaxy.org/library_common/ldda_info?library_id=4ab3a886a95d362e&show_deleted=False&cntrller=library&folder_id=57764628a4cf79d3&use_panels=False&id=05ccf9e20303b392), and after alignment I continued with gene counting and DE analysis. So far so good.
Then I wanted to collect transcript counts using salmon. To do so, I have to provide salmon with a reference transcriptome. For this purpose I downloaded the refMrna.fa.gz file, again from the UCSC download page (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/).
After obtaining the transcript counts I wanted to find the gene symbol corresponding to each transcript ID (RefSeq ID). For this purpose I used the above-mentioned GTF file. I discovered that around 50% of the transcripts in the refMrna.fa.gz file do not have a corresponding gene symbol in the GTF file.
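For reference, the check described above can be sketched roughly as follows. This is only a minimal illustration, not the exact procedure I ran: the function names are made up, it assumes the GTF carries `transcript_id` and `gene_name` (falling back to `gene_id`) in its attribute column, and it assumes the RefSeq ID is the first token of each FASTA header, with any `.N` version suffix stripped before comparison.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_transcript_to_gene(gtf_lines, gene_key="gene_name"):
    """Build a transcript_id -> gene symbol map from GTF lines.

    Assumption: the symbol is stored under `gene_key`; if that key is
    missing, `gene_id` is used instead.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        tx = attrs.get("transcript_id")
        gene = attrs.get(gene_key) or attrs.get("gene_id")
        if tx and gene:
            mapping[tx] = gene
    return mapping


def fasta_ids(fasta_lines, strip_version=True):
    """Collect the first token of every FASTA header line."""
    ids = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        if not parts:
            continue  # empty header
        token = parts[0]
        if strip_version:
            token = token.split(".")[0]  # NM_000014.5 -> NM_000014
        ids.append(token)
    return ids


def transcripts_without_symbol(fasta_lines, gtf_lines):
    """Return the FASTA transcript IDs that the GTF cannot map to a gene."""
    mapping = gtf_transcript_to_gene(gtf_lines)
    return [tx for tx in fasta_ids(fasta_lines) if tx not in mapping]
```

With real files, the two arguments would be open file handles, e.g. `gzip.open("refMrna.fa.gz", "rt")` for the transcriptome and `open("genes.gtf")` for the annotation (both file names are placeholders).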
I was able to get most of the missing gene symbols using the "AnnotationDbi" R package in combination with bioDBnet (https://biodbnet-abcc.ncifcrf.gov/db/db2db.php), but still had to add 2 gene symbols manually.
Since each RefSeq ID seems to correspond to exactly one gene symbol, and the reference transcriptome sits inside the hg19 path of the UCSC download page, I fail to understand why so many RefSeq IDs do not have a corresponding gene symbol in the GTF file.
- Does it have something to do with the fact that, according to the download page, the refMrna.fa.gz file is updated weekly?
- Is the refMrna.fa.gz file even the reference transcriptome corresponding to the hg19 reference genome?
- And if so, why are there so many transcript IDs in the file that I cannot map to gene symbols, either with the hg19 GTF file or by other means of annotation?
I am concerned about this because it leads to the following problems:
- I cannot directly compare gene counts obtained by alignment with gene counts inferred from salmon transcript counts, because the genes are not made up of the same transcripts in the genome annotation and in the reference transcriptome.
- For many of the genes obtained by salmon I cannot find a suitable GO-annotation for downstream analysis.
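To make the first problem concrete, this is roughly how gene-level counts could be derived from salmon output by summing transcript counts per gene. It is a minimal sketch under assumptions: it reads the `Name` and `NumReads` columns of salmon's `quant.sf`, the `tx2gene` dictionary (transcript ID to gene symbol) is assumed to come from whatever annotation is available, and the function name is made up. Reads from transcripts without a gene symbol are tallied separately, which shows how much signal is lost to the mapping gap.

```python
import csv
from collections import defaultdict


def gene_counts_from_quant(quant_lines, tx2gene):
    """Sum salmon NumReads per gene.

    quant_lines: lines of a quant.sf file (tab-separated, with header).
    tx2gene: dict mapping transcript ID -> gene symbol (assumed input).
    Returns (per-gene totals, reads from transcripts with no gene symbol).
    """
    totals = defaultdict(float)
    unassigned = 0.0
    reader = csv.DictReader(quant_lines, delimiter="\t")
    for row in reader:
        # Drop a version suffix if present so IDs match the annotation.
        tx = row["Name"].split(".")[0]
        reads = float(row["NumReads"])
        gene = tx2gene.get(tx)
        if gene is None:
            unassigned += reads
        else:
            totals[gene] += reads
    return dict(totals), unassigned
```

Any transcript that is present in refMrna.fa.gz but absent from the GTF ends up in the unassigned total here, whereas the alignment-based gene counts never see it at all, so the two sets of gene counts are not built from the same transcripts.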
After extensive research online I could not find a plausible explanation for this issue. I would be really thankful for any help or for someone pointing me in the right direction!
Thanks a lot, Stefan