Hi!
I have been trying to understand the confusing world of genome annotations with the goal of consistently using the same version throughout all analyses for one project.
Initially I was using UCSC annotations by simply loading the corresponding library in R:
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
This is where various questions started appearing: I thought that these were Ensembl-style annotations of the Ensembl reference genome. I then read somewhere, however, that these are UCSC's annotations of the Ensembl genome sequence. Is this correct?
Also, I noticed that the above TxDb object is not based on the newest release, so I decided to look into creating my own TxDb objects. Since I used Ensembl's GRCm38 release 99 for the alignment of the data, I would like to use the same annotation version for downstream analyses. For this, I found two options: using Biomart or Ensembl as the resource. Considering that the default host URL of Biomart is ensembl.org, I would expect the two calls below to yield the same result. Is this correct?
# from Biomart
BiomartTxDb <- makeTxDbFromBiomart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
# from Ensembl
EnsemblTxDb <- makeTxDbFromEnsembl(organism = "Mus musculus", release = 99)
One difference I noticed is that makeTxDbFromBiomart uses the most recent release (release 100). Is there an option to specify the release? I couldn't find one, but maybe I overlooked something.
I am also wondering about different naming conventions:
seqlevelsStyle(EnsemblTxDb) <- "UCSC"
If I change the naming of my EnsemblTxDb object to UCSC, would this simply change "1" to "chr1" without affecting the underlying positioning info? Or do the coordinates also get converted from 0-based to 1-based?
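To make the question concrete, here is a minimal sketch of the check I have in mind. It uses a small made-up GRanges object (toy_ranges is hypothetical, not taken from my TxDb) so that it runs without downloading anything, and it only compares the start and end positions before and after switching the style. I am assuming that a toy GRanges behaves the same way as a TxDb in this respect, which may not be true.

```r
library(GenomicRanges)
library(GenomeInfoDb)

# Hypothetical toy ranges with Ensembl/NCBI-style sequence names
toy_ranges <- GRanges(
  seqnames = c("1", "2", "X"),
  ranges   = IRanges(start = c(100, 200, 300), width = 50)
)

# Keep a copy of the coordinates before renaming
ranges_before <- ranges(toy_ranges)

# Switch the naming style, as in the call on my TxDb object
seqlevelsStyle(toy_ranges) <- "UCSC"

# Inspect the new sequence names
print(seqlevels(toy_ranges))

# TRUE here would mean that only the names changed, not the positions
coordinates_unchanged <- identical(ranges_before, ranges(toy_ranges))
print(coordinates_unchanged)
```

If coordinates_unchanged is TRUE, I would read that as the style switch being a pure renaming, but I would still like confirmation that the same holds for TxDb objects.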
I find all of this very confusing and would like to get some input to be sure that I am understanding everything correctly.