Is anyone aware whether pre-ensembl information is available through a biomart query (I often use the biomaRt package in R to retrieve information)?
Specifically, I'm interested in using biomaRt to retrieve information on the crab-eating macaque, M. fascicularis. But I've been unable to find a "preensembl_" attribute in the long list of attributes, and I'm unsure whether it's simply not possible or I'm looking in the wrong place.
This genome is listed in pre-ensembl (Mfac5.0 pre-ensembl), and the files can be downloaded, but I'm trying to avoid the extra processing of these files.
Another point regarding the biomaRt package in R: the code for accessing a particular mart dataset is shown in the snippet below; note that the dataset specified in this case is 'hsapiens_gene_ensembl'.
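library(biomaRt)
mart <- useMart("ensembl")                         # connect to the Ensembl genes mart
datasets <- listDatasets(mart)                     # data frame of available datasets
mart <- useDataset("hsapiens_gene_ensembl", mart)  # select the human gene dataset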
To the best of my knowledge, pre-ensembl datasets are not available this way, but I'm wondering whether they are and I'm just using the wrong nomenclature (e.g., 'mfascicularis_gene_ensembl' should be 'mfac_gene_preensembl')?
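One way to check the naming question is to search the dataset listing instead of guessing identifiers. The sketch below is only an illustration: `find_datasets` is a hypothetical helper (not part of biomaRt), and it assumes the data frame returned by `listDatasets()` has `dataset` and `description` columns.

```r
# Hypothetical helper (not part of biomaRt): filter a listDatasets()-style
# data frame by a case-insensitive pattern in the name or description.
find_datasets <- function(datasets, pattern) {
  hit <- grepl(pattern, datasets$dataset, ignore.case = TRUE) |
    grepl(pattern, datasets$description, ignore.case = TRUE)
  datasets[hit, , drop = FALSE]
}

# Usage against the live Ensembl mart (needs biomaRt and network access):
# library(biomaRt)
# datasets <- listDatasets(useMart("ensembl"))
# find_datasets(datasets, "fascicularis|macaque")
```

If the search returns zero rows, the species is probably not in the mart under any name, rather than hidden behind a different naming scheme.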
As an alternative, I may just download the gff3 file from NCBI (Mfac5.0 gff) and use the makeTxDbFromGFF function in GenomicFeatures to make the TxDb object:
library(GenomicFeatures)
# 'file' holds the path to the downloaded Mfac5.0 GFF3 file;
# arguments not listed here are left at their defaults
txdb <- makeTxDbFromGFF(file,
                        format = "gff3",
                        dataSource = NA,
                        organism = NA,
                        taxonomyId = NA)
This works nicely, but unfortunately deviates from all of my other code for several other genomes I'm looking at.
> genes(txdb)
GRanges object with 32733 ranges and 1 metadata column:
             seqnames                 ranges strand |     gene_id
                <Rle>              <IRanges>  <Rle> | <character>
     A1BG NC_022290.1 [ 58937416,  58951837]      - |        A1BG
     A1CF NC_022280.1 [ 85376919,  85454067]      + |        A1CF
    A2ML1 NC_022282.1 [  9098456,   9153857]      + |       A2ML1
  A3GALT2 NC_022272.1 [194630652, 194644767]      + |     A3GALT2
   A4GALT NC_022281.1 [  8344430,   8348390]      + |      A4GALT
      ...         ...                    ...    ... .         ...
   ZYG11A NC_022272.1 [174567376, 174637485]      - |      ZYG11A
   ZYG11B NC_022272.1 [174660062, 174713613]      - |      ZYG11B
      ZYX NC_022274.1 [176564635, 176574909]      + |         ZYX
    ZZEF1 NC_022287.1 [  3946682,   4101428]      - |       ZZEF1
     ZZZ3 NC_022272.1 [149532207, 149664372]      + |        ZZZ3
  -------
  seqinfo: 655 sequences from an unspecified genome; no seqlengths
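To keep downstream code closer to the biomaRt-based scripts, the gene ranges could be reshaped into a table with biomaRt-style column names. This is only a sketch: `as_biomart_style` is a hypothetical helper, and it assumes its input is `as.data.frame(genes(txdb))`, which has `seqnames`, `start`, `end`, `strand` and `gene_id` columns. Treating `gene_id` as a gene symbol is also an assumption, based on the output above.

```r
# Hypothetical helper: rename the columns of as.data.frame(genes(txdb)) to the
# names used by the corresponding Ensembl BioMart attributes.
as_biomart_style <- function(gene_df) {
  data.frame(
    external_gene_name = as.character(gene_df$gene_id),
    chromosome_name    = as.character(gene_df$seqnames),
    start_position     = gene_df$start,
    end_position       = gene_df$end,
    # BioMart encodes strand as 1 / -1; anything other than "-" maps to 1 here
    strand             = ifelse(gene_df$strand == "-", -1L, 1L),
    stringsAsFactors   = FALSE
  )
}

# Usage (requires the txdb object built above):
# gene_table <- as_biomart_style(as.data.frame(genes(txdb)))
```

Note that the chromosome names remain RefSeq accessions (e.g. NC_022290.1), so they would still need mapping if the other genomes use Ensembl-style names.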
We don't have BioMart for genomes in Pre. These are genomes that have not yet been fully processed and do not have the full file structure, which means we don't have the BioMart tables for them. Crab-eating macaque is scheduled to appear in the next Ensembl release, due in December.
I took a look at all available datasets under ensembl using
listDatasets(useMart("ensembl"))
and found Macaca mulatta, but this is not the exact species you need. I have never heard of pre-ensembl datasets being made available through biomaRt. You may have to go down the manual route and build your own function in R that accepts M. fascicularis gene IDs and returns whatever you want it to return. I presume that you downloaded the GTF here: ftp://ftp.ensembl.org/pub/pre/gtf/macaca_fascicularis/ ?
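As a rough illustration of the manual route, the base R sketch below reads a GTF file, collapses its rows into one span per gene, and looks genes up by ID. All three function names are hypothetical. It assumes the standard nine-column, tab-separated GTF layout with a `gene_id "..."` entry in the attribute column; the actual contents of the pre-ensembl files have not been checked here.

```r
# Hypothetical helpers for a manual lookup built from a GTF file (base R only).

# Read a GTF file and pull the gene_id out of the attribute column.
read_gtf <- function(path) {
  gtf <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                    stringsAsFactors = FALSE,
                    col.names = c("seqname", "source", "feature", "start",
                                  "end", "score", "strand", "frame",
                                  "attribute"))
  # Assumes every row carries a gene_id "..." attribute
  gtf$gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", gtf$attribute)
  gtf
}

# Collapse all rows of a gene into one span (min start, max end).
gene_spans <- function(gtf) {
  starts <- tapply(gtf$start, gtf$gene_id, min)
  ends <- tapply(gtf$end, gtf$gene_id, max)
  idx <- match(names(starts), gtf$gene_id)  # first row of each gene
  data.frame(gene_id = names(starts),
             seqname = gtf$seqname[idx],
             start = as.vector(starts),
             end = as.vector(ends),
             strand = gtf$strand[idx],
             stringsAsFactors = FALSE)
}

# Return the span rows for the requested IDs, in the order requested;
# unknown IDs come back as rows of NA.
lookup_genes <- function(spans, ids) {
  spans[match(ids, spans$gene_id), , drop = FALSE]
}

# Usage (the path is a placeholder for the downloaded file):
# spans <- gene_spans(read_gtf("path/to/annotation.gtf"))
# lookup_genes(spans, c("some_gene_id"))
```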