There are many ways to do this. Apart from biomaRt, you can use the Mus musculus TxDb object:
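> if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
> BiocManager::install("TxDb.Mmusculus.UCSC.mm10.ensGene")  # biocLite() is deprecated
> library(TxDb.Mmusculus.UCSC.mm10.ensGene)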
> mygenes = c("ENSMUSG00000029847",
"ENSMUSG00000085236", "ENSMUSG00000063364","ENSMUSG00000085247",
"ENSMUSG00000072893","ENSMUSG00000018800","ENSMUSG00000020865",
"ENSMUSG00000023832","ENSMUSG00000045730","ENSMUSG00000007827",
"ENSMUSG00000021950","ENSMUSG00000071847","ENSMUSG00000025154",
"ENSMUSG00000047446","ENSMUSG00000026628","ENSMUSG00000029673")
> mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)
> mygenes.transcripts
GRanges object with 62 ranges and 3 metadata columns:
seqnames ranges strand | tx_id tx_name gene_id
<Rle> <IRanges> <Rle> | <integer> <character> <CharacterList>
[1] chr1 [191170296, 191183340] - | 4925 ENSMUST00000027941 ENSMUSG00000026628
[2] chr1 [191171425, 191183108] - | 4926 ENSMUST00000131854 ENSMUSG00000026628
[3] chr2 [135169573, 135215616] - | 13731 ENSMUST00000138303 ENSMUSG00000085247
[4] chr2 [150310935, 150362765] - | 13893 ENSMUST00000051153 ENSMUSG00000063364
[5] chr2 [150310937, 150362733] - | 13894 ENSMUST00000124945 ENSMUSG00000063364
... ... ... ... ... ... ... ...
[58] chr18 [62177817, 62179959] - | 86501 ENSMUST00000053640 ENSMUSG00000045730
[59] chr19 [41766588, 41802047] - | 88750 ENSMUST00000026150 ENSMUSG00000025154
[60] chr19 [41766591, 41802084] - | 88751 ENSMUST00000163265 ENSMUSG00000025154
[61] chr19 [41769800, 41781336] - | 88752 ENSMUST00000176266 ENSMUSG00000025154
[62] chr19 [41769994, 41802047] - | 88753 ENSMUST00000177495 ENSMUSG00000025154
This creates a GRanges object called mygenes.transcripts, from which you can access the coordinates of all transcripts of each gene. If you want to avoid complications and prefer to have just one coordinate per gene, use genes() instead of transcripts().
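As a rough sketch of the genes() variant, assuming the TxDb package above is installed: the object names all.genes and mygenes.ranges are made up for illustration, and only two of the IDs from the list are used to keep it short. Note that genes() returns one range per gene spanning all of its transcripts, and by default skips genes annotated on more than one chromosome or strand.

```r
library(GenomicRanges)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)

# Two of the gene IDs from the list above
mygenes <- c("ENSMUSG00000029847", "ENSMUSG00000085236")

# One range per gene; gene_id is a metadata column of the result
all.genes <- genes(TxDb.Mmusculus.UCSC.mm10.ensGene)

# Keep only the genes of interest (illustrative object name)
mygenes.ranges <- all.genes[all.genes$gene_id %in% mygenes]
mygenes.ranges
```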
To get the TSS, just resize the object to one base:
> mygenes.tss = resize(mygenes.transcripts, width=1, fix='start')
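To see what resize() does on each strand, here is a minimal sketch on a toy GRanges object. The coordinates, the names txA and txB, and the tss.df table layout are all made up for illustration. The point is that resize() is strand-aware, so fix="start" picks the 5' end: the lowest coordinate on the plus strand and the highest coordinate on the minus strand.

```r
library(GenomicRanges)

# Toy transcripts with hypothetical coordinates, one on each strand
tx <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges   = IRanges(start = c(100, 500), end = c(200, 900)),
  strand   = c("+", "-"),
  tx_name  = c("txA", "txB")
)

# fix = "start" anchors at the 5' end, taking the strand into account
tss <- resize(tx, width = 1, fix = "start")

# Flatten to a plain table (illustrative column names)
tss.df <- data.frame(
  chr    = as.character(seqnames(tss)),
  tss    = start(tss),
  strand = as.character(strand(tss)),
  tx     = tss$tx_name
)
tss.df
```

Here txA gets a TSS at 100 and txB at 900. The same data.frame() call should work on mygenes.tss if you need a plain table rather than a GRanges object.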
Could you provide the format, please?
I am using this site: https://genome.ucsc.edu/cgi-bin/hgTables