I want to extract the genomic locations of all promoters for mm9, together with the corresponding transcript ID, gene ID, and symbol. However, I have found that there are duplicate ranges, and sometimes two promoters correspond to one gene.
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
mm9 <- TxDb.Mmusculus.UCSC.mm9.knownGene
promoter <- promoters(mm9)
> head(promoter)
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 [4795974, 4798173] + | 1 uc007afg.1
[2] chr1 [4845775, 4847974] + | 3 uc007afi.2
[3] chr1 [4846409, 4848608] + | 5 uc011whu.1
So, here are the duplicate ranges. What is the reason for these duplicate ranges/promoters?
Then I need the corresponding gene ID/symbol for each promoter:
promoter = unique(promoter)
gene_id_promoter = select(mm9, keys=as.character(promoter$tx_id), columns = c("TXNAME","GENEID"), keytype = "TXID")
> head(gene_id_promoter)
TXID GENEID TXNAME
1 1 18777 uc007afg.1
2 3 21399 uc007afi.2
3 5 21399 uc011whu.1
4 6 108664 uc007afm.1
5 8 18387 uc007afo.1
6 10 18387 uc007afq.1
Different transcripts of a gene have the same gene ID. But how is it possible that one gene has two promoters? Basically, two promoters (uc007afi.2, uc011whu.1) correspond to one gene ID (21399) and to two different transcripts of that gene. So I took another look at my ranges:
[2] chr1 [4845775, 4847974] + | 3 uc007afi.2
[3] chr1 [4846409, 4848608] + | 5 uc011whu.1
The range of uc007afi.2 overlaps the range of uc011whu.1. How can this be explained? I have two promoters corresponding to one gene and two transcripts, but one overlaps the other. Is the reason that the definition of a promoter region is not exact? Which region should I take as the promoter region for gene 21399?
Actually, many genes do have alternative promoters. Your gene also seems to have one.
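To see why the two ranges overlap, it may help to redo the arithmetic by hand. promoters() returns one fixed window per transcript around its transcription start site (TSS), by default 2000 bp upstream and 200 bp downstream, so two transcripts whose TSSs lie close together get overlapping windows. The sketch below uses base R only. The TSS values are back-calculated from the ranges you posted (start + 2000 on the + strand), not looked up independently, and the variable names are just for illustration.

```r
# TSS positions inferred from the promoter ranges in the question
# (assumption: start + 2000 on the + strand).
tss <- c(uc007afi.2 = 4847775, uc011whu.1 = 4848409)

upstream   <- 2000  # promoters() default
downstream <- 200   # promoters() default

# Plus-strand promoter window: [TSS - upstream, TSS + downstream - 1]
win <- data.frame(
  tx_name = names(tss),
  start   = unname(tss) - upstream,
  end     = unname(tss) + downstream - 1
)

# Distance between the two transcription start sites
tss_gap <- unname(diff(tss))

# Number of bases shared by the two windows
overlap_bp <- min(win$end) - max(win$start) + 1

# One possible per-gene region: the union of the overlapping windows
gene_window <- c(start = min(win$start), end = max(win$end))

print(win)
print(tss_gap)
print(overlap_bp)
print(gene_window)
```

The two start sites are only a few hundred bases apart, which is why the 2200 bp windows overlap. Whether you merge the windows into one region per gene, as in gene_window above, or keep one window per transcript depends on your analysis. You can also change the window size through the upstream and downstream arguments of promoters().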