Hi all,
I need to get the promoter coordinates of all genes in the human genome (hg19 assembly).
Is it possible to get them from the UCSC Table Browser? I tried, but without success.
Could someone help me with that?
This should be a simple question, but in practice there are many approaches, because there are multiple definitions of a promoter.
The simplest way is to go to the hg19 folder of the UCSC FTP site and download the upstream1000.fa.gz file, which contains the sequence of the 1000 bases upstream of the annotated transcription start of each RefSeq gene.
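Since the question asks for coordinates rather than sequence, the positions can be read from the FASTA header lines. This is only a sketch: it assumes each header looks like `>NM_032291_up_1000_chr1_66998825_f chr1:66998825-66999824` (name, a space, then a 1-based `chrom:start-end`, with `_f`/`_r` marking the strand), so check a few real headers with `zcat upstream1000.fa.gz | grep '>' | head` first. The function name `headers_to_bed` is made up for illustration.

```shell
# headers_to_bed: read FASTA on stdin, write BED6 (0-based start) on stdout.
# Assumed header layout: ">NAME_up_1000_chrN_POS_f chrN:START-END"
headers_to_bed() {
  awk 'BEGIN { OFS = "\t" }
    /^>/ {
      name = substr($1, 2)                    # drop the leading ">"
      strand = (name ~ /_r$/) ? "-" : "+"     # assumed: _f forward, _r reverse
      split($2, a, /[:-]/)                    # a[1]=chrom, a[2]=start, a[3]=end
      print a[1], a[2] - 1, a[3], name, 0, strand
    }'
}

# Usage (file name as in the answer above):
# zcat upstream1000.fa.gz | headers_to_bed > upstream1000.bed
```

The start is shifted by one because BED intervals are 0-based and half-open, while the range in the header is assumed to be 1-based and inclusive.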
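If you are familiar with R, you can do it with the AnnotationHub interface from Bioconductor. For more information, follow the tutorial here. In particular, this code is based on this video. Note that promoters() by default returns the region from 2000 bp upstream to 200 bp downstream of each transcription start site; pass the upstream= and downstream= arguments to change that.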
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install(c("GenomicRanges", "AnnotationHub", "rtracklayer"))
> library("GenomicRanges")
> library("AnnotationHub")
> ahub = AnnotationHub()
> qhs = query(ahub, c("RefSeq", "Homo sapiens", "hg19"))
> genes = qhs[[1]]
> proms = promoters(genes)
> proms
UCSC track 'refGene'
UCSCData object with 50066 ranges and 5 metadata columns:
seqnames ranges strand | name score itemRgb thick
<Rle> <IRanges> <Rle> | <character> <numeric> <character> <IRanges>
[1] chr1 [66997825, 67000024] + | NM_032291 0 <NA> [67000042, 67208778]
[2] chr1 [ 8376145, 8378344] + | NM_001080397 0 <NA> [ 8378169, 8404073]
[3] chr1 [50489427, 50491626] - | NM_032785 0 <NA> [48999845, 50489468]
[4] chr1 [16765167, 16767366] + | NM_001145277 0 <NA> [16767257, 16785491]
[5] chr1 [16765167, 16767366] + | NM_001145278 0 <NA> [16767257, 16785385]
... ... ... ... ... ... ... ... ...
[50062] chr19_gl000209_random [ 55209, 57408] + | NM_002255 0 <NA> [ 57249, 67717]
[50063] chr19_gl000209_random [ 44646, 46845] + | NM_001258383 0 <NA> [ 57132, 67717]
[50064] chr19_gl000209_random [ 96135, 98334] + | NM_012313 0 <NA> [ 98146, 112480]
[50065] chr19_gl000209_random [ 68071, 70270] + | NM_001083539 0 <NA> [ 70108, 83979]
[50066] chr19_gl000209_random [129433, 131632] + | NM_012312 0 <NA> [131468, 145120]
blocks
<IRangesList>
[1] [ 1, 227] [91706, 91769] [98929, 98953] ...
[2] [ 1, 102] [6222, 6642] [7214, 7306] ...
[3] [ 1, 1439] [2036, 2062] [6788, 6884] ...
[4] [ 1, 182] [2961, 3061] [7199, 7303] ...
[5] [ 1, 104] [2961, 3061] [7199, 7303] ...
... ...
[50062] [ 1, 80] [ 280, 315] [1182, 1466] ...
[50063] [ 1, 86] [10414, 10643] [10843, 10878] ...
[50064] [ 1, 46] [1523, 1557] [4002, 4301] ...
[50065] [ 1, 71] [1071, 1106] [1851, 2135] ...
[50066] [ 1, 69] [ 862, 897] [3334, 3633] ...
You should be able to download a promoter table from the UCSC browser. Alternatively, you can download the gene coordinates in .bed format and define the promoter yourself as a window upstream and downstream of the transcription start site; 500 or 1000 nucleotides on each side is a common choice.
upstream=500
downstream=500
awk -v up="$upstream" -v down="$downstream" 'BEGIN { OFS = "\t" }
  $6 == "+" { s = $2 - up;   e = $2 + down }   # plus strand: TSS is the BED start
  $6 == "-" { s = $3 - down; e = $3 + up }     # minus strand: TSS is the BED end, upstream lies to the right
  $6 == "+" || $6 == "-" { if (s < 0) s = 0; $2 = s; $3 = e; print }' ucscRefseq.bed > promoter.bed
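To see how strand changes the window, here is a small sketch run on two invented BED6 records (the coordinates and gene names are made up for illustration). The helper name `promoter_window` and its two arguments are hypothetical. On the plus strand the transcription start is column 2; on the minus strand it is column 3, so "upstream" means larger coordinates.

```shell
# promoter_window UP DOWN: read BED6+ on stdin, print promoter BED6 on stdout.
promoter_window() {
  awk -v up="$1" -v down="$2" 'BEGIN { OFS = "\t" }
    $6 == "+" { s = $2 - up;   e = $2 + down }   # TSS = start column
    $6 == "-" { s = $3 - down; e = $3 + up }     # TSS = end column
    $6 == "+" || $6 == "-" {
      if (s < 0) s = 0                           # clamp at the chromosome start
      print $1, s, e, $4, $5, $6
    }'
}

# Two invented records, one per strand, with 1000 bp upstream and 200 bp downstream.
printf 'chr1\t10000\t20000\tgeneA\t0\t+\nchr1\t30000\t40000\tgeneB\t0\t-\n' |
  promoter_window 1000 200
```

For geneA this gives the window 9000 to 10200, and for geneB 39800 to 41000. The sketch does not clamp at the chromosome end, since that needs a table of chromosome sizes.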
I tried the awk script and got the following error:
awk: cmd. line:1:
^ unexpected newline or end of string
Any ideas? Thanks!
There was one closing bracket missing; I have edited it, so try it now.