Question

How to get promoter coordinates for hg19 from the UCSC Genome Browser?

2


9.2 years ago

jack ▴ 980

Hi all,

I need the promoter coordinates of all genes in the human genome (hg19 assembly).

Is it possible to get them from the UCSC Table Browser? I tried, but without success.

Could someone help me with that?

gene Assembly genomics • 10k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by jack ▴ 980

Ram · Answer 1 · 2015-09-21

This should be a simple question, but in reality there are many approaches because there are multiple definitions of promoters.

The simplest way is to go to the hg19 folder of the UCSC FTP site and download the upstream1000.fa.gz file, which contains the sequences 1000 bases upstream of the annotated transcription starts of RefSeq genes.
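Since the question asks for coordinates rather than sequences, the FASTA headers of that file can be converted to BED. The sketch below is a rough illustration, not a tested recipe for the real file. It assumes each header looks like `>NM_032291_up_1000_chr1_66998825_f chr1:66998825-66999824`, with `_f`/`_r` marking the strand and a 1-based inclusive range. The function name `upstream_fasta_to_bed` is hypothetical, and the download URL in the usage comment should be checked before relying on it.

```shell
# Hypothetical helper: turn upstream1000-style FASTA headers (stdin) into BED6 lines.
# Assumed header layout: >ACCESSION_up_1000_CHROM_START_f|r CHROM:START-END (1-based, inclusive)
upstream_fasta_to_bed() {
  awk 'BEGIN { OFS = "\t" }
       /^>/ {
         name = substr($1, 2)                  # record ID without the leading ">"
         strand = (name ~ /_r$/) ? "-" : "+"   # assumed: _f = forward, _r = reverse
         sub(/_up_.*/, "", name)               # keep only the transcript accession
         split($2, pos, /[:-]/)                # pos[1]=chrom, pos[2]=start, pos[3]=end
         print pos[1], pos[2] - 1, pos[3], name, 0, strand   # BED start is 0-based
       }'
}

# Example usage (URL is an assumption about the UCSC download layout):
# curl -s https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/upstream1000.fa.gz \
#   | gunzip -c | upstream_fasta_to_bed > upstream1000.bed
```

Sequence lines are ignored, so the function can read the decompressed file directly.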

If you are familiar with R, you can do it using the brand new AnnotationHub interface from Bioconductor. For more information, follow the tutorial here. In particular, this code is based on this video.

> # Install the Bioconductor packages (current releases use BiocManager rather than the biocLite script)
> install.packages("BiocManager")
> BiocManager::install(c("GenomicRanges", "AnnotationHub", "rtracklayer"))
> library("GenomicRanges")
> library("AnnotationHub")
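> ahub = AnnotationHub()  # create the hub object that query() searches
> qhs = query(ahub, c("RefSeq", "Homo sapiens", "hg19"))
> genes = qhs[[1]]  # first matching record: the UCSC RefSeq gene track
> proms = promoters(genes)  # defaults: 2000 bp upstream and 200 bp downstream of each TSS
> proms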

UCSC track 'refGene'
UCSCData object with 50066 ranges and 5 metadata columns:
                       seqnames               ranges strand   |         name     score     itemRgb                thick
                          <Rle>            <IRanges>  <Rle>   |  <character> <numeric> <character>            <IRanges>
      [1]                  chr1 [66997825, 67000024]      +   |    NM_032291         0        <NA> [67000042, 67208778]
      [2]                  chr1 [ 8376145,  8378344]      +   | NM_001080397         0        <NA> [ 8378169,  8404073]
      [3]                  chr1 [50489427, 50491626]      -   |    NM_032785         0        <NA> [48999845, 50489468]
      [4]                  chr1 [16765167, 16767366]      +   | NM_001145277         0        <NA> [16767257, 16785491]
      [5]                  chr1 [16765167, 16767366]      +   | NM_001145278         0        <NA> [16767257, 16785385]
      ...                   ...                  ...    ... ...          ...       ...         ...                  ...
  [50062] chr19_gl000209_random     [ 55209,  57408]      +   |    NM_002255         0        <NA>     [ 57249,  67717]
  [50063] chr19_gl000209_random     [ 44646,  46845]      +   | NM_001258383         0        <NA>     [ 57132,  67717]
  [50064] chr19_gl000209_random     [ 96135,  98334]      +   |    NM_012313         0        <NA>     [ 98146, 112480]
  [50065] chr19_gl000209_random     [ 68071,  70270]      +   | NM_001083539         0        <NA>     [ 70108,  83979]
  [50066] chr19_gl000209_random     [129433, 131632]      +   |    NM_012312         0        <NA>     [131468, 145120]
                                                    blocks
                                             <IRangesList>
      [1] [    1,   227] [91706, 91769] [98929, 98953] ...
      [2]       [   1,  102] [6222, 6642] [7214, 7306] ...
      [3]       [   1, 1439] [2036, 2062] [6788, 6884] ...
      [4]       [   1,  182] [2961, 3061] [7199, 7303] ...
      [5]       [   1,  104] [2961, 3061] [7199, 7303] ...
      ...                                              ...
  [50062]       [   1,   80] [ 280,  315] [1182, 1466] ...
  [50063] [    1,    86] [10414, 10643] [10843, 10878] ...
  [50064]       [   1,   46] [1523, 1557] [4002, 4301] ...
  [50065]       [   1,   71] [1071, 1106] [1851, 2135] ...
  [50066]       [   1,   69] [ 862,  897] [3334, 3633] ...
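Every range in this output is 2200 bp wide, which matches the default window of `promoters()` in GenomicRanges (2000 bp upstream, 200 bp downstream of the transcription start site). As a small sketch of that arithmetic in 1-based inclusive coordinates, the hypothetical shell function below recomputes two of the rows. The TSS values are not taken from the annotation itself; they are back-calculated from rows [1] and [3] above, so treat them as illustrative.

```shell
# Hypothetical helper: promoter_window TSS STRAND [UPSTREAM] [DOWNSTREAM]
# Coordinates are 1-based and inclusive, as in the GRanges printout.
promoter_window() {
  local tss=$1 strand=$2 up=${3:-2000} down=${4:-200}
  if [ "$strand" = "-" ]; then
    # minus strand: upstream lies at higher coordinates
    echo "$(( tss - down + 1 )) $(( tss + up ))"
  else
    # plus strand: upstream lies at lower coordinates
    echo "$(( tss - up )) $(( tss + down - 1 ))"
  fi
}

promoter_window 66999825 +   # row [1], NM_032291: prints 66997825 67000024
promoter_window 50489626 -   # row [3], NM_032785: prints 50489427 50491626
```

Passing other values as the third and fourth arguments gives a different promoter definition, for example `promoter_window 66999825 + 1000 0` for 1000 bp strictly upstream.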

score 3 · Answer 2 · 2015-09-21

3


9.2 years ago

Chirag Nepal ★ 2.4k

You should be able to download a promoter table from the UCSC browser. Alternatively, you can download the gene coordinates in .bed format and define the promoter yourself as a window around each transcription start site, typically 500 or 1000 nucleotides upstream and downstream.

upstream=500

downstream=500

awk -v up="$upstream" -v down="$downstream" 'BEGIN { OFS = "\t" } { if ($6 == "+") { s = $2 - up; e = $2 + down } else if ($6 == "-") { s = $3 - down; e = $3 + up } else next; if (s < 0) s = 0; print $1, s, e, $4, $5, $6, $7, $8, $9, $10, $11, $12 }' ucscRefseq.bed > promoter.bed
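Before using the resulting file, it may be worth a quick sanity check, because subtracting a window from a start coordinate can produce negative values for genes close to the beginning of a chromosome or scaffold. The function below is only a sketch; the name `summarise_promoters` is hypothetical, and it assumes the strand is in column 6, as in the BED file above.

```shell
# Hypothetical helper: summarise a promoter BED file read from stdin.
summarise_promoters() {
  awk '{
         width = $3 - $2
         if ($2 < 0) negative++                      # window starts before position 0
         if (NR == 1 || width < min) min = width
         if (width > max) max = width
         strand[$6]++                                # count records per strand
       }
       END {
         printf "records=%d plus=%d minus=%d min_width=%d max_width=%d negative_starts=%d\n",
                NR, strand["+"], strand["-"], min, max, negative
       }'
}

# Example usage: summarise_promoters < promoter.bed
```

With symmetric 500 bp windows, every width should be 1000; a non-zero `negative_starts` count means some windows need clamping to 0 before tools such as bedtools will accept the file.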

ADD COMMENT • link 7.6 years ago by Chirag Nepal ★ 2.4k

0


Trying the awk script and getting the following error:

awk: cmd. line:1:
^ unexpected newline or end of string

Any ideas? Thanks!

ADD REPLY • link 7.6 years ago by rbronste ▴ 420

0


One closing bracket was missing. I have edited the answer, so please try it again.

ADD REPLY • link 7.6 years ago by Chirag Nepal ★ 2.4k