How to get hg19 promoter coordinates from the UCSC Genome Browser?
9.2 years ago
jack ▴ 980

Hi all,

I need to get the promoter coordinates of all genes in the human genome, using the hg19 assembly.

Is it possible to get them from the UCSC tables? I tried, but I was not successful.

Could someone help me with that?

gene Assembly genomics • 10k views
9.2 years ago

This should be a simple question, but in reality there are many approaches because there are multiple definitions of promoters.

The simplest way to do it is to go to the hg19 folder of the UCSC FTP site and download the upstream1000.fa.gz file, which contains the 1000 bases upstream of each annotated transcription start site, i.e. the promoter sequences for all the human genes.
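Since the question asks for coordinates rather than sequences, here is a minimal sketch of pulling them out of the FASTA headers. It assumes each header looks like `>ACCESSION_up_1000_CHROM_START_f CHROM:START-END` (with `_f`/`_r` marking the strand and 1-based coordinates); that layout is an assumption, so check it first with `zcat upstream1000.fa.gz | grep '>' | head`. The file names and the two sample records below are made up for illustration.

```shell
# Made-up sample records in the header layout the real file is ASSUMED to use.
printf '%s\n' \
  '>NM_032291_up_1000_chr1_66998825_f chr1:66998825-66999824' \
  'ACGT' \
  '>NM_032785_up_1000_chr1_50489627_r chr1:50489627-50490626' \
  'TTGA' > sample_upstream.fa

# Keep the header lines only and turn "chrom:start-end" (assumed 1-based)
# into 0-based, half-open BED6 records.
grep '^>' sample_upstream.fa | awk 'BEGIN { OFS = "\t" } {
    name = substr($1, 2)                    # drop the leading ">"
    strand = (name ~ /_f$/) ? "+" : "-"     # assumed suffix: _f forward, _r reverse
    sub(/_up_.*/, "", name)                 # keep only the accession
    split($2, loc, /[-:]/)                  # loc[1]=chrom, loc[2]=start, loc[3]=end
    print loc[1], loc[2] - 1, loc[3], name, 0, strand
}' > sample_promoters.bed

cat sample_promoters.bed
```

For the real file, replace the `grep` input with `zcat upstream1000.fa.gz | grep '^>'`.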

If you are familiar with R, you can do it using the brand new AnnotationHub interface from Bioconductor. For more information, follow the tutorial here. In particular, this code is based on this video.

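> # biocLite() has been replaced by BiocManager::install() in current Bioconductor releases
> install.packages("BiocManager")
> BiocManager::install(c("GenomicRanges", "AnnotationHub", "rtracklayer"))
> library("GenomicRanges")
> library("AnnotationHub")
> library("rtracklayer")
> ahub = AnnotationHub()
> qhs = query(ahub, c("RefSeq", "Homo sapiens", "hg19"))
> genes = qhs[[1]]
> # promoters() defaults to 2000 bp upstream and 200 bp downstream of the TSS
> proms = promoters(genes)
> proms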

UCSC track 'refGene'
UCSCData object with 50066 ranges and 5 metadata columns:
                       seqnames               ranges strand   |         name     score     itemRgb                thick
                          <Rle>            <IRanges>  <Rle>   |  <character> <numeric> <character>            <IRanges>
      [1]                  chr1 [66997825, 67000024]      +   |    NM_032291         0        <NA> [67000042, 67208778]
      [2]                  chr1 [ 8376145,  8378344]      +   | NM_001080397         0        <NA> [ 8378169,  8404073]
      [3]                  chr1 [50489427, 50491626]      -   |    NM_032785         0        <NA> [48999845, 50489468]
      [4]                  chr1 [16765167, 16767366]      +   | NM_001145277         0        <NA> [16767257, 16785491]
      [5]                  chr1 [16765167, 16767366]      +   | NM_001145278         0        <NA> [16767257, 16785385]
      ...                   ...                  ...    ... ...          ...       ...         ...                  ...
  [50062] chr19_gl000209_random     [ 55209,  57408]      +   |    NM_002255         0        <NA>     [ 57249,  67717]
  [50063] chr19_gl000209_random     [ 44646,  46845]      +   | NM_001258383         0        <NA>     [ 57132,  67717]
  [50064] chr19_gl000209_random     [ 96135,  98334]      +   |    NM_012313         0        <NA>     [ 98146, 112480]
  [50065] chr19_gl000209_random     [ 68071,  70270]      +   | NM_001083539         0        <NA>     [ 70108,  83979]
  [50066] chr19_gl000209_random     [129433, 131632]      +   |    NM_012312         0        <NA>     [131468, 145120]
                                                    blocks
                                             <IRangesList>
      [1] [    1,   227] [91706, 91769] [98929, 98953] ...
      [2]       [   1,  102] [6222, 6642] [7214, 7306] ...
      [3]       [   1, 1439] [2036, 2062] [6788, 6884] ...
      [4]       [   1,  182] [2961, 3061] [7199, 7303] ...
      [5]       [   1,  104] [2961, 3061] [7199, 7303] ...
      ...                                              ...
  [50062]       [   1,   80] [ 280,  315] [1182, 1466] ...
  [50063] [    1,    86] [10414, 10643] [10843, 10878] ...
  [50064]       [   1,   46] [1523, 1557] [4002, 4301] ...
  [50065]       [   1,   71] [1071, 1106] [1851, 2135] ...
  [50066]       [   1,   69] [ 862,  897] [3334, 3633] ...
9.2 years ago
Chirag Nepal ★ 2.4k

You should be able to download the promoter table from the UCSC browser. Alternatively, you can download the gene coordinates in .bed format and define the promoter yourself as a window upstream and downstream of the transcription start site, generally 500 or 1000 nucleotides on each side.

upstream=500

downstream=500

awk -v up="$upstream" -v down="$downstream" 'BEGIN { OFS = "\t" } { if ($6 == "+") { start = $2 - up; end = $2 + down } else if ($6 == "-") { start = $3 - down; end = $3 + up } else next; if (start < 0) start = 0; print $1, start, end, $4, $5, $6, $7, $8, $9, $10, $11, $12 }' ucscRefseq.bed > promoter.bed
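To make the strand handling concrete, here is a small sketch on a toy input, assuming the usual BED convention that the TSS is the start column for "+" genes and the end column for "-" genes. The file names, gene names and window sizes are made up for illustration; the window is deliberately asymmetric so that the direction of "upstream" is visible.

```shell
# Two made-up BED6 records (hypothetical genes), one per strand.
printf 'chr1\t1000\t5000\tgeneA\t0\t+\nchr1\t1000\t5000\tgeneB\t0\t-\n' > toy_genes.bed

upstream=300
downstream=100

awk -v up="$upstream" -v down="$downstream" 'BEGIN { OFS = "\t" } {
    if ($6 == "+") { start = $2 - up;   end = $2 + down }   # TSS is the BED start
    else           { start = $3 - down; end = $3 + up   }   # TSS is the BED end
    if (start < 0) start = 0                                 # clamp at the chromosome start
    print $1, start, end, $4, $5, $6
}' toy_genes.bed > toy_promoters.bed

cat toy_promoters.bed
```

With these values geneA gets the window 700-1100 and geneB gets 4900-5300: on the minus strand, upstream means larger coordinates.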

I tried the awk script and got the following error:

awk: cmd. line:1:
^ unexpected newline or end of string

Any ideas? Thanks!


One closing bracket was missing. I have edited the answer, so try it now.
