Question

How To Get Gene Regions +3Kb Upstream

5

Entering edit mode

12.2 years ago

Stephen 2.8k

Simple question - I need to create a GTF file to use in HTSeq-count that contains gene regions plus 3kb upstream. (Background: doing a MeDIP-seq experiment, want to look for differential methylation in genic and 3kb promoter regions using count based method like edgeR/DESeq).

I was planning on making one myself from the UCSC hg19 refFlat table. The refFlat table has gene coordinates, but I need to extend this 3kb upstream to capture promoter regions.

Column 3 contains the strand (+/-) and columns 4-5 contain the transcription start (txStart) and end (txEnd) positions.

If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct? If I want 3kb upstream of the TSS for genes on the - strand, should I add this 3000 to txEnd?

E.g. the DENND1B gene is on the - strand at chr1:197,473,879-197,744,623. However, looking at it in the browser, it's transcribed "right to left", so I presume I would want to add 3000 to the txEnd number, 197,744,623, even though this is really where transcription starts.

Am I thinking about this correctly?

ucsc gtf • 10k views

ADD COMMENT • link updated 4.1 years ago by Biostar 20 • written 12.2 years ago by Stephen 2.8k

2

Entering edit mode

12.2 years ago

Ido Tamir 5.2k

Not completely if I followed your reasoning.

for upstream region only:
dir    column4               column5 
+      txstart - 3000        txstart  
-      txend                 txend + 3000  

for gene incl upstream region:
dir    column4               column5 
+      txstart - 3000        txend  
-      tstart                txend + 3000

by "adding 3000 to txStart, but is only for genes on the + strand" you get 3kb into the gene.

You have to subtract 3000 to get upstream of the transcription site for + genes.

Can one do tables with this editor? html?

ADD COMMENT • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by Ido Tamir 5.2k

2

Entering edit mode

12.2 years ago

brentp 24k

UCSC always stores the data such that start <= end. So, yes, the actual transcription start is

txStart if strand == "+"

and

txEnd if strand == '-'.

ADD COMMENT • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by brentp 24k

2

Entering edit mode

8.8 years ago

Louis ▴ 160

A solution in R, to answer the original question :

library('GenomicRanges')
library('TxDb.Hsapiens.UCSC.hg19.knownGene')

txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
genes <- as.data.frame(transcripts(txdb_hg19))
colnames(genes)[1:3] <- c("chrom","txStart","txEnd")
tss <- genes$txStart
n.str <- genes$strand == '-'
tss[n.str] <- genes$txEnd[n.str]
tss.us <- tss - 3000
tss.us[n.str] <- tss[n.str] + 3000
tes <- genes$txEnd
tes[n.str] <- genes$txStart[n.str]
genes$txStart <- tss
genes$txUs <- tss.us
genes$txEnd <- tes

# For querying against other genomic ranges make a GRanges object for the
# upstream regions (requires upstream and TSS coordinates in ascending order)

asc.coords <- transform(genes[c("txStart","txUs")],
            min = ifelse(n.str, txStart, txUs),
            max = ifelse(n.str, txUs, txStart))
min.coords <- asc.coords$min
max.coords <- asc.coords$max

tss.us.regions <- GRanges(seqnames = Rle(genes$chrom),
                 strand = Rle(genes$strand),
                 ranges = IRanges(min.coords, max.coords),
                 TxID = genes$tx_id) # [example metadata column]

ADD COMMENT • link updated 5.9 years ago by Ram 44k • written 8.8 years ago by Louis ▴ 160

Ram · Accepted Answer · 2012-09-26

12

Entering edit mode

12.2 years ago

Pierre Lindenbaum 164k

"If I want 3kb upstream of the TSS for genes on the - strand, should I add this 3000 to txEnd?"

yes.

the position are always ordered from the 5' of the chromosome to the 3' and named xxStart and xxEnd (with xxStart<=xxEnd) whatever the value of "strand".

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select count(*) from refFlat where txStart>txEnd or cdsStart>cdsEnd'
+----------+
| count(*) |
+----------+
|        0 |
+----------+

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select txStart,txEnd from refFlat where strand="-" limit 4'
+-----------+-----------+
| txStart   | txEnd     |
+-----------+-----------+
|  62929370 |  62937380 |
|     34610 |     36081 |
| 153270337 | 153283194 |
| 113235158 | 114449242 |
+-----------+-----------+

So you'll only have to :

newStart=(start<3000?0:start-3000);
newEnd=end+3000;

EDIT: I was too fast. My pseudo-code extends the sequence on both side. If you want to extend upstream sequence :

newStart=(strand=='+'?start<3000?0:start-3000:start);
newEnd=(strand=='+'?end:end+3000);

to fetch the (positive strand) sequence; See

How to get the sequence of a genomic region from UCSC?

and

Extract sequence from the genome?

ADD COMMENT • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

@Pierre, what your example shows is that the answer to his question is "yes". Since txStart is always <= txEnd. You have to add 3000 to txEnd when strand == -.

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by brentp 24k

1

Entering edit mode

oppps!!! . fixed. Thank you http://xkcd.com/745/

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

a bit alarming that you got 4 up-votes with an incorrect response. also, your last code-block with newStart and newEnd should take strand into account.

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by brentp 24k

10

Entering edit mode

Lindenbaum's Cat

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by Istvan Albert 102k

6

Entering edit mode

Yes, but, to be fair, Lindenbaum's cat writes java code in its spare time.

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by brentp 24k

0

Entering edit mode

I love solutions that involve mysql queries - by the way this is not unlike a paper reviewing process, does anyone actually redo the analysis to make sure that every single command was right or just look at it and say ok that looks pretty sweet stuff

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by Istvan Albert 102k

0

Entering edit mode

No he was asking 3 questions (3 question marks). And on the first he said: "If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct?"

This is wrong.

For a + gene adding to the TSS goes into the gene, not upstream of the gene. This is only correct for - genes.

ADD REPLY • link updated 5.9 years ago by Ram 44k • written 12.2 years ago by Ido Tamir 5.2k