Simple question - I need to create a GTF file to use in HTSeq-count that contains gene regions plus 3kb upstream. (Background: doing a MeDIP-seq experiment, want to look for differential methylation in genic and 3kb promoter regions using a count-based method like edgeR/DESeq.)
I was planning on making one myself from the UCSC hg19 refFlat table. The refFlat table has gene coordinates, but I need to extend this 3kb upstream to capture promoter regions.
Counting columns from 0, column 3 contains the strand (+/-) and columns 4-5 contain the transcription start (txStart) and end (txEnd) positions.
If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but this is only for genes on the + strand, correct? If I want 3kb upstream of the TSS for genes on the - strand, should I add the 3000 to txEnd instead?
E.g. the DENND1B gene is on the - strand at chr1:197,473,879-197,744,623. Looking at it in the browser, it's transcribed "right to left", so I presume I would want to add 3000 to the txEnd number, 197,744,623, since this is really where transcription starts.
Am I thinking about this correctly?
@Pierre, what your example shows is that the answer to his question is "yes". Since txStart is always <= txEnd, you have to add 3000 to txEnd when strand == -.
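To make the rule concrete, here is a minimal Python sketch of a strand-aware extension. The function name `extend_upstream` and its arguments are made up for illustration, and the DENND1B numbers are just the browser coordinates quoted in the question, so treat it as a sketch rather than a finished tool.

```python
def extend_upstream(tx_start, tx_end, strand, upstream=3000):
    """Return (new_start, new_end) covering the gene plus `upstream` bp before the TSS.

    Hypothetical helper: tx_start <= tx_end is assumed, as in refFlat.
    """
    if strand == "+":
        # TSS is txStart, so upstream lies at smaller coordinates.
        # Clamp at 0 so genes near the chromosome start do not go negative.
        return max(0, tx_start - upstream), tx_end
    if strand == "-":
        # TSS is txEnd, so upstream lies at larger coordinates.
        return tx_start, tx_end + upstream
    raise ValueError("unexpected strand: %r" % strand)


# DENND1B is on the - strand, so only the end coordinate moves.
print(extend_upstream(197473879, 197744623, "-"))  # (197473879, 197747623)
```

Note that this does not clamp the - strand case against the chromosome length; you would need a chromosome sizes table for that.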
Oops! Fixed. Thank you. http://xkcd.com/745/
A bit alarming that you got 4 up-votes with an incorrect response. Also, your last code block with newStart and newEnd should take strand into account.
Yes, but, to be fair, Lindenbaum's cat writes java code in its spare time.
I love solutions that involve mysql queries. By the way, this is not unlike a paper reviewing process: does anyone actually redo the analysis to make sure that every single command was right, or do they just look at it and say "OK, that looks like pretty sweet stuff"?
No, he was asking 3 questions (3 question marks). And on the first he said: "If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct?"
This is wrong.
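For a + gene, adding to txStart moves into the gene, not upstream of it; to go upstream on the + strand you have to subtract 3000 from txStart. Adding 3000 is only correct for - genes, where it is applied to txEnd.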
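Putting both strand cases together, a refFlat row could be turned into a GTF line roughly as follows. This is only a sketch under some assumptions: the function name `refflat_to_gtf` is invented, the refFlat column order is taken to be geneName, name, chrom, strand, txStart, txEnd, the feature type is written as "exon" with a `gene_id` attribute so that htseq-count's defaults (`-t exon -i gene_id`) would pick it up, and the example row is made up, not a real gene. UCSC tables store 0-based, half-open coordinates while GTF is 1-based and inclusive, hence the `+ 1` on the start.

```python
def refflat_to_gtf(line, upstream=3000, source="refFlat", feature="exon"):
    """Convert one tab-separated refFlat row to a GTF line spanning gene + promoter."""
    f = line.rstrip("\n").split("\t")
    gene, tx, chrom, strand = f[0], f[1], f[2], f[3]
    tx_start, tx_end = int(f[4]), int(f[5])

    # UCSC: 0-based start, exclusive end -> GTF: 1-based start, inclusive end.
    start, end = tx_start + 1, tx_end

    if strand == "+":
        # Upstream of a + gene is to the left: subtract, never below position 1.
        start = max(1, start - upstream)
    elif strand == "-":
        # Upstream of a - gene is to the right: add to the end coordinate.
        end = end + upstream
    else:
        raise ValueError("unexpected strand: %r" % strand)

    attrs = 'gene_id "%s"; transcript_id "%s";' % (gene, tx)
    return "\t".join([chrom, source, feature, str(start), str(end),
                      ".", strand, ".", attrs])


# Made-up row for illustration only.
row = "GENEA\tNM_000000\tchr1\t-\t1000\t5000\t1000\t5000\t1\t1000,\t5000,"
print(refflat_to_gtf(row))
```

Since refFlat has one row per transcript, a gene with several transcripts will yield several overlapping lines with the same gene_id; whether to merge them first is a separate decision.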