I have a csv file that I've downloaded from this paper here (abridged version below), and it's referring to CpG locations in the genome from beadChip array data, using assembly hg18 as follows:
CpGmarker Build Chr MapInfo SourceVersion TSS_Coordinate Gene_Strand Symbol Synonym Accession GID
cg00075967 36 15 72282407 36.1 72282245 - STRA6 PP14296; FLJ12541; NM_022369.2 GI:21314699
cg00374717 36 17 63814740 36.1 63815191 + ARSG KIAA1001; NM_014960.2 GI:45430056
cg00864867 36 12 78609399 36.1 78608921 - PAWR PAR4; Par-4; NM_002583.2 GI:55769532
...
Let's focus on the first entry for now, STRA6. I downloaded the Gencode gtf files (also for hg18 here), and the corresponding lines look like this (again, abridged for simplicity):
#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts cdsEndStat exonFrames
1136 ENST00000395105 chr15 - 72258859 72282273 72259473 72281661 19 72258859,72260175,72260688,72261554,72261736, STRA6 cmpl cmpl
1136 ENST00000323940 chr15 - 72258859 72288424 72259473 72281661 19 72258859,72260175,72260688,72261554,72261736, STRA6 cmpl cmpl
The general locations line up pretty closely with the csv above, but am I being naive for thinking that the "TSS" should just be the furthest upstream position in the gtf? i.e. the lowest position within an exon (or the txStart) for genes on the + strand (highest for "-" strand ) ? It seems obvious, but I'm doubting myself because none of these values agree with the value that is provided for the "TSS" in the csv of the publication. I've also tried with RefSeq and a few others --nothing seems to match up precisely.
So am I reading the gtf files correctly to infer the TSS (is it just txStart for "+" strand and txEnd for "-"?), and if so, is there some other obvious reason why the TSS coordinates in the paper's csv file aren't the same?
Thanks for any help you can offer.
Thanks for the suggestion --I should have been more clear that both the csv _and_ the gtf were for hg18, so they are using the same assembly (but that's a good thing to check, I edited that in above.)
ok so then I'm not being naive. The TSS really is just the "furthest upstream" position included in the gene body (with the possible complication of multiple isoforms). That's helpful, thanks.
To be clear, picking the first base is a common definition of convenience, but one that may not always hold uniformly true. Introns may contain alternative TSSs, for instance, and genes can have several alternative promoters:
Ref: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000197
For ad-hoc analyses, using the first base of a canonical isoform to define a proximal promoter or otherwise start with at least one TSS for a given gene is probably fine.
If you're doing anything more complicated, however, particularly if it involves regulation, it may be worth looking at your data more deeply.