At the risk of missing something fundamental about the UCSC browser and its annotations, and thus looking like the foolish newb I am on Biostars, I will ask whether anyone can shed light on a situation I encountered.
I am mapping ChIP-seq reads to mouse RefSeq transcription start sites (TSSs). As one source of the TSS annotations, I downloaded a list from the UCSC Table Browser, clicking the following options.
I then downloaded the list of TSSs. The first few lines are:
#name chrom strand txStart
NM_001008533 chr1 - 134199214
NM_001039510 chr1 - 134199214
NM_001282945 chr1 - 134199214
NM_175642 chr1 - 25067475
NM_207653 chr1 + 58713285
NM_009805 chr1 + 58713285
NM_008922 chr1 - 33453807
If you go to the very first coordinate listed on chromosome 1 (134199214) in the browser, it is actually the transcription termination site (TTS) of the minus-strand gene NM_001008533, as indicated by the direction of the arrows for this gene.
There are other examples in this list. There were enough of them that, combined with the proximity of some of these sites to actual TSSs, K-means clustering identified a whole group of genes based on ChIP-seq signal mapped to listed "TSSs" that were in fact TTSs. This was complicated by nearby genes oriented tail to head on the same strand as the genes whose TTS was annotated as a TSS.
I may be missing something. In any case, the scenario and any clarification may prove useful to a bench scientist like myself faced with some data analysis tasks. It may also serve as a word of caution about the nature of genome annotation. There were few enough instances that a composite plot or metagene analysis of many thousands of genes looked as one would expect, but clustering identified what at first glance was an interesting group. This group of genes survived further analysis aimed at filtering out potential artifactual causes. It wasn't until I started looking at a number of the genes one at a time on the browser that I saw what I just described.
Thanks for the answers. I find this annotation absurd.
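For reference, UCSC gene tables store txStart as the lower genomic coordinate regardless of strand, so for - strand transcripts the TSS sits at the txEnd side. Below is a minimal sketch of deriving strand-aware TSSs. It assumes the download also includes the txEnd column (my list above did not) and that coordinates are 0-based, half-open, as in the UCSC database tables. The function name and the example rows are hypothetical, not real annotation values.

```python
def strand_aware_tss(strand, tx_start, tx_end):
    """Return the 0-based TSS position for one UCSC-style transcript row.

    Assumes tx_start is 0-based inclusive and tx_end is exclusive
    (half-open interval), with tx_start < tx_end on both strands.
    """
    if strand == "+":
        return tx_start        # first transcribed base is the left end
    if strand == "-":
        return tx_end - 1      # first transcribed base is the right end
    raise ValueError(f"unexpected strand: {strand!r}")


# Hypothetical rows: (name, chrom, strand, txStart, txEnd)
rows = [
    ("tx_plus_example", "chr1", "+", 2000, 6000),
    ("tx_minus_example", "chr1", "-", 1000, 5000),
]

for name, chrom, strand, tx_start, tx_end in rows:
    print(name, chrom, strand, strand_aware_tss(strand, tx_start, tx_end))
```

Selecting txEnd alongside txStart in the Table Browser output and applying a rule like this per row should give a list where every coordinate is a start site rather than a mix of start and termination sites.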
Others new to this type of analysis may learn from how I overlooked this issue at first glance. Because the factor I am looking at is found largely just downstream of the TSS, and because I restricted my analysis to a small region around each entry in the RefSeq TSS list, I was really only analyzing + strand genes. K-means then revealed a class of genes with the majority of the ChIP-seq signal upstream, rather than downstream, of the TSS. This turned out to be signal from nearby TSSs on the - strand mapping to TTSs, also on the - strand, that were present on my TSS list. This is basically transitive disaster, defined.
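To make "upstream" and "downstream" concrete, here is a small sketch of a strand-aware signed distance, where negative always means upstream in the direction of transcription. The function name and the coordinates are hypothetical and only illustrate how the same read position flips sign when the strand is handled differently. It is not a reconstruction of my actual pipeline.

```python
def signed_offset(read_pos, tss, strand):
    """Distance of a read from a TSS in transcript orientation.

    Negative values are upstream of the TSS, positive values are
    downstream, for either strand.
    """
    delta = read_pos - tss
    if strand == "+":
        return delta
    if strand == "-":
        return -delta          # lower coordinates are downstream on '-'
    raise ValueError(f"unexpected strand: {strand!r}")


# Hypothetical read at 4900 next to a hypothetical anchor at 4999.
as_minus = signed_offset(4900, 4999, "-")   # downstream of a '-' strand TSS
as_plus = signed_offset(4900, 4999, "+")    # upstream if treated as '+'
print(as_minus, as_plus)
```

The same read lands on opposite sides of the anchor depending on strand, which is why a few mis-anchored entries can show up as a distinct cluster.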