Hello,
I am running an analysis with a pipeline that links enhancers to their putative target promoters. Among other inputs, the pipeline requires a BED file containing the transcript start and end coordinates for each gene. However, it expects each gene to be represented by a single transcript, so that it can link a 500 bp region flanking that transcript's TSS to an enhancer.
I have downloaded the RefSeq annotation from the UCSC Table Browser, but it contains multiple transcripts for many genes. I am not sure what would be a reasonable criterion for choosing a single representative transcript for each gene. Are there any options in the UCSC Table Browser that I could use to narrow down the most 'representative' transcript for each gene? Or are there any other annotation databases that would help with finding a consensus TSS for each gene?
Thanks
Edit: I noticed that Ensembl BioMart has 'gene start' and 'gene end' attributes, rather than TSS and TTS, which give the outermost transcript start and end for each gene. Does it make sense to use these, and is there a way to get them from the UCSC Table Browser? (I don't want to switch to Ensembl, as I have been using the RefSeq annotation for everything else.)
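To make the 'outermost' idea concrete, here is a minimal Python sketch of how I imagine collapsing transcripts to one record per gene. It is only an illustration under assumptions: the input tuples `(gene, chrom, strand, tx_start, tx_end)` are a hypothetical layout (not the exact UCSC column order), coordinates are assumed to be 0-based half-open as in BED, the gene names are toy values, and the function names are my own.

```python
def collapse_to_gene_extent(transcripts):
    """Collapse transcripts to the outermost start/end per gene.

    transcripts: iterable of (gene, chrom, strand, tx_start, tx_end),
    assumed 0-based half-open (BED-style). Records are grouped by
    (gene, chrom, strand) so that a gene symbol annotated at more than
    one locus is not merged into a single bogus interval.
    """
    extents = {}
    for gene, chrom, strand, tx_start, tx_end in transcripts:
        key = (gene, chrom, strand)
        if key in extents:
            start, end = extents[key]
            extents[key] = (min(start, tx_start), max(end, tx_end))
        else:
            extents[key] = (tx_start, tx_end)
    return extents


def outermost_tss(strand, start, end):
    """Return the 0-based position of the outermost TSS.

    On the plus strand this is the interval start; on the minus strand
    it is the last base of the half-open interval, i.e. end - 1.
    """
    return start if strand == "+" else end - 1


# Toy example with made-up genes and coordinates.
toy = [
    ("GeneA", "chr1", "+", 100, 500),
    ("GeneA", "chr1", "+", 150, 800),
    ("GeneB", "chr2", "-", 1000, 2000),
    ("GeneB", "chr2", "-", 1200, 2500),
]
gene_extents = collapse_to_gene_extent(toy)
for (gene, chrom, strand), (start, end) in sorted(gene_extents.items()):
    print(chrom, start, end, gene, outermost_tss(strand, start, end), strand, sep="\t")
```

One caveat with this approach: the outermost TSS is simply the most upstream annotated start, which is not necessarily the most used or best supported promoter for the gene.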
If you are referring to the human genome, then take a look at the MANE project as a potential source of a single representative (with caveats) transcript per gene.
Thanks. I am working with the mouse genome. Will edit my original post to clarify.