Hi,
I'd like to create lists of transcription start sites (TSSs) based on the GTF files of the multiple annotation sets we have in our group. For example, for mm10 we have the GENCODE annotation, the RefSeq annotation, refGene, etc. For each of them, I have a GTF/GFF file. Having a list of TSS regions is particularly helpful when doing ChIP-seq analysis.
What I tried so far is the following protocol:
- reading the GTF file using `readGFF` (rtracklayer package)
- define the TSS as the `start` position for each entry that is on the `+` strand and as the `end` position of every entry that is on the `-` strand
- get all unique TSS (per chromosome)
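For readers who want to see the three steps spelled out, here is a minimal sketch in plain Python (standard library only, not rtracklayer). The function name `read_tss`, its optional `feature` filter, and the toy GTF lines are assumptions made for illustration; it assumes a tab-separated GTF with 1-based inclusive coordinates.

```python
def read_tss(gtf_lines, feature=None):
    """Return sorted unique (chrom, tss, strand) tuples from GTF lines.

    GTF columns: seqname, source, feature, start, end, score, strand,
    frame, attributes. If `feature` is None, every entry is used, as in
    the protocol described above; otherwise only lines whose third
    column equals `feature` (e.g. "transcript") are used.
    """
    tss = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        chrom, feat, start, end, strand = (
            fields[0], fields[2], fields[3], fields[4], fields[6])
        if feature is not None and feat != feature:
            continue
        if strand == "+":
            tss.add((chrom, int(start), strand))  # TSS = start on "+"
        elif strand == "-":
            tss.add((chrom, int(end), strand))    # TSS = end on "-"
    return sorted(tss)


# Toy records (hypothetical, not taken from a real annotation).
example = [
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\texon\t300\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr2\tHAVANA\ttranscript\t1000\t2000\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]

all_entries = read_tss(example)                        # uses every line
transcripts_only = read_tss(example, feature="transcript")
```

In this toy input, using every line yields four positions while restricting to `transcript` lines yields three, because the second exon contributes its own start coordinate. Whether that kind of filter is appropriate depends on how the annotation file is structured.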
However, using the GENCODE annotation (very comprehensive), I end up with ~420k TSSs (all) or ~350k TSSs (protein-coding transcripts). This seems like too many, considering that there are only ~50k unique genes in the list.
Do you have any recommendation for how to reduce the list? For example, I could take the first/last TSS for each gene, but I don't know what a solid way to proceed would be.
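The "first/last TSS per gene" idea could look roughly like the sketch below, again in plain Python. It keeps the most upstream position per gene: the smallest start on the `+` strand and the largest end on the `-` strand. The function name `most_upstream_tss_per_gene`, the restriction to `transcript` lines, and the reliance on a GTF-style `gene_id "..."` attribute are all assumptions; this is one possible convention, not a claim that the most upstream TSS is the biologically relevant one.

```python
import re

# Assumes GTF-style attributes such as: gene_id "ENSMUSG..."; transcript_id "...";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def most_upstream_tss_per_gene(gtf_lines, feature="transcript"):
    """Map gene_id -> (chrom, tss, strand), keeping the most upstream TSS."""
    best = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue  # no gene_id attribute on this line
        gene = match.group(1)
        chrom, strand = fields[0], fields[6]
        if strand == "+":
            pos = int(fields[3])  # smaller start = further upstream
            if gene not in best or pos < best[gene][1]:
                best[gene] = (chrom, pos, strand)
        elif strand == "-":
            pos = int(fields[4])  # larger end = further upstream
            if gene not in best or pos > best[gene][1]:
                best[gene] = (chrom, pos, strand)
    return best


# Toy records (hypothetical): two transcripts per gene.
records = [
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr2\tHAVANA\ttranscript\t1000\t1800\t.\t-\t.\tgene_id "G2"; transcript_id "T4";',
    'chr2\tHAVANA\ttranscript\t1000\t2000\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]

per_gene = most_upstream_tss_per_gene(records)
```

This gives exactly one TSS per gene, so the list size matches the number of genes, at the cost of discarding alternative start sites.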
Any suggestion is appreciated. If it is easier, I would also use an online reference to get the TSSs from, but I thought it was most coherent to use the same annotation files for all the analyses (including RNA-seq data).
Thanks, Roman
Are you just pulling the start position of each entry (each line) in the GTF?
Hi Alex,
I'm planning to create my own file of TSSs with upstream and downstream regions using the GENCODE annotation GTF file. I saw your post and would like to know more about how you loaded the GTF file into R, how you defined the TSS regions, etc. Could you please help me with that?
Thanks
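As a rough illustration of the upstream/downstream regions asked about above, here is a small strand-aware sketch in plain Python. The function name `tss_window` and the window sizes in the example are hypothetical; coordinates are assumed to be 1-based and inclusive, as in GTF, and the window is not clipped at the chromosome end.

```python
def tss_window(chrom, tss, strand, upstream=1000, downstream=1000):
    """Return (chrom, start, end, strand) for a window around a TSS.

    "Upstream" is relative to the direction of transcription, so on the
    "-" strand it lies at higher coordinates than the TSS.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the first base of the chromosome.
    return (chrom, max(1, start), end, strand)


plus_window = tss_window("chr1", 1000, "+", upstream=500, downstream=100)
minus_window = tss_window("chr2", 2000, "-", upstream=500, downstream=100)
```

The asymmetry between strands is the main point: the same `upstream` value extends to lower coordinates on `+` and to higher coordinates on `-`.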