Hi all!
I am currently trying to annotate TF ChIP-seq peaks from cancer cell lines against the human genome and extract those associated with promoter/TSS regions. Specifically, we are interested in genes that are bound by our TF at the promoter/TSS and are also differentially expressed. So far, I've been using ChIPseeker's annotatePeak function with a custom TxDb built from the GENCODE gtf file. I am using the GENCODE gtf because we use the same annotation for RNA-seq mapping and transcript quantification.
However, I am struggling with whether to perform the annotation at the "gene" or "transcript" level. I have read a few posts and papers on this topic and understand the general annotation concept, where each gene can have multiple transcripts with their own TSSs/promoters, but I am still unsure which level to use.
I lean towards specifying "transcript", but some of my peaks are then annotated to the promoter/TSS region of transcripts whose type in the GENCODE gtf is something like "Nonsense-mediated decay" or "Retained intron", and I'm not sure whether this makes sense. Also, the final list of genes bound at the promoter/TSS differs quite a lot depending on whether I use the "gene" or "transcript" level, which is to be expected but makes my decision even harder.
Therefore, I wanted to ask whether anyone here has more experience with this and could comment or offer suggestions. Is it better to use the "gene" or "transcript" level, or to create a filtered gtf file instead?
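To make the filtered-gtf option concrete, here is a rough Python sketch of what I have in mind. It is only an illustration under some assumptions: the function names (filter_gtf_lines, filter_gtf_file) are made up for this post, the default of keeping only "protein_coding" transcripts is an arbitrary choice, and it assumes the attribute column is written GENCODE-style, e.g. transcript_type "retained_intron";

```python
import re

# Assumption: GENCODE-style attributes such as
#   transcript_type "protein_coding";
TRANSCRIPT_TYPE = re.compile(r'transcript_type "([^"]+)"')


def filter_gtf_lines(lines, keep_types=frozenset({"protein_coding"})):
    """Yield comment lines, gene records, and any record whose
    transcript_type is in keep_types (hypothetical helper)."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip lines that are not valid 9-column GTF records
        if fields[2] == "gene":
            # Gene records have no transcript_type, so they are kept as-is.
            # Note: a gene whose transcripts are all dropped still keeps
            # its gene record in the output.
            yield line
            continue
        match = TRANSCRIPT_TYPE.search(fields[8])
        if match and match.group(1) in keep_types:
            yield line


def filter_gtf_file(in_path, out_path, keep_types=frozenset({"protein_coding"})):
    """Write a filtered copy of an uncompressed GTF file (paths are placeholders)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_gtf_lines(src, keep_types))
```

The filtered file would then be used to build the TxDb in place of the full gtf, so that peaks can no longer be assigned to, for example, retained-intron transcripts. Whether restricting to "protein_coding" transcripts is the right filter is exactly the part I am unsure about.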
Thanks in advance!