Hello,
I have some mouse RNA-seq and ATAC-seq data, and I am trying to correlate changes in gene expression with changes in promoter accessibility using HOMER. To do this I need a BED file of TSSs around which I will analyze accessibility and then plot it with deepTools.
I am a bit confused about the number of TSSs compared to the number of genes. I have a list of RefSeq TSSs which I downloaded from https://ccg.epfl.ch/mga/mm10/refseq/refseq.html and another which is included with HOMER. Both of these files have around 23 thousand TSSs. My RNA-seq count file has ~46,000 Ensembl IDs, and I am not sure how to reconcile this difference.
If I only select TSSs which overlap between the RNA-seq count file and the RefSeq TSS file, I will not be analyzing accessibility for almost half of the entries present in the RNA-seq data. Alternatively, if I download TSSs for all Ensembl IDs from BioMart, it gives me almost 100k entries, as each gene can have multiple transcripts. I do not understand this huge difference between the number of RefSeq TSSs and the number of Ensembl IDs, or what the best way of doing this analysis is. I would appreciate any tips.
Thanks
Thanks, that is very helpful and I will try it.
The problem I am confused about is that my RNA-seq count file has ~45000 Ensembl IDs, and if I filter it in any way it will bias the differential expression analysis. Just to elaborate a bit more, I am interested in finding out whether the differentially expressed genes from my RNA-seq data also show changes at the chromatin level or whether they are being regulated by a chromatin-independent mechanism. Standard DESeq2 analysis gives me around 15000 Ensembl IDs that show differential expression with p value <0.05. HOMER analysis with the default RefSeq TSS file, looking for TSSs that show significant changes at the promoter (-500 bp to +100 bp), gives me only around 400 sites. My preliminary conclusion is that the majority of DEGs are not being regulated at the chromatin level, but I am concerned by the fact that the HOMER analysis was done on ~23000 RefSeq genes while the DESeq2 analysis was done on ~45000 Ensembl IDs.
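To make the concern about mismatched gene sets concrete, here is a minimal sketch of how the overlap could be counted so that DEGs which never had a promoter tested are reported separately instead of being counted as "no chromatin change". The function name and the three inputs are hypothetical, and it assumes all identifiers have already been mapped to the same ID namespace (for example, Ensembl gene IDs).

```python
def chromatin_overlap_summary(de_genes, tested_promoters, changed_promoters):
    """Summarize how many DE genes had a promoter tested and how many changed.

    All three arguments are iterables of gene IDs in the SAME namespace
    (an assumption; mapping RefSeq to Ensembl is not handled here).
    """
    de = set(de_genes)
    tested = set(tested_promoters)
    # A promoter can only count as "changed" if it was actually tested.
    changed = set(changed_promoters) & tested

    de_tested = de & tested        # DE genes whose promoter was analyzed
    de_untested = de - tested      # DE genes with no TSS in the annotation used
    de_changed = de_tested & changed

    fraction = len(de_changed) / len(de_tested) if de_tested else None
    return {
        "de_total": len(de),
        "de_tested": len(de_tested),
        "de_untested": len(de_untested),
        "de_changed": len(de_changed),
        "fraction_changed": fraction,
    }
```

The point of the design is the denominator: the fraction is computed over DE genes that had a promoter tested, not over all DE genes, so a smaller TSS annotation does not by itself push the estimate toward "not regulated at the chromatin level".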
I have little insight into the details of your analysis, but I strongly suggest you stay consistent with the reference annotation you use. Don't mix them; this only adds uncertainty. HOMER typically (at least its motif search functions) can accept custom references. Maybe try to give it the Ensembl annotations.
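As a rough sketch of what preparing Ensembl-based promoter regions could look like, the snippet below collapses a per-transcript table to one TSS per gene and writes BED-style rows for a -500/+100 window. Several things here are assumptions rather than anything from the thread: the `Transcript` record and function names are hypothetical, the TSS coordinates are taken to be 1-based, strand is taken to be already written as "+" or "-", and picking the most 5' TSS per gene is only one of several possible ways to choose a representative.

```python
from collections import namedtuple

# Hypothetical record for one transcript; tss is a 1-based coordinate.
Transcript = namedtuple("Transcript", "gene_id chrom strand tss")


def one_tss_per_gene(transcripts):
    """Keep the most 5' TSS per gene (an assumed selection rule)."""
    best = {}
    for t in transcripts:
        cur = best.get(t.gene_id)
        if cur is None:
            best[t.gene_id] = t
        elif t.strand == "+" and t.tss < cur.tss:
            best[t.gene_id] = t  # further upstream on the plus strand
        elif t.strand == "-" and t.tss > cur.tss:
            best[t.gene_id] = t  # further upstream on the minus strand
    return best


def promoter_bed_rows(transcripts, upstream=500, downstream=100):
    """Return BED6-style tuples (0-based, half-open) around each gene's TSS."""
    rows = []
    for gene_id, t in sorted(one_tss_per_gene(transcripts).items()):
        if t.strand == "+":
            start, end = t.tss - upstream - 1, t.tss + downstream
        else:
            # On the minus strand, "upstream" means higher coordinates.
            start, end = t.tss - downstream - 1, t.tss + upstream
        rows.append((t.chrom, max(start, 0), end, gene_id, 0, t.strand))
    return rows


if __name__ == "__main__":
    example = [
        Transcript("ENSMUSG_A", "chr1", "+", 1000),
        Transcript("ENSMUSG_A", "chr1", "+", 1500),
        Transcript("ENSMUSG_B", "chr2", "-", 2600),
    ]
    for row in promoter_bed_rows(example):
        print("\t".join(str(field) for field in row))
```

Using the gene ID as the BED name column keeps the promoter regions keyed by the same identifiers as the RNA-seq count file, which is what makes the two analyses directly comparable.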
Thanks. I was avoiding Ensembl annotations because they give a huge number of TSSs, but as you said, just downloading all Ensembl TSSs and supplying them to HOMER may be the best way to do it.