Hi all, I ran an ATAC-seq pipeline (nf-core) and got output files such as BAM, bigWig, and broadPeak. Could you suggest a way to get the genes associated with open chromatin regions? I have used ChIPpeakAnno, but only on DiffBind results. Thank you so much!
ATAC-seq annotation to gene names
This is one of the most unsatisfying problems in bioinformatics since, imo, all existing solutions are poor and only crude approximations of reality. The false assignment rate is probably gigantic.
There is plenty of literature on this problem discussing several approaches (pattern-based, correlation-based, etc.), but I cannot say that any of these has crystallized as a gold standard. It usually comes down to what is written in the linked answer. These distal-to-promoter associations are cell-type-specific and may change upon perturbation. It's really a tricky problem.
Thank you so much for your help! I am quite surprised that 2.4 years later we still don't have a better solution, as you said. So if we are not sure which genes are associated with open chromatin regions, what is the most helpful information we can get from an ATAC-seq experiment, if I have only one condition and don't perform differential accessibility analysis?
You can approximate which regulatory elements control a gene. With only one condition you don't know how accessibility changes, so you would naively need to assign all called peaks within a given window to that gene, whereas using differential regions narrows it down quite a lot. You can scan differential regions for motifs, thereby approximating the transcription factors involved. You can identify whether the treatment mainly restricts or promotes accessibility... I mean, you have to know why you did the experiment, don't you?
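To make the "all called peaks within a given window" idea concrete, here is a minimal sketch in plain Python. The gene names, TSS coordinates, peak coordinates, and the 50 kb default window are all made up for illustration and are not from this thread; a real analysis would read the broadPeak file and a gene annotation instead. It also shows why this is only a crude approximation: every peak inside the window is assigned, whether or not it actually regulates the gene.

```python
from collections import defaultdict

# Hypothetical toy data: (gene, chromosome, TSS position). Not from the thread.
genes = [
    ("GENE_A", "chr1", 10_000),
    ("GENE_B", "chr1", 200_000),
    ("GENE_C", "chr2", 50_000),
]

# Hypothetical peaks: (chromosome, start, end), e.g. parsed from a broadPeak file.
peaks = [
    ("chr1", 9_500, 10_200),     # overlaps the GENE_A TSS
    ("chr1", 45_000, 45_600),    # about 35 kb from the GENE_A TSS
    ("chr1", 120_000, 120_400),  # about 110 kb from GENE_A, 80 kb from GENE_B
    ("chr2", 500_000, 500_300),  # far from any gene in this toy set
]


def assign_peaks_to_genes(peaks, genes, window=50_000):
    """Assign every peak whose midpoint lies within `window` bp of a TSS to that gene.

    A peak can be assigned to several genes, and a gene can collect many peaks.
    """
    assignments = defaultdict(list)
    for chrom, start, end in peaks:
        midpoint = (start + end) // 2
        for gene, gene_chrom, tss in genes:
            if chrom == gene_chrom and abs(midpoint - tss) <= window:
                assignments[gene].append((chrom, start, end))
    return dict(assignments)


peak_map = assign_peaks_to_genes(peaks, genes, window=50_000)
for gene, hits in peak_map.items():
    print(gene, len(hits), "peak(s) within the window")
```

Widening the window assigns more peaks to more genes, which is exactly where the false assignments come from.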
Yes, the point of the experiment is to figure out how chromatin in a disease sample has changed compared with a control sample. I used ChIPpeakAnno to get the genes for the peaks that DiffBind reported as differentially accessible. But DiffBind gave me around 100k peaks, and 20k genes after annotating, so it seems I haven't gotten much useful information yet. I hope to get something like a list of genes that are closed in control but more open in disease.
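A list like that can be sketched as follows, assuming the peaks are already annotated with a gene and carry a log2 fold change and an FDR. Everything here is hypothetical: the rows are invented, the cutoffs are only defaults, and the sign convention (positive log2FC means more accessible in disease) is an assumption that depends on how the contrast was set up.

```python
# Hypothetical annotated differential peaks: (peak_id, nearest_gene, log2FC, FDR).
# Assumption: positive log2FC means more accessible in disease than in control;
# the actual sign depends on how the contrast was defined.
annotated = [
    ("peak1", "GENE_A", 2.1, 0.001),
    ("peak2", "GENE_A", 1.7, 0.010),
    ("peak3", "GENE_B", -1.9, 0.003),
    ("peak4", "GENE_C", 0.4, 0.200),
    ("peak5", "GENE_D", 3.0, 0.0004),
    ("peak6", "GENE_B", 1.5, 0.020),
]


def genes_by_direction(rows, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Split genes by the direction of their significant peaks.

    Returns (opened, closed, mixed): genes with only opening peaks, only
    closing peaks, or significant peaks in both directions.
    """
    opened, closed = set(), set()
    for _peak, gene, lfc, fdr in rows:
        if fdr >= fdr_cutoff or abs(lfc) <= lfc_cutoff:
            continue  # not significant or effect too small
        (opened if lfc > 0 else closed).add(gene)
    mixed = opened & closed
    return opened - mixed, closed - mixed, mixed


more_open, more_closed, ambiguous = genes_by_direction(annotated)
print("more open in disease:", sorted(more_open))
print("more closed in disease:", sorted(more_closed))
print("peaks in both directions:", sorted(ambiguous))
```

Genes with peaks going in both directions are reported separately, since a single "open" or "closed" label would be misleading for them.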
You might want to review your DiffBind analysis. 100k differential peaks is unlikely to be meaningful. Many ATAC-seq experiments do not even yield 100k peaks in total.
Thank you! So 115941 ranges are not correct?
Read my comment, I said "differential". The way you phrased it, it seemed that DiffBind gave you 100k differential peaks. The total number of peaks is not interesting; how many differential peaks (e.g. FDR < 0.05, abs(logFC) > 1) do you have? What does "20k genes after annotating" mean? Which data do you have, is it ATAC-seq and something else?
When I applied abs(log2FC) > 2.5, I got around 115941 peaks with FDR < 0.05, and with abs(logFC) > 1 I have a lot. I tried to annotate each peak with a gene by adding a symbol column, and I got around 20k unique gene names.
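To keep "total peaks" and "differential peaks" apart, it can help to count how many peaks survive each combination of cutoffs. Below is a minimal sketch in plain Python. The CSV content is invented, and the column names `Fold` and `FDR` are an assumption about how the results table was exported, so adjust them to whatever your file actually contains.

```python
import csv
import io

# Hypothetical results table exported to CSV. The column names "Fold" and "FDR"
# are an assumption about the export, and all values are made up.
report_csv = """peak,Fold,FDR
p1,3.1,0.0001
p2,-2.7,0.0020
p3,1.4,0.0300
p4,0.6,0.0400
p5,-1.2,0.2000
p6,0.1,0.9000
"""

rows = [
    {"peak": r["peak"], "fold": float(r["Fold"]), "fdr": float(r["FDR"])}
    for r in csv.DictReader(io.StringIO(report_csv))
]


def count_differential(rows, fdr_cutoff, lfc_cutoff):
    """Count peaks with FDR below the cutoff AND absolute log2FC above the cutoff."""
    return sum(
        1 for r in rows if r["fdr"] < fdr_cutoff and abs(r["fold"]) > lfc_cutoff
    )


total = len(rows)
for lfc_cutoff in (0.0, 1.0, 2.5):
    n = count_differential(rows, 0.05, lfc_cutoff)
    print(f"FDR < 0.05 and abs(log2FC) > {lfc_cutoff}: {n} of {total} peaks")
```

Reporting the count next to the total for each cutoff makes it obvious whether a number like 115941 refers to all tested ranges or only to the ones that pass both filters.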
I have my doubts that this is meaningful. It seems that everything changes, which makes no intuitive sense, but without having the data I cannot really say what is wrong. I advise, if at all possible, collaborating with someone experienced locally (or having your PI find a good collaborator) to have a look at this analysis and figure out what is going wrong. Continuing with this excessive number of "DE regions" is unlikely, imo, to yield anything substantial. You need to narrow down the DE regions, for example by doing more prefiltering and prioritizing large FCs, e.g. with lfcShrink in DESeq2 or glmTreat in edgeR. If this is all new to you, I again recommend finding a collaborator to make sure you don't waste many weeks on results from a potentially flawed analysis. Hope that helps.
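As a rough illustration of "prefilter, then prioritize large FCs", here is a minimal sketch in plain Python. It is a simple stand-in and not what lfcShrink or glmTreat compute: it drops regions with too few reads (which tend to carry noisy fold changes) and then sorts the rest by absolute log2 fold change. The counts, the `min_count` and `min_samples` cutoffs, and the pseudocount are all hypothetical, and the counts are assumed to be already normalized for library size.

```python
import math

# Hypothetical read counts per region: two control and two disease samples.
# Assumption: counts are already normalized for library size.
counts = {
    "region1": {"control": [120, 135], "disease": [480, 510]},
    "region2": {"control": [2, 0], "disease": [9, 1]},
    "region3": {"control": [300, 280], "disease": [310, 295]},
    "region4": {"control": [60, 75], "disease": [15, 12]},
}


def passes_prefilter(samples, min_count=10, min_samples=2):
    """Keep a region only if at least `min_samples` samples have >= `min_count` reads."""
    all_counts = samples["control"] + samples["disease"]
    return sum(c >= min_count for c in all_counts) >= min_samples


def log2_fold_change(samples, pseudocount=1.0):
    """Log2 ratio of mean disease to mean control counts, with a pseudocount."""
    mean_control = sum(samples["control"]) / len(samples["control"])
    mean_disease = sum(samples["disease"]) / len(samples["disease"])
    return math.log2((mean_disease + pseudocount) / (mean_control + pseudocount))


# Step 1: drop low-count regions.
kept = {name: s for name, s in counts.items() if passes_prefilter(s)}

# Step 2: rank the remaining regions by absolute effect size, largest first.
ranked = sorted(kept, key=lambda name: abs(log2_fold_change(kept[name])), reverse=True)

for name in ranked:
    print(name, round(log2_fold_change(kept[name]), 2))
```

The dedicated methods in DESeq2 and edgeR do this in a statistically principled way, so this sketch is only meant to show the order of operations.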
I have been looking for a mentor for more than a year, with no luck yet. I reached out to everyone I know locally, but helping someone is not an obligation, so even when I got a few replies, I learned a little more but could not completely solve the question. I can't agree more that this has been very ineffective. For some reason, my PI doesn't want a collaborator anymore. I used DiffBind, which I remember was recommended to me, and its code is quite straightforward, so I guess any errors come from the sample sheet or the nf-core output. Another bioinformatician, who works at a biotech company, analyzed the same data and also got around 100k peaks.
Does this sample sheet look correct?