Hello all. I'm trying to do differential expression analysis using ATAC-seq data sets, but I'm having trouble getting HTseq-count or featureCounts to actually count the data. Any good protocols to accomplish this?
Thanks!
Hello all. I'm trying to do differential expression analysis using ATAC-seq data sets, but I'm having trouble getting HTseq-count or featureCounts to actually count the data. Any good protocols to accomplish this?
Thanks!
Differential accessibility, not differential expression.
We tend to use CSAW for this. The full pipeline is here, which is essentially:
alignmentSieve
) so only those <=150 bases (i.e., likely from open chromatin) are present.Alternatively, you could use diffBind in the last step. As Venu pointed out, you can use multiBamSummary
with a BED file to get your counts, but CSAW at least has functions that also do that.
The way I approach this, is to first use MACS2 for peak calling using all BAM files together. This will generate a bed file with coordinates of all peaks. Then I use bedtools, something like this:
bedtools coverage -b ${bamFile} -a All_peaks.bed > CoverageFiles/${bamFile%.bam}.txt
And then merge all columns with the read counts from the files for edgeR.
Bedtools can count alignments from multiple BAM files (see http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html)
"multiBamSummary" i guess this would be faster and straight forward since its multithreaded ,as i have some confusion about the featurecounts in context of Chip/ATAC data regarding the generation of saf file which is needed as i see below. It needs a combined peak file i guess so its is merging all the peak file together which is same as merge function of bedtool or it is something else?
Sorry, did not notice that sentence. There we go:
featureCounts cannot take bed or narrowPeak files directly, but wants a custom format called SAF:
## Make SAF file (+1 because SAF is 1-based, BED/narrowPeak is 0-based)
awk 'OFS="\t" {print $1"."$2+1"."$3, $1, $2+1, $3, "+"}' peaks.narrowPeak > peaks.saf
## Then run fc:
featureCounts --read2pos 5 -a peaks.saf -F SAF -T <int> -p -o peaks_countMatrix.txt *.bam
-T = number of threads; -p indicates paired-end data; --read2pos5 limits the read to the 5' end (so the actual cutting site but this is optional)
In my current pipeline I typically only use the 5' end of each read (so of both the forward and reverse read) as this is the cutting site of the transposome and treat reads as single-end for this purpose. Limiting reads to those with a certain insert size imho throws away the majority of reads so I take all of them. featureCounts has an option to only count 5' ends.
"peaks.narrowPeak" I got to this point but a little doubt, let say I have multi condition data ,so the "peaks.narrowPeak" would be combination of all the peak called by any peak calling tool .So to get the "peaks.narrowPeak" file do I have merge all the peak files or overlap the peak files ?
There is no clear guideline for this. I typically call peaks with Genrich
in ATAC-seq mode (this peak caller has a mode to handle replicates) over all replicates of all groups and then combine the peaks of all groups into a consensus file using bedtools merge
. You can also use macs2
to do the same with merged BAM files per condition or all BAM files of the experiment merged together. If you browse BioC support page for this issue you will find different opinions. Probably the safest is to merge all BAM files to get kind of the "average" location of all potential peaks, and differential analysis will then do its part to tell which are indeed significantly different.
It doesn't matter if you use DESeq2 or edgeR, both methods are well-accepted and perform similarly. I personally use edgeR but that has no specific reason. Yes you need replicates. What I am referring to is the merge of the peak regions which is the template for creating the count matrix where every sample is a column and every region is a row.
Hi ATPoint,
I have seen multiple of your answers about DE/Differential Accessibility, and they have been extremely helpful. However, with this step specifically, I am wondering how to annotate these BED regions with gene names/symbols/ID's? I understand the logic of generating a consensus file with bedtools merge, but I would like the consensus file to contain gene ID's or names for my downstream analysis. Do you have any recommendations for this? Thanks so much.
No gold standard for this. You can use the closest gene, or the closest expressed genes, or the closest differential gene. That very much depends on the question and your setup, there is lots of literature about association strategies between distal regions and genes.
That is understandable. Do you have a specific method/ set of packages that you use for annotation, namely for a simple association of peaks to the nearest gene? I have been unable to find methods for this, and so far my attempt to use Homer's AnnotatePeaks has failed, as it requires strand information (My BED only contains chr, start, end). Thanks so much.
Another ATAC-Seq pipeline is available here. As part of the pipeline a count matrix is generated and an Rscript template for differential peak calling is part of the pipeline (script is here).
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
HT-seq or featureCounts are not really suitable (tedious work to prepare an input format with bed regions) for ChIP/ATAC data. Use
bamCoverage
ormultiBamSummary
fromdeeptools
. All you need is BAMs and a bed file.Disagreed. See my comment below. bedtools does not really support paired-end data, while featureCounts perfectly does. Preparing a featureCounts input file from BED is as easy as:
" peaks.narrowPeak" is it the combination of all the peaks form different condition ?
It is what you give it, but yes, that would imho be a reasonable choice, so the peak list produced when calling peaks on the merged BAM files.
actually i don't have merged bam files as i have patient data which are not similar .So i thought would be useful if I merge all the bam files. But is there other way where i can do peak call on different condition and merge those peaks?