I have an ATAC-seq experiment and generated a PCA of my samples. I am dealing with three different tissues in a non-model organism. My issue is I do not understand what (if any) issues exist in my PCA. It does not make sense to me. Here is the PCA I generated of all 18 samples (6 samples per tissue):
This is the code I used:
multiBamSummary bins --bamfiles ATAC_01_mapped.sorted.rem_dup.bam ATAC_03_mapped.sorted.rem_dup.bam ATAC_04_mapped.sorted.rem_dup.bam ATAC_05_mapped.sorted.rem_dup.bam ATAC_07_mapped.sorted.rem_dup.bam ATAC_08_mapped.sorted.rem_dup.bam ATAC_13_mapped.sorted.rem_dup.bam ATAC_15_mapped.sorted.rem_dup.bam ATAC_16_mapped.sorted.rem_dup.bam ATAC_17_mapped.sorted.rem_dup.bam ATAC_19_mapped.sorted.rem_dup.bam ATAC_20_mapped.sorted.rem_dup.bam ATAC_25_mapped.sorted.rem_dup.bam ATAC_27_mapped.sorted.rem_dup.bam ATAC_28_mapped.sorted.rem_dup.bam ATAC_29_mapped.sorted.rem_dup.bam ATAC_31_mapped.sorted.rem_dup.bam ATAC_32_mapped.sorted.rem_dup.bam -o ./ATACresults_picard.npz --outRawCounts readCounts.tab --centerReads
plotPCA -in ATACresults_picard.npz --rowCenter -o all_tissues-pca.png
So my assumption would be that the PCA would show three clusters, corresponding to each of the three tissue types. But instead we see this more-or-less straight line.
I've also generated a heatmap, which does indeed show clustering based on tissue type:
So I have these three regions that are perfectly segregating based on the sample, yet my PCA does not reflect this. Are there some issues in my code, or is this PCA actually informative in some way?
For what it is worth, I also ran similar code, but for each of the three tissue types to generate 3 additional PCAs. In each case, the PCAs were similarly vertical, like so:
I would appreciate any insights, corrections, etc. that might be offered. And also, mtDNA has been removed, the reads were cleaned and checked prior to any of these analyses. Cheers,
I do not know the tool that generated this, but my suggestion is to go the usual way: Call peaks, make a consensus peak set, from this create a count matrix, load into R, normalize with something like DESeq2 or edgeR, get normalized counts on log2 scale and then do a PCA with the top (for example) 1000 most variable regions.
This was done using deeptools. I'll try using featurecounts to get the matrix and try an edgeR tutorial. I believe edgeR is used more than DESeq2 for ATACseq, but I'll check to make sure.
Their inference is similar, it's a metter of personal preference.