Hi all,
I am very new at bioinformatics. I have collected bulkATACseq data from 3 different timepoints (2 embryonic timepoints and 1 adult) and a positive and negative for each timepoint.
I am trying to do kmeans clustering on the combined dataset to demonstrate dynamic chromatin accessibility throughout these different timepoints and conditions.
Here is the workflow I have put my data through so far:
- checked quality of ATAC reads using fastqc
- trim adapters with cutadapt
- map to genome with bowtie
- sort sam files by coord and convert to bam
- index bam files
- make bigwigs using bamcoverage (I have tried both CPM and RPGC normalization. I have also included "extend reads", "ignore duplicates", and "center reads", bin size 10, smooth length 30)
- call peaks using Genrich
- Diffbind analysis of each possible pairwise comparison
- combined the results of each diffbind analysis into one large bed file and filtered peaks by fold change greater than 1 or less than -1, and FDR < 0.05
- sorted bed file by chromosome
- used bedtools "merge" to get rid of duplicates
- ran compute matrix (deep tools) with the following inputs: bigwig file from every condition, 1 bed file including high confidence peaks from everywhere Diffbind comparison, --referencePoint center, -b 1500, -a 1500
- used plotHeatmap with the matrix generated with computeMatrix and have tried a bunch of different numbers for kmeans clusters.
Is this a workflow that makes sense? I am not getting clusters that are specific to my adult samples, which is surprising. I am wondering if I didn't normalize my data properly. Also, on my heatmap, depending on the number I input for kmeans, I am getting clusters that are skewed way to the left or right when they should be centered. What does this indicate? I am thinking it has something to do with the bed file I am inputting -- maybe I need to cut down the number of peaks? I have tried everything I can think of to change and am not sure what to try next.
Thank you in advance!
Can you provide some results? It's a bit difficult to judge what exactly the issue might be without actually seeing the data. There could be various reasons, e.g. lack of signal, overabundance of noise, ....
What do the DiffBind peaks for the adult vs embryos look like in the heatmap (without clustering, just plotting those peaks across all the samples)? Did you try turning off the clustering and just keeping the individual DiffBind results as chunks of regions, i.e. supply each set of (unique) DiffBind peaks as a separate BED file as shown in the example here.