Hi everyone, I'm a newbie in the field, so I have a few super naive questions. I have a set of bulk ATACseq samples and I want to do clustering on them.
- Is using log CPM ok for clustering samples? (do I need to do quantile normalization? is there a better way to cluster samples based on ATAC signal? If so, what is it?)
- For CPM, I'm wondering if the set of R commands I used are ok:
(The way that I approached it was to call peaks with MACS2, generate a consensus peakset, get raw counts per peak, get log cpm (prior count 5))
dba_object = dba(sampleSheet= <sampleSheet.csv>) # construct dba object
###### note: the sampleSheet.csv has the associated bamReads which I assume are going to be used for getting counts?
consensus_peakset = <my consensus peak set as GRanges object>
dba_object = dba.count(dba_object, peaks=consensus_peakset, score=DBA_SCORE_READS) # raw counts
counts = dba.peakset(dba_object, bRetrieve=TRUE, writeFile = <rawcounts_filename>)
Are the above couple of lines correct for getting raw reads??? I'm getting some values that are below 1 and I thought you'd only have integer values for how many reads fall in a given peak...
later on for CPM:
raw_counts_df = read.table(<rawcounts_filename>)[,c(1:3)] # removes seqname, start, & end to get a read counts matrix
cpm_log_counts = cpm(raw_counts_df, log=TRUE, prior.count=5) # gets log cpm with prior count
I'll be grateful for any advice people might have on how I can go about clustering ATACseq data and whether my approach is in any way an ok way to do it