Question

normalizing atac seq samples with huge difference in coverage.

1

Entering edit mode

5.0 years ago

jinkyustar ▴ 10

I recently generated ATACseq data for 10 patient samples, but their visualization of RPGC normalized bigwig file on IGV is too different from one another when the data range is group-autoscaled. In contrast, when they were individually autoscaled, it shows comparable size of consensus peak and differentially expressed peaks in the patient group as we expected to see.

The example of it is shown in the attached picture.

I assume this variance came from varying library complexity caused by different number of cells used during the library prep. My PI wants to reduce the difference between these samples, so I tried to normalize bam files of these samples by using bamCoverage and scale factor, calculated from DEseq normalization, following the Greenleaf lab’s atac-seq analysis workflow.

Unfortunately this normalization method did not make the bigwig file look that much similar.

Could you let me know if there is better way to normalize these samples?

enter image description here

when I first used bamCoverage, I used following command.

$BAM_COVERAGE -b ${sample_name}.final.bam -o ${sample_name}.normTo1x.bw -bs 10 --normalizeTo1x $BAM_COVERAGE_NRM_TO_1X --blackListFileName $BLACKLIST_REGIONS --ignoreForNormalization chrX chrY --ignoreDuplicates -p $NTHREADS

Then the result was like the left panel of the picture above.

so I decided to use scale factor from DEseq normalization. I used following R code.

Finding Consensus regions and regions to count

myGRangesList_cov_cd14 <- GRangesList(myPeaks_cov_cd14)
reduced_cov_cd14 <- reduce(unlist(myGRangesList_cov_cd14)) 
consensusIDs_cov_cd14 <- paste0("consensus_", seq(1, length(reduced_cov_cd14)))  
mcols(reduced_cov_cd14) <- do.call(cbind, lapply(myGRangesList_cov_cd14, function(x) (reduced_cov_cd14 %over% x) + 0))
reducedConsensus_cov_cd14 <- reduced_cov_cd14
mcols(reducedConsensus_cov_cd14) <- cbind(as.data.frame(mcols(reducedConsensus_cov_cd14)), consensusIDs_cov_cd14)
consensusIDs_cov_cd14 <- paste0("consensus_", seq(1, length(reducedConsensus_cov_cd14)))
return(reducedConsensus_cov_cd14)
consensusToCount_cov_cd14 <- reducedConsensus_cov_cd14
consensusToCount_cov_cd14 <- consensusToCount_cov_cd14[!consensusToCount_cov_cd14 %over% blklist & !seqnames(consensusToCount_cov_cd14) %in% "chrM"]

regionsToCount_cov_cd14 <- data.frame(GeneID = paste("ID", seqnames(consensusToCount_cov_cd14),   start(consensusToCount_cov_cd14), end(consensusToCount_cov_cd14), sep = "_"),   Chr = seqnames(consensusToCount_cov_cd14),  Start = start(consensusToCount_cov_cd14),    End = end(consensusToCount_cov_cd14), Strand = strand(consensusToCount_cov_cd14))

Counting..

 fcResults_cov_cd14 <- featureCounts(bamsToCount_cov_cd14, annot.ext = regionsToCount_cov_cd14, isPairedEnd = TRUE, countMultiMappingReads = FALSE, maxFragLength = 100)

myCounts_cov_cd14 <- fcResults_cov_cd14$counts

normalization..

metaData_cov_cd14 <- data.frame(Group_cov_cd14, row.names = colnames(myCounts_cov_cd14))
atacDDS_cov_cd14 <- DESeqDataSetFromMatrix(myCounts_cov_cd14, metaData_cov_cd14, ~Group_cov_cd14, rowRanges = consensusToCount_cov_cd14)
atacDDS_cov_cd14 <- DESeq(atacDDS_cov_cd14)
sizeFactors(atacDDS_cov_cd14)

result is..

cov17     cov20     cov21     cov47     cov72     cov96     HD243 
1.0550277 1.0587087 1.9685292 0.7116851 0.7409899 1.0541563 1.0426742

Using this size factors I ran bamCoverage again.

bamCoverage -b cov47_S1_merged.final.bam -o cov47_S1_merged.deseqSF.bw -bs 10 --scaleFactor 1.9685292  --ignoreForNormalization chrX chrY --ignoreDuplicates -p 2

Some samples became similar to each other, but overall, they still show huge variance in data range on IGV viewer.

How can I address this problem? Thanks!

ATAC normalization • 3.0k views

ADD COMMENT • link 5.0 years ago by jinkyustar ▴ 10

1

Entering edit mode

Be aware that bamCoverage multiplies the raw coverage with the provided scale factor whereas DESeq2 divides it. You have to take the reciprocal value of the factors and feed this into bamCoverage. I also recommend inspecting normalization between pairs of samples using MA-plots. A properly normalized pair should have the majority of data points centered along y = 0. YOu obviously have notable differences in signal-to-noise ratio based on these plots so it is not surprising that the per-million scaling from bamCoverage performs poorly. DESeq2 (if used properly) should actually work well. Try with the reciprocal values.

ADD REPLY • link 5.0 years ago by ATpoint 88k

0

Entering edit mode

I really appreciate your comment. I tried using the reciprocal value of size factor from DEseq in BamCoverage, but they look very similar to the previously normalized set of data, except for the data range is increased for all samples. Could using consensus peak regions as counting regions during DEseq normalization affect the values for the size factor?

ADD REPLY • link 5.0 years ago by jinkyustar ▴ 10

0

Entering edit mode

Please see How to add images to a Biostars post to add your images properly. You'll need to use a host like imgbb, which allows public access to images via a URL. URLs that need authentication will not be useful.

ADD REPLY • link 5.0 years ago by Ram 45k