I want to normalise my paired-end ChIP-seq data (4 samples treated, 4 samples control) to reads per million (RPM). So far my calculation takes the number of reads mapped to a position, divides it by the total number of mapped reads in the sample, and then multiples this by one million. My question is, should the number of mapped paired-end reads be counted as 1 or 2 in the calculation? I believe Samtools reports the total number of mapped reads as half the actual number of paired-end reads which map?
My current pseudo script for normalising to RPM:
- mappedReads = samtools view -c -F 4 input.bam
- scalingFactor = 1000000 / mappedReads
- bedtools genomecov -ibam input.bam -bg - scale scalingFactor -g chrom.sizes > output.bedGraph
- bedGraphToBigWig input.bedGraph chrom.sizes output.bigWig
For paired end data, rpm=fpm (fragments per million). So, it should be counted as one.
Just to clarify, you are saying that each pair of reads should be counted as just one? So in a perfect world where 100% of my reads align, the number of total alignments would be half the total number of reads in the data?
Yup. Count a pair of reads as one. For single end reads each read represents a fragment it is sequenced from whereas for paired-end, a pair of sequence represents the fragment it is sequenced from. As simple as that.