Hello everybody,
I’m new to bioinformatics and I am currently performing my first ChIP-seq analysis. I have questions about read mapping and about normalizing samples with an elevated number of duplicates.
During library preparation some samples went through more PCR cycles than others, which results in a larger library and a much higher duplicate rate at the end of the day. When mapping my reads with Bowtie, I’d like to restrict every sample to 10,000,000 reads so I can compare them afterwards. But given the high duplicate rate in some of my samples (up to 65%), once I remove the duplicates from my BAM file, shouldn’t I end up with a much smaller mapped library? If so, correct me if I’m wrong, that defeats the purpose of setting a fixed number of reads at the beginning: removing 20% versus 65% of reads out of 10 M will not give the same result, so am I not introducing a big bias?
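To make the concern above concrete, here is a purely arithmetic sketch (the 10 M depth and the 20%/65% duplicate rates are the numbers from the post, not measured values): downsampling every sample to the same fixed depth *before* duplicate removal still leaves very different usable library sizes afterwards.

```python
def unique_reads_after_dedup(total_reads, duplicate_rate):
    """Reads remaining once PCR duplicates are removed from a fixed-depth sample."""
    return int(total_reads * (1.0 - duplicate_rate))

fixed_depth = 10_000_000  # same nominal depth for every sample
for dup_rate in (0.20, 0.65):
    remaining = unique_reads_after_dedup(fixed_depth, dup_rate)
    print(f"{dup_rate:.0%} duplicates -> {remaining:,} unique reads")
# 20% duplicates -> 8,000,000 unique reads
# 65% duplicates -> 3,500,000 unique reads
```

So two samples that start at identical depth end up roughly 2.3-fold apart in unique reads, which is exactly why fixing the read count before deduplication does not put the samples on equal footing.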
Wouldn’t it be better to redo the mapping with a set of “clean”, duplicate-free reads?
If not, how can I correct for the bias introduced by the different numbers of PCR cycles among my samples, and between each sample and its input?
I looked at the Bowtie user manual and saw that you can set arguments for multi-mapping reads, but nothing to keep only uniquely mapping reads at that stage. I’m sure there is a good reason for that, but I missed the point… I hope my explanations were clear enough to be understood by everybody.
Thank you in advance for your help!
Thank you for your detailed answer! I indeed use Bowtie for now because my reads are short (around 50 bp for my R1 files and around 35 bp for my R2 files), but I was planning on trying Bowtie2 to see if it gives better results. I saw people recommending the MAnorm paper on other threads, so I guess I'll have a look at it and keep your advice in mind. Cheers!
MAnorm is a normalization method relatively similar to the ones DESeq2 and edgeR use, at least in its underlying assumption that many regions are not differentially bound. It is old and no longer maintained; I would not bother with it. What did they recommend it for? What is the question you want to answer? If it is only the normalization, then put your count matrix into DESeq2 or edgeR and get normalized counts from that. The critical question is whether you expect global changes in the binding profile. If so, a bin-based normalization might be desirable. Check the csaw manual for a discussion of normalization strategies.
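For intuition, here is a minimal sketch of the median-of-ratios idea behind DESeq2-style size factors, assuming you already have a per-region count matrix (toy numbers only; for a real analysis use DESeq2 or edgeR directly, which also handle dispersion and filtering):

```python
import math

def size_factors(counts):
    """Median-of-ratios size factors in the spirit of DESeq2.

    `counts` is a list of samples, each a list of per-region read counts.
    A geometric mean across samples serves as a pseudo-reference; regions
    with a zero count in any sample are skipped, as the log is undefined.
    """
    n_regions = len(counts[0])
    ref = []
    for i in range(n_regions):
        vals = [s[i] for s in counts]
        if any(v == 0 for v in vals):
            ref.append(None)  # skip regions with zeros
        else:
            ref.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
    factors = []
    for s in counts:
        ratios = sorted(s[i] / ref[i] for i in range(n_regions) if ref[i])
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors

# A sample sequenced twice as deeply gets a size factor twice as large;
# dividing each sample's counts by its factor puts them on a common scale.
counts = [[10, 20, 30], [20, 40, 60]]
print(size_factors(counts))
```

This only works if most regions are genuinely unchanged between conditions, which is exactly the assumption that fails under global binding changes and why the csaw manual's bin-based strategies exist.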