I am processing IP data by first aligning with Bowtie and then calling peaks with MACS. To save CPU cycles, I was told to use the FastX-collapser tool to remove duplicate reads before feeding my reads into Bowtie. The collapser tool takes FASTA entries with the same length and sequence and combines them into a single entry, with the number of occurrences appended to the end of the ID after a "-". For example:
>1
GGAC
>2
GGAC
>3
GGAC
>4
ATCGTTT
Becomes:
>1-3
GGAC
>2-1
ATCGTTT
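To make the collapsing step concrete, here is a minimal Python sketch that reproduces the behaviour shown in the example above. It is not the FastX-collapser implementation; the function name `collapse_fasta` is made up for illustration, and it assumes each sequence sits on a single line and that entries are numbered by descending count.

```python
from collections import Counter


def collapse_fasta(lines):
    """Collapse identical sequences into (header, sequence) pairs.

    Assumption: one sequence per line (no wrapped FASTA records).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the original headers; only sequences are counted.
        if not line or line.startswith(">"):
            continue
        counts[line] += 1

    collapsed = []
    # most_common() sorts by descending count; ties keep first-seen order.
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        collapsed.append((f">{rank}-{n}", seq))
    return collapsed


if __name__ == "__main__":
    reads = [">1", "GGAC", ">2", "GGAC", ">3", "GGAC", ">4", "ATCGTTT"]
    for header, seq in collapse_fasta(reads):
        print(header)
        print(seq)
```

The number after the "-" is the only place the original read count survives, which is why the question below matters.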
My question is: does MACS 1.4 (http://liulab.dfci.harvard.edu/MACS/README.html) take the read count appended after the "-" in collapsed data into account? I assume it doesn't, and I think it needs this information to calculate peak enrichment correctly. However, MACS seems to go through its own process of removing duplicate reads, which suggests that duplicate reads might not be important after all.
Does anybody know if the read count matters? Do I need to re-expand my data set after Bowtie alignment before feeding it to MACS?
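If re-expansion does turn out to be necessary, it could look roughly like the Python sketch below. This is only an illustration under assumptions: the function name `expand_sam` is made up, it assumes Bowtie wrote SAM output with the collapsed ID (e.g. "1-3") kept unchanged in the first column, and it gives each copy a "_1", "_2", ... suffix. It says nothing about whether MACS actually needs the expanded reads.

```python
def expand_sam(lines):
    """Yield SAM lines, repeating each alignment by its collapsed read count.

    Assumption: the read name ends in "-<count>", as produced by collapsing.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Header lines start with "@" and are passed through untouched.
        if line.startswith("@"):
            yield line
            continue

        fields = line.split("\t")
        read_id, _, count = fields[0].rpartition("-")
        # Names without a numeric "-<count>" suffix are left as they are.
        if not read_id or not count.isdigit():
            yield line
            continue

        for i in range(1, int(count) + 1):
            # Give every copy a distinct name so downstream tools see separate reads.
            yield "\t".join([f"{read_id}_{i}"] + fields[1:])


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.0\tSO:unsorted",
        "1-3\t0\tchr1\t100\t255\t4M\t*\t0\t0\tGGAC\tIIII",
        "2-1\t0\tchr1\t500\t255\t7M\t*\t0\t0\tATCGTTT\tIIIIIII",
    ]
    for record in expand_sam(sam):
        print(record)
```

It is written as a generator so that reads with very large counts are streamed rather than held in memory all at once.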
Thank you for the reply!
By "do not see much to be gained", do you mean I should not worry about CPU cycles and should feed the fully duplicated data set through Bowtie and then MACS? I see that I have about 1-10 million duplicates for each read.
That is quite exceptional duplication. Is it specific to the protocol? If this is normal ChIP-seq, then something has gone wrong. Have you checked the reads with fastqc or a similar tool?
You are right. I am doing an RNA-IP, which involves fragmenting total RNA and pulling down RNA with an antibody that targets methylated RNA. I posted in the ChIP-seq section because my analysis pipeline is closer to ChIP-seq than to RNA-seq.
Yes, I'll run fastqc on the data after index trimming.