Question

Peak calling on FastX-collapser processed data

10.2 years ago

kevt1999 • 0

I am processing IP data by first aligning with Bowtie and then calling peaks with MACS. To save CPU cycles, I was told that I should use the FastX-collapser tool to remove duplicate reads before feeding my reads into Bowtie. The collapser tool takes FASTA entries of the same length and sequence and combines them into a single entry, with the occurrence count appended to the end of the ID after a "-". For example:

>1
GGAC
>2
GGAC
>3
GGAC
>4
ATCGTTT

Becomes:

>1-3
GGAC
>2-1
ATCGTTT
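
To make the collapsing step concrete, here is a minimal Python sketch that reproduces the example above. It is not the FastX-collapser source: the function name `collapse` is made up for illustration, and numbering the output entries by descending count is an assumption chosen to match the example.

```python
from collections import Counter

def collapse(records):
    """Collapse (read_id, sequence) pairs into (rank-count, sequence) pairs.

    Hypothetical helper for illustration only; it mimics the ID format
    shown in the example, not the tool's actual implementation.
    """
    counts = Counter(seq for _, seq in records)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [
        (f"{rank}-{n}", seq)
        for rank, (seq, n) in enumerate(counts.most_common(), start=1)
    ]

reads = [("1", "GGAC"), ("2", "GGAC"), ("3", "GGAC"), ("4", "ATCGTTT")]
collapsed = collapse(reads)
for new_id, seq in collapsed:
    print(f">{new_id}")
    print(seq)
```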

My question is: does MACS 1.4 (http://liulab.dfci.harvard.edu/MACS/README.html) take the "-"-appended read count from collapsed data into account? I assume it doesn't, and I think it needs this information to calculate peak enrichment correctly. However, MACS seems to go through its own process of removing duplicate reads, suggesting that duplicate reads might not be important after all.

Does anybody know if the read count matters? Do I need to re-expand my data set after Bowtie alignment before feeding it to MACS?
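
For reference, this is roughly what "re-expanding" would look like if it turned out to be necessary. It is only a sketch under assumptions: the alignments are tab-separated text with the read name in the first column (as in SAM output), header lines start with "@", and the read name still ends in "-count". The function name `expand_alignments` and the "_i" suffix used to keep names unique are made up for illustration.

```python
def expand_alignments(lines):
    """Repeat each alignment line according to the count in its read name.

    Assumes tab-separated records whose first column is a collapsed
    read ID such as "1-3" (3 copies). Header lines ("@...") pass through.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        name, sep, rest = line.partition("\t")
        count = int(name.rsplit("-", 1)[1])
        for i in range(1, count + 1):
            # Suffix each copy so read names stay unique (an arbitrary choice).
            yield f"{name}_{i}{sep}{rest}"

example = [
    "@HD\tVN:1.0\n",
    "1-3\t0\tchr1\t100\t255\t4M\t*\t0\t0\tGGAC\tIIII\n",
    "2-1\t16\tchr1\t500\t255\t7M\t*\t0\t0\tAAACGAT\tIIIIIII\n",
]
expanded = list(expand_alignments(example))
print(len(expanded))
```

Whether this step is needed at all depends on how the peak caller treats reads at identical positions, which is the open question here.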

ChIP-Seq • 2.1k views

updated 2.6 years ago by Ram 45k • written 10.2 years ago by kevt1999 • 0

Ram · Answer 1 · 2015-06-12

10.2 years ago

Ian 6.1k

Personally, I do not see much to be gained by processing the reads in the way you describe. MACS/MACS2 does remove redundant reads sharing the same strand and 5' coordinate; however, the --keep-dup parameter (an integer N, or auto) can allow some level of redundancy, for example when you have high read coverage and a short genome. I hope that helps.
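
To illustrate the idea of redundancy filtering by strand and 5' coordinate, here is a small Python sketch. It is not MACS code: the function name `filter_duplicates`, the tuple layout, and the `keep_dup` argument (loosely modelled on the duplicate-limit option mentioned above) are assumptions for illustration, and the "auto" mode is not covered.

```python
from collections import defaultdict

def filter_duplicates(tags, keep_dup=1):
    """Keep at most `keep_dup` tags per (chromosome, strand, 5' position).

    `tags` is an iterable of (chrom, strand, five_prime) tuples.
    Illustrative only; not how MACS is actually implemented.
    """
    seen = defaultdict(int)
    kept = []
    for chrom, strand, five_prime in tags:
        key = (chrom, strand, five_prime)
        if seen[key] < keep_dup:
            seen[key] += 1
            kept.append(key)
    return kept

tags = [("chr1", "+", 100)] * 3 + [("chr1", "-", 100), ("chr1", "+", 150)]
print(len(filter_duplicates(tags, keep_dup=1)))
print(len(filter_duplicates(tags, keep_dup=2)))
```

With a limit of 1, a read collapsed from three identical copies and its three uncollapsed copies would contribute the same single tag at that position.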

updated 2.6 years ago by Ram 45k • written 10.2 years ago by Ian 6.1k

Thank you for the reply!

By "do not see much to be gained", do you mean I should not worry about CPU cycles, and feed the fully duplicated data set through Bowtie and then MACS? I see I have about 1-10 million duplicates for each read.

updated 2.6 years ago by Ram 45k • written 10.2 years ago by kevt1999 • 0

That is quite exceptional duplication. Is it specific to the protocol? If it is normal ChIP-seq, then something has gone wrong. Have you checked the reads with FastQC or the like?

updated 2.6 years ago by Ram 45k • written 10.2 years ago by Ian 6.1k

You are right, I am doing an RNA-IP, which involves fragmenting total RNA and pulling down RNA with an antibody targeting methylated RNA. I posted in the ChIP-seq section because my analysis pipeline is closer to ChIP-seq than RNA-seq.

Yes, I'll run FastQC on the data after index trimming.

updated 2.6 years ago by Ram 45k • written 10.2 years ago by kevt1999 • 0