Pooling Replicates With Different Read Lengths
11.4 years ago
KCC ★ 4.1k

I have some ChIP-seq data. The 3 replicates have widely differing read lengths. I intend to call peaks on each replicate, and then merge them to make a pooled data set and call peaks on that data set as well. The read lengths of the three replicates are 40, 50, and 100 bp.

If I pool them, I thought it might be a good idea either to artificially shorten the read length to 40 bp for all reads in all samples, or to lengthen them all to 100 bp. It seems to me that it's a safe bet that the fragment length for each sample is at least 100 bp, so I thought it shouldn't be a problem to just assume a read length of 100 bp for all samples.

I have also considered not altering the read length and pooling the samples just as they are. The local average read length in any sufficiently large portion of the genome should be relatively constant.
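For concreteness, by pooling as-is I just mean a straight merge of the aligned replicates; a minimal sketch using pysam, with placeholder file names:

    #!/usr/bin/env python
    # Minimal sketch: pool the three replicates as-is by merging their BAMs.
    # Requires pysam (which wraps samtools); file names are placeholders.
    import pysam

    replicates = ["rep1_40bp.bam", "rep2_50bp.bam", "rep3_100bp.bam"]

    # samtools merge via pysam; -f overwrites the output if it already exists
    pysam.merge("-f", "pooled.bam", *replicates)
    pysam.index("pooled.bam")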

  1. Would you lengthen all three samples to 100bp, shorten them all to 40bp, or neither?

  2. Let's say I lengthen the sample with 40bp reads to 100bp. Would you do the peak calling for that individual sample on the original 40bp reads or on the new 100bp reads?

  3. Does it make sense to merge these? (I am planning to do a reproducibility analysis based on IDR, and to merge only if the replicates look reproducible.)

chip-seq replicates • 3.2k views

Could you please explain how you plan to lengthen the 40bp reads?


Take the 40bp reads, for instance. I will align the reads with bwa to produce SAM, convert to BAM using samtools, and convert the reads to BED format using bamToBed. Then I can apply an extension of 60bp, extending reads on the forward strand forward and reads on the reverse strand in the reverse direction by 60bp. The approach would be somewhat similar to slopBed in the bedtools package, but slopBed is not strand sensitive, so I will probably write my own script (a rough sketch is below).
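Something along these lines, for example; it is only a rough sketch, assuming standard BED6 output from bamToBed (strand in column 6), with placeholder file names:

    #!/usr/bin/env python
    # Rough sketch: extend BED6 intervals by 60 bp in a strand-aware way.
    # Forward-strand reads are extended at the 3' end (end + 60);
    # reverse-strand reads are extended leftwards (start - 60).
    # "genome.chrom.sizes" is a placeholder chromosome-sizes file used for clipping.
    import sys

    EXT = 60

    chrom_sizes = {}
    with open("genome.chrom.sizes") as fh:
        for line in fh:
            chrom, size = line.split()[:2]
            chrom_sizes[chrom] = int(size)

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        strand = fields[5] if len(fields) > 5 else "+"
        if strand == "-":
            start = max(0, start - EXT)
        else:
            end = min(chrom_sizes.get(chrom, end + EXT), end + EXT)
        fields[1], fields[2] = str(start), str(end)
        print("\t".join(fields))

I would run it as something like bamToBed -i rep1_40bp.bam | python extend_reads.py > rep1_ext.bed (the script and file names are just illustrative).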


Thank you for your explanation.

11.4 years ago
matted 7.8k

To me, the core issue here is mappability. The 100-bp reads will be able to access a larger fraction of the genome with confident (and unique, if you filter on that) mapping positions.

The question is how much this will affect peak calling. My intuition says that, with everything else hopefully being equal, you will get more peaks and more accurate peaks with greater read lengths (as mappability confidence goes up). You might get "new" peaks in the 100-bp dataset in regions that are hard to map with shorter reads. However, I haven't done a similar experiment so maybe it doesn't matter all that much in practical terms.

If it were me, I would do it both ways: compare peak calls on the original read sets (differing lengths), and then compare peak calls on reads shortened so that they are all 40bp. I predict that reproducibility will be greater for the homogenized data set, but I guess you will find out the truth.
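The shortening itself is trivial to do on the FASTQs before re-mapping. A minimal sketch, with placeholder file names (a dedicated trimmer, e.g. the CROP step in Trimmomatic, would do the same job):

    #!/usr/bin/env python
    # Minimal sketch: hard-trim every read to its first 40 bases so all three
    # replicates can be re-mapped at a uniform read length.
    # Input/output file names are placeholders; gzipped FASTQ is assumed.
    import gzip

    TRIM_LEN = 40

    with gzip.open("rep3_100bp.fastq.gz", "rt") as fin, \
         gzip.open("rep3_trimmed40.fastq.gz", "wt") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            # FASTQ records are 4 lines; trim the sequence (2nd) and quality (4th) lines
            if i % 4 in (1, 3):
                line = line[:TRIM_LEN]
            fout.write(line + "\n")

The point is to trim before alignment and then re-map, rather than clipping reads that are already aligned, since the goal is to make mappability comparable across the replicates.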

Lengthening the reads doesn't make sense to me: it can't improve mapping, and most peak callers already extend reads to an estimated fragment length on their own, so manual extension shouldn't matter to them.
