I have some ChIP-seq data. The three replicates have widely differing read lengths: 40, 50 and 100 bp. I intend to call peaks on each replicate, then merge them into a pooled data set and call peaks on that as well.
If I pool them, I thought it might be a good idea to either artificially shorten all reads in all samples to 40 bp, or lengthen them all to 100 bp. It seems a safe bet that the fragment length for each sample is at least 100 bp, so I thought it shouldn't be a problem to just assume a read length of 100 bp for all samples.
I have also considered not altering the read lengths and pooling the samples just as they are. The local average read length in any sufficiently large portion of the genome should be relatively constant.
Would you lengthen all three samples to 100 bp, shorten them all to 40 bp, or neither?
Let's say I lengthen the sample with 40 bp reads to 100 bp. Would you do the peak calling for that individual sample on the original 40 bp reads or on the new 100 bp reads?
Does it make sense to merge these? (I am planning to do a reproducibility analysis based on IDR, and to merge if they seem reproducible.)
Could you please explain how you plan to lengthen the 40 bp reads?
Take the 40 bp reads, for instance. I will align the reads with bwa to get SAM, convert to BAM using samtools, and convert the alignments to BED format using bamToBed. Then I can apply an extension of 60 bp, extending reads on the forward strand in the forward direction and reads on the reverse strand in the reverse direction. The approach would be somewhat similar to slopBed in the bedtools package. As far as I can tell, slopBed is not strand-aware, so I will probably write my own script.
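The strand-aware extension described above could be sketched roughly as follows. This is only an illustration, not a tested pipeline step: the function names `extend_interval` and `extend_bed_line`, the `target_len` parameter and the optional `chrom_sizes` dictionary are all made up for the example, and it assumes 6-column BED input (as bamToBed writes by default) with the strand in column 6.

```python
def extend_interval(start, end, strand, target_len, chrom_size=None):
    """Extend a 0-based, half-open BED interval in its 3' direction.

    Reads already at least target_len long are left unchanged.
    """
    extra = max(0, target_len - (end - start))
    if strand == "-":
        # Reverse-strand read: its 3' end is the lower coordinate.
        start = max(0, start - extra)
    else:
        # Forward-strand read: its 3' end is the higher coordinate.
        end = end + extra
        if chrom_size is not None:
            end = min(end, chrom_size)  # do not run off the chromosome
    return start, end


def extend_bed_line(line, target_len=100, chrom_sizes=None):
    """Apply extend_interval to one tab-separated BED6 line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
    size = chrom_sizes.get(chrom) if chrom_sizes else None
    new_start, new_end = extend_interval(start, end, strand, target_len, size)
    fields[1], fields[2] = str(new_start), str(new_end)
    return "\t".join(fields)
```

With `target_len=100`, a 40 bp read is extended by 60 bp and a 50 bp read by 50 bp, so the same function covers both shorter replicates. It may also be worth checking the installed bedtools version first: bedtools slop documents an `-s` option that applies `-l` and `-r` relative to strand, which might make a custom script unnecessary.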
Thank you for your explanation.