Question

Merging replicates from Encode/Roadmap project

2

8.7 years ago

curious ▴ 50

Hi,

I'm processing data from the Encode project to look at the enhancer-promoter interactions. I would like to merge the replicates (technical/biological) for a given mark and cell type.

I'm not sure how to go about merging the replicates. [1] says that 'Filtered datasets were then merged appropriately (technical/biological replicates) to obtain a single consolidated sample for every histone mark or DNase-seq in each standardized epigenome.' The paper that explains [1] is this one, but it doesn't explain how the merging is done either.

Should technical replicates be merged with each other and biological replicates be merged with each other, but not across the two groups?

The pipeline I created is: sra->fastq->fastq_trimmed->sam->bam->bam_sorted->counts. I'm trimming the unmapped reads so that read length is uniform across samples (36 bp). I derive region counts using bedtools genomecov.

Thanks for reading.

ChIP-Seq encode • 3.0k views

updated 8.7 years ago by John 13k • written 8.7 years ago by curious ▴ 50

0

This doesn't answer the question, but I had a very similar question recently and I asked a post-doc about the difference between biological and technical replicates.

Essentially, biological replicates will have much larger variance than technical replicates, so it does not make much sense to merge biological replicates together. Instead, this post talks about needing biological replicates to estimate the variance and dispersion of the data.

For technical replicates, I usually merge their fastq files together using cat prior to trimming and alignment, though I remember reading somewhere on Biostars that it might be better to run the trimming and alignment through to a sorted BAM first and then merge the replicates at that point. Someone else can clarify this, I'm sure.
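As a rough illustration of the cat approach for technical replicates, here is a minimal Python sketch. It assumes the replicates are gzipped FASTQ files; the function names are made up for this example. Byte-wise concatenation works because a gzip file may consist of several concatenated members.

```python
import gzip
import shutil
from pathlib import Path


def concat_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files byte-for-byte (same effect as `cat`)."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(output)


def count_reads(fastq_gz):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

A quick sanity check after merging is that count_reads on the merged file equals the sum of count_reads over the inputs.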

8.7 years ago by Sinji ★ 3.2k

0

Thanks for the insight on merging technical replicates, Sinji. One paper discusses the difference between technical and biological replicates.

This actually reminds me of another question that I forgot to include earlier: how do I differentiate between a technical and a biological replicate? I have an automated pipeline to process about 3000 files and would like to identify technical/biological replicates automatically. I used the R packages GEOquery and GEOmetadb, but they don't quite give out replicate information as far as I can tell.

It would be great to know if there are others.
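Once replicate labels are available from some metadata source, the grouping step itself is simple to automate. The sketch below assumes each file has a record with an experiment ID, a biological replicate number and a technical replicate number; the field names (experiment, biological_replicate, technical_replicate, file) are hypothetical and would need to be mapped to whatever the metadata actually provides.

```python
from collections import defaultdict


def group_replicates(records):
    """Group files by experiment, then by biological replicate.

    Files that share an experiment and a biological replicate number are
    treated as technical replicates of each other.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        groups[rec["experiment"]][rec["biological_replicate"]].append(rec["file"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {exp: dict(bio) for exp, bio in groups.items()}


records = [
    {"experiment": "expA", "biological_replicate": 1, "technical_replicate": 1, "file": "a1.bam"},
    {"experiment": "expA", "biological_replicate": 1, "technical_replicate": 2, "file": "a2.bam"},
    {"experiment": "expA", "biological_replicate": 2, "technical_replicate": 1, "file": "a3.bam"},
]
grouped = group_replicates(records)
```

Here grouped["expA"][1] holds the two technical replicates of biological replicate 1, and grouped["expA"][2] holds the single file for biological replicate 2.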

8.7 years ago by curious ▴ 50

score 6 · Accepted Answer · 2016-03-25

I am by no means the expert here, but if the question is simply "what did Encode do?" and "what should I do?", I can probably take a shot at it :)

When Encode say they merged replicates to get a consolidated sample, they mean they merged the BAM files with samtools merge or similar. For the data they produced, this is most likely a fine thing to do, because read depth wasn't very high back then and the variance between individual ChIP/DNase sequencing runs is significantly lower than for RNA-Seq - particularly at the read numbers they were mapping. For RNA-Seq, however, I can think of essentially no good reason to merge anything.
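For reference, a read-level merge with samtools merge can be scripted along these lines. This is only a sketch: the file names are placeholders, the helper name is made up, and it assumes the input BAMs are already coordinate-sorted. It builds the command without running it, so it can be inspected first.

```python
import shlex


def samtools_merge_cmd(output_bam, input_bams):
    """Build the argv for `samtools merge <out.bam> <in1.bam> <in2.bam> ...`."""
    if len(input_bams) < 2:
        raise ValueError("need at least two BAM files to merge")
    return ["samtools", "merge", output_bam, *input_bams]


cmd = samtools_merge_cmd("merged.bam", ["rep1.sorted.bam", "rep2.sorted.bam"])
print(shlex.join(cmd))
```

To actually run it, pass cmd to subprocess.run(cmd, check=True) on a machine where samtools is installed.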

Regarding "what should I do", that's a much more interesting question :-) The reason is, there aren't many tools (that I know of) that make use of ChIP/DNase replicate information. Most of the time, we end up merging everything together and treating all the reads the same. Of course we do the QC of the reads individually - looking at tracks individually in IGV or producing heat maps per replicate - and in some scenarios we'll do Input/GC-bias correction at the run level and then merge at a higher level (signal bin counts), but only because we can't merge at the read level in those situations.
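To make "merge at the signal level" concrete, here is a toy sketch that combines per-bin counts from several runs. Note the substitution: it uses simple depth scaling (counts per million) as a stand-in for a per-run correction and does not implement Input or GC-bias correction. The function names are invented for this example, and it assumes every run was counted over the same bins in the same order.

```python
def counts_per_million(counts):
    """Scale raw bin counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot scale a run with zero total counts")
    return [c * 1_000_000 / total for c in counts]


def merge_signal(runs):
    """Average depth-scaled bin counts across runs, bin by bin."""
    if not runs:
        raise ValueError("no runs given")
    n_bins = len(runs[0])
    if any(len(run) != n_bins for run in runs):
        raise ValueError("all runs must cover the same bins")
    scaled = [counts_per_million(run) for run in runs]
    return [sum(values) / len(scaled) for values in zip(*scaled)]


# Two runs with the same profile but a 10x difference in depth.
merged = merge_signal([[1, 1, 2], [10, 10, 20]])
```

Because both runs are scaled before averaging, the deeper run does not dominate the merged profile, which is the main difference from pooling raw reads.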

However, you never know what breakthrough is around the corner, so it would be very foolish of me to suggest replicate information isn't important. Fortunately, you can still have your cake and eat it too, by merging the reads together into one BAM file, but tagging the reads with an RGID (read group ID) that is specific to their biological/technical replicate group. How useful this is depends on whether software can understand the RGID field and do something useful with it. The only software I know of that does is GATK, which will use the technical replicate information to model the quality of the sequencing when calling SNPs. Other than GATK though, I don't know of any software that does anything useful with replicate information - but maybe others can chime in with examples :)
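One way to tag each replicate before merging is samtools addreplacerg, which sets a read group on every read in a BAM file. The sketch below only builds the command. The read group naming scheme (sample.bioN.techM), the helper names and the file names are assumptions made for illustration, not a convention used by Encode or required by samtools.

```python
def read_group_id(sample, bio_rep, tech_rep):
    """Hypothetical naming scheme that encodes the replicate structure."""
    return f"{sample}.bio{bio_rep}.tech{tech_rep}"


def addreplacerg_cmd(input_bam, output_bam, sample, bio_rep, tech_rep):
    """Build the argv for tagging all reads in one BAM with a read group."""
    rg_id = read_group_id(sample, bio_rep, tech_rep)
    return [
        "samtools", "addreplacerg",
        "-r", f"ID:{rg_id}",   # read group ID carried by each read's RG tag
        "-r", f"SM:{sample}",  # sample name shared by all replicates
        "-o", output_bam,
        input_bam,
    ]


cmd = addreplacerg_cmd("rep1.bam", "rep1.rg.bam", "H1_H3K27ac", 1, 2)
```

After tagging each replicate this way, the tagged files can be merged into a single BAM, and the RG tag on each read still records which replicate it came from.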