Question

Normalization scheme for read counts in upstream sequences (histone ChIP-seq)

0

8.9 years ago

Saad Khan ▴ 440

Hi,

I am comparing the 1000 bp sequences upstream of the TSS among three species (without replicates). To compare the read counts in these upstream regions with the Ka/Ks of each gene (pairwise comparisons between species), I normalized the counts in each 1000 bp upstream region using the RPKM formula.

I was wondering whether this is the best normalization I can use in the absence of replicates, or whether there are other normalization methods I should try. If so, what are they?

Regards

normalization RPKM histones • 3.1k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by Saad Khan ▴ 440

Ram · Answer 1 · 2016-02-02

0

8.8 years ago

Devon Ryan 104k

FYI, for fixed-width regions, RPKM devolves to CPM.
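To make that concrete, here is a minimal sketch using the usual definitions of CPM and RPKM. The read counts are made-up numbers for illustration only. With a fixed 1000 bp window the per-kilobase term is exactly 1, so the two values coincide; for any other fixed width they differ only by a constant that is the same for every region.

```python
def cpm(count, total_reads):
    """Counts per million mapped reads."""
    return count * 1e6 / total_reads


def rpkm(count, total_reads, region_length_bp):
    """Reads per kilobase of region per million mapped reads."""
    return cpm(count, total_reads) / (region_length_bp / 1000.0)


# Hypothetical numbers: 250 reads in one upstream window, 20 million mapped reads.
count, total = 250, 20_000_000

print(cpm(count, total))         # CPM for the window
print(rpkm(count, total, 1000))  # RPKM for a 1000 bp window: same value as CPM
print(rpkm(count, total, 500))   # RPKM for a 500 bp window: CPM times a constant 2
```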

My main concern for something like this would be GC bias differences in the samples. Did you sequence input samples for each species as well? That'd allow you to correct the GC bias (see computeGCBias and correctGCBias in deepTools).

Aside from that, I would prefer to get scaling factors from non-peak regions, or at the very least to use a more robust method (TMM/RLE/quantile normalization). I think that any of these will be preferable to RPKM/CPM, since if you have any peaks with crazy amounts of duplicates (or that were simply absurdly highly covered) then they'll skew the stats.

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.8 years ago by Devon Ryan 104k

0

How do you get scaling factors from non-peak regions? Can you please elaborate? Also, TMM/RLE and quantile normalization require replicates, AFAIK.

PS: I did not have ChIP input samples.

ADD REPLY • link 8.8 years ago by Saad Khan ▴ 440

0

Replicates aren't needed or even used in TMM, RLE, or quantile normalization. All you need are multiple samples, which can be the ones being compared.

To get scaling factors from non-peak regions, you would first call peaks and then use the counts outside of them for the normalization.
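As a rough sketch of that idea (the simplest possible version, not a specific tool's method): assume each sample has already been counted in genomic bins and that a boolean mask marks the bins overlapping called peaks. The function name, the dict layout, and the choice of scaling every sample to the mean background depth are all assumptions made for illustration.

```python
import numpy as np


def background_scale_factors(bin_counts, in_peak):
    """Scale factors from reads outside called peaks.

    bin_counts: dict of sample name -> 1D array of read counts per bin.
    in_peak:    dict of sample name -> boolean array, True where the bin
                overlaps a called peak in that sample.
    Returns a dict of sample name -> multiplicative factor that brings the
    sample's background read total to the mean background total.
    """
    background = {}
    for sample, counts in bin_counts.items():
        counts = np.asarray(counts, dtype=float)
        mask = np.asarray(in_peak[sample], dtype=bool)
        background[sample] = counts[~mask].sum()  # reads outside peaks only
    target = np.mean(list(background.values()))
    return {sample: target / total for sample, total in background.items()}


# Toy data: one strongly covered peak bin per sample is ignored.
counts = {"speciesA": [10, 12, 200, 9], "speciesB": [20, 24, 18, 500]}
peaks = {"speciesA": [False, False, True, False],
         "speciesB": [False, False, False, True]}
print(background_scale_factors(counts, peaks))
```

Because the peak bins are excluded, a handful of extremely covered peaks cannot dominate the factor the way they can with plain CPM.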

Since you lack inputs, you're going to have more work on your hands when it comes to validation.

ADD REPLY • link 8.8 years ago by Devon Ryan 104k

0

In my case, I don't have multiple samples within one species. I have different histone modifications in three species (for a particular tissue) that I am trying to compare.

ADD REPLY • link 8.8 years ago by Saad Khan ▴ 440

0

Hi Devon,

Do you have anything to add regarding my question?

Thanks

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Saad Khan ▴ 440

1

Not unless you'd like me to elaborate on something in particular.

ADD REPLY • link 8.8 years ago by Devon Ryan 104k

0

So I have only one sample for each species. In that case, what would be the most correct way to normalize: CPM or TMM?

ADD REPLY • link 8.8 years ago by Saad Khan ▴ 440

0

Is it possible to use TMM on such data and how?

ADD REPLY • link 8.8 years ago by Saad Khan ▴ 440

0

Sure:

1. Call peaks.
2. Make a BED file of some non-peak regions. Ideally there would be a good number of these (e.g., 1000 or so).
3. Run TMM on the counts in those regions.
4. Scale or weight accordingly.
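The TMM step above could be sketched roughly as follows. This is a simplified, unweighted trimmed mean of M-values written from scratch in NumPy, assuming you already have a regions-by-samples matrix of read counts in non-peak regions. It omits the precision weights that edgeR's `calcNormFactors` uses, so treat it as an illustration of the idea and use edgeR for real analyses. The function name and the simulated counts are hypothetical; the trim fractions mirror edgeR's defaults.

```python
import numpy as np


def simple_tmm_factors(counts, ref=0, trim_m=0.30, trim_a=0.05):
    """Unweighted TMM-style factors for a regions x samples count matrix.

    Simplification: plain mean of the trimmed log-ratios (no precision
    weights). Factors are rescaled so their geometric mean is 1.
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=0)                      # background library sizes
    factors = np.ones(counts.shape[1])
    for j in range(counts.shape[1]):
        keep = (counts[:, j] > 0) & (counts[:, ref] > 0)
        p_j = counts[keep, j] / lib[j]
        p_r = counts[keep, ref] / lib[ref]
        m = np.log2(p_j / p_r)                    # log fold change per region
        a = 0.5 * np.log2(p_j * p_r)              # average log abundance
        m_lo, m_hi = np.quantile(m, [trim_m, 1 - trim_m])
        a_lo, a_hi = np.quantile(a, [trim_a, 1 - trim_a])
        inside = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
        if inside.any():                          # otherwise leave the factor at 1
            factors[j] = 2.0 ** m[inside].mean()
    return factors / np.exp(np.mean(np.log(factors)))


# Simulated background counts: 1000 non-peak regions, three samples.
rng = np.random.default_rng(0)
background = rng.poisson(lam=[20, 40, 30], size=(1000, 3))
print(simple_tmm_factors(background))
```

The returned factors are meant to multiply the background library sizes to give effective library sizes, which is how edgeR uses its normalization factors.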

Alternatively, play around with scale factors until things look right.

ADD REPLY • link 8.8 years ago by Devon Ryan 104k

0

So are you saying that I should use the peak regions as sample 1 and the non-peak regions as sample 2 for TMM normalization?

ADD REPLY • link 8.8 years ago by Saad Khan ▴ 440

0

No, you would use the non-peak regions for all of the samples. TMM would give you scaling factors accordingly, which you would then need to apply to the peak counts. This is the same as how ERCC spike-ins are used in RNA-seq.
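A minimal sketch of the "apply to the peak counts" step, assuming you already have per-sample background library sizes (reads in the non-peak regions) and per-sample normalization factors. The function name and all numbers are hypothetical. The convention used here, effective library size equals library size times factor, follows edgeR.

```python
import numpy as np


def normalize_with_background(target_counts, background_lib_sizes, norm_factors):
    """Express counts in the regions of interest per million effective
    background reads.

    target_counts:        regions x samples counts (peaks or upstream windows).
    background_lib_sizes: per-sample total reads in the non-peak regions.
    norm_factors:         per-sample factors estimated on those same regions.
    """
    target_counts = np.asarray(target_counts, dtype=float)
    effective = (np.asarray(background_lib_sizes, dtype=float)
                 * np.asarray(norm_factors, dtype=float))
    return target_counts / effective * 1e6  # broadcasts across samples


# Hypothetical values: sample 2 was sequenced twice as deeply as sample 1.
upstream = [[100, 200],
            [50, 100]]
print(normalize_with_background(upstream, [1_000_000, 2_000_000], [1.0, 1.0]))
```

The key point is that the factors are estimated only on the background regions and then reused, unchanged, on the regions you actually care about.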

ADD REPLY • link 8.8 years ago by Devon Ryan 104k

0

Sorry, I am still not able to follow. I have only one sample for each histone modification in each species. So if you are saying I need to use non-peak regions for all of the samples, do you mean I should combine the non-peak regions for all the different histone modifications? Can you point me to an example?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Saad Khan ▴ 440

0

Ah, right, I'd forgotten the context of your question. You'll just have to play around with TMM a bit if you want to use it. There are no examples of this that I'm aware of, but the general idea would be to ignore the fact that your non-peak regions are in different areas for each sample and to just use counts in some fixed number of them per sample. I don't have the time to put together a long example of this, unfortunately.
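For what it's worth, one way the "fixed number of non-peak regions per sample" idea might be set up is sketched below. It draws the same number of background counts from each sample and stacks them as columns, so the result can be handed to a TMM-style routine even though the rows do not correspond to the same loci. Sorting each column so that rows pair regions of similar rank is purely an assumption made here, not something described above, and the function name and toy data are hypothetical.

```python
import numpy as np


def fixed_size_background_matrix(background_counts, k=1000, seed=0):
    """Build a k x samples matrix from per-sample non-peak region counts.

    background_counts: list of 1D arrays, one per sample, each holding read
                       counts for that sample's own non-peak regions.
    Rows do NOT represent shared genomic loci; each column is sorted so that
    row i compares regions of similar rank (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    columns = []
    for counts in background_counts:
        counts = np.asarray(counts)
        if counts.size < k:
            raise ValueError("each sample needs at least k non-peak regions")
        drawn = rng.choice(counts, size=k, replace=False)  # same k per sample
        columns.append(np.sort(drawn))
    return np.column_stack(columns)


# Toy data: three samples with different numbers of non-peak regions.
rng = np.random.default_rng(42)
per_sample = [rng.poisson(20, 1500), rng.poisson(35, 1200), rng.poisson(28, 2000)]
matrix = fixed_size_background_matrix(per_sample, k=1000)
print(matrix.shape)
```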

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Devon Ryan 104k