Question

Normalize Metagenomic Quantification Counts.

1

11.3 years ago

Phil S. ▴ 700

Hi guys,

I have a rather complex, or maybe to some of you naive, question. I got Illumina PE 100 bp sequencing data that targeted human DNA. There is a notable portion of reads that do not map to the human genome (and this was intended). I mapped those onto a custom in-house database of human pathogens, viruses, archaea, etc. This was done for three groups with several replicates each (control, mild disease, severe disease). After looking at the reads, I'm thinking about investigating the difference in species abundance between those three groups. Unfortunately the sequencing depth is not equal across samples, so I need to somehow normalize the data, which is in fact the number of reads assigned to a certain species per sample. I tried the 'usual' way of TPMT (transcripts per million transcripts), but this does not work very well, since many organisms of high interest only map with around 100-200 reads. Inferring anything statistical is not possible because every count gets so low. I tried lowering the TPMT scaling factor to, say, 1000, but I am not sure whether this is actually valid.

Do you guys have any recommendations for how to normalize such (count) data?

Thanks,

phil

metagenomics normalization • 5.0k views

ADD COMMENT • link updated 10.4 years ago by andre ▴ 30 • written 11.3 years ago by Phil S. ▴ 700

score 2 · Answer 1 · 2013-12-12

2

11.3 years ago

JC 13k

From your description I'm assuming that this is RNA-Seq data. If that is the case, you can identify the microbial species, but you can hardly relate the read counts to the population distribution, because you are quantifying transcripts, and expression levels vary widely.

Imagine two species (A and B) that both contain a gene C, but A expresses C 10X more strongly. If you use genomic amplicons (16S, whatever), you can see the proportions of A vs. B, but if you compare the transcriptome (using only C), you will see 10X more of A just because the gene is more highly expressed.
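To put numbers on the A vs. B example, here is a minimal sketch. The cell counts and per-cell expression values are made up for illustration and do not come from any real data set.

```python
# Hypothetical community: equal numbers of cells of species A and B.
cells = {"A": 100, "B": 100}

# Hypothetical expression of gene C per cell: A expresses it 10X more.
expr_per_cell = {"A": 10, "B": 1}

# Genomic view (e.g. amplicons): one gene copy per cell, so shares follow cell counts.
total_cells = sum(cells.values())
genomic_share = {sp: n / total_cells for sp, n in cells.items()}

# Transcriptomic view: reads follow cells * expression, not cells alone.
transcripts = {sp: cells[sp] * expr_per_cell[sp] for sp in cells}
total_transcripts = sum(transcripts.values())
transcript_share = {sp: t / total_transcripts for sp, t in transcripts.items()}

print("genomic share:   ", genomic_share)
print("transcript share:", transcript_share)
```

With equal cell numbers the genomic shares are equal, while the transcript shares are skewed toward A purely because of expression.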

ADD COMMENT • link 11.3 years ago by JC 13k

0

Yep, it is RNA-Seq. It's just data that was sitting around, and I wanted to try things out since we are going to sequence 16S this week. @cwarden45 counts-per-thousand seems pretty good.

edit: thank you!

ADD REPLY • link 11.3 years ago by Phil S. ▴ 700

score 1 · Answer 2 · 2013-12-12

1

11.3 years ago

Charles Warden 8.3k

Someone may be able to provide a better answer, but I would typically use count-per-million with miRNA-Seq data. So, I used count-per-thousand when analyzing some American Gut metagenomic data:

http://cdwscience.blogspot.com/2013/11/metagenomic-profiles-for-american-gut.html

This was for 16S rRNA. You might need to use something else if you are taking the leftover reads from RNA-Seq data.
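A minimal sketch of count-per-thousand (or count-per-million) scaling, assuming a simple species-to-read-count table per sample. The function name `counts_per`, the species labels, and the counts are hypothetical and only serve as an illustration.

```python
def counts_per(counts, scale=1_000_000):
    """Scale one sample's raw read counts to counts per `scale` assigned reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: n * scale / total for taxon, n in counts.items()}


# Hypothetical reads assigned per species in two samples of unequal depth.
control = {"species_A": 150, "species_B": 50, "species_C": 800}
severe = {"species_A": 600, "species_B": 100, "species_C": 1300}

# Counts per thousand assigned reads, comparable across depths.
cpk_control = counts_per(control, scale=1_000)
cpk_severe = counts_per(severe, scale=1_000)

print(cpk_control)
print(cpk_severe)
```

The scale (thousand vs. million) is only a constant multiplier applied to every sample, so it changes the units but not the relative differences between samples. It also does not add information to species with very few reads.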

ADD COMMENT • link 11.3 years ago by Charles Warden 8.3k

0

Hey, thanks. After trying out some things I am not really satisfied with what I got. Here it is: http://imgur.com/a/HvTMI . It seems like there is still some kind of bias towards one of the datasets. Especially in the boxplots, they should have an equal median, shouldn't they?

Thanks,

best Phil

ps. Happy new year..

ADD REPLY • link 11.3 years ago by Phil S. ▴ 700

0

For RNA-Seq work in general, I think the distributions look relatively similar. Ideally, the medians should be essentially the same, but sometimes this isn't the case. In cases where this potentially bothered me, I found that median centering actually made things worse in terms of concordance with positive controls and functional enrichment.

That's my 2 cents. I can't guarantee that median centering, quantile normalization, etc. won't help in your specific case, but this is at least how things worked out for me in the past.
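For reference, median centering can be sketched as below. This is a minimal illustration, assuming log2-transformed values with a pseudocount of 1. The function name `median_center` and the pseudocount are assumptions, not a description of the analysis mentioned above.

```python
import math
from statistics import median


def median_center(values, pseudocount=1.0):
    """Log2-transform one sample's values and subtract the sample median."""
    # The pseudocount (an assumption here) avoids taking log2(0) for zero counts.
    logged = [math.log2(v + pseudocount) for v in values]
    m = median(logged)
    return [x - m for x in logged]


# Hypothetical normalized counts for one sample.
sample = [0, 3, 7, 15, 31]
centered = median_center(sample)
print(centered)
```

After this step every sample has a median of zero on the log scale, which forces the boxplots to line up whether or not the underlying difference was technical.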

ADD REPLY • link 11.3 years ago by Charles Warden 8.3k

0

Thanks I'll give it a try!

ADD REPLY • link 11.2 years ago by Phil S. ▴ 700

0

When you use quantile normalization, do you combine the different expression tables into one large table and perform the normalization on that, or do you normalize each sample separately?

ADD REPLY • link 11.2 years ago by Phil S. ▴ 700

1

The point of quantile normalization is to make the signal distribution the same for each quantile of each sample (so the median, top 25%, bottom 25%, etc. correspond to the same expression level in each sample). It can't be applied to an individual sample.
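A minimal sketch of quantile normalization on a combined table, which shows why all samples are needed at once. The function name `quantile_normalize` and the values are hypothetical, and ties are handled naively here (tied values receive neighbouring rank means in input order), which established implementations treat more carefully.

```python
def quantile_normalize(table):
    """Quantile-normalize a dict of sample -> values (same features, same order)."""
    samples = list(table)
    n = len(table[samples[0]])
    if any(len(table[s]) != n for s in samples):
        raise ValueError("all samples must cover the same features")

    # Mean of the i-th smallest value across samples: the shared reference distribution.
    sorted_cols = {s: sorted(table[s]) for s in samples}
    rank_means = [sum(sorted_cols[s][i] for s in samples) / len(samples)
                  for i in range(n)]

    normalized = {}
    for s in samples:
        # Feature indices ordered from the smallest to the largest value in this sample.
        order = sorted(range(n), key=lambda i: table[s][i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized[s] = out
    return normalized


# Hypothetical values for four features in three samples.
table = {
    "s1": [5.0, 2.0, 3.0, 4.0],
    "s2": [4.0, 1.0, 6.0, 2.0],
    "s3": [3.0, 4.0, 8.0, 9.0],
}
normalized = quantile_normalize(table)
for name, values in normalized.items():
    print(name, values)
```

Every sample ends up with exactly the same set of values, only assigned to different features according to the original within-sample ranking. With a single sample the reference distribution would just be that sample itself, so nothing would change.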

ADD REPLY • link 11.2 years ago by Charles Warden 8.3k

score 0 · Answer 3 · 2014-11-26

0

10.4 years ago

andre ▴ 30

I think this article from April 2014 in PLoS Comput Biol gives good insight into what to do in this case.

ADD COMMENT • link 10.4 years ago by andre ▴ 30