Question

Compare samples from bacterial genome

0

6.0 years ago

David ▴ 240

Hello, I have run a metagenomics DNA-seq WGS (2×150 bp) experiment with 20 fecal samples (10 disease and 10 control) from different patients. I have aligned the reads back to a reference bacterial genome.

I would like to compare the control group to the disease group and see if I can find differences in the bacterial genes. A first approximation would be presence/absence, and later SNPs.

What I have done:

Aligned patient and control reads to the reference bacterial genome.
Used featureCounts (-p, counting fragments) to get counts, without counting multi-mapping reads (is that correct?).
Used the count matrix from featureCounts as input to DESeq.
Normalized the data with DESeq and compared the groups, using control vs. disease as the contrast.

Is that correct? Note that this is DNA data, not RNA data. Is the normalization appropriate in this context?

Thanks for the advice.

deseq metagenomics • 1.4k views

ADD COMMENT • link 6.0 years ago by David ▴ 240

1

If you have done all of the above, did you get anything useful that makes sense?

You have not said anything about the amount of data you had for each sample (was it equivalent?). What did the alignment percentages look like, since you seem to be using a single bacterial genome (reads from other bacteria will multi-map)? Since these patients were likely not eating the same diet, how are you accounting for those differences?

ADD REPLY • link 6.0 years ago by GenoMax 147k

0

I'm not sure what you mean by amount of data, but featureCounts reports different counts for each of the genes. I also computed the genome coverage for my reference bacterium and got more than 80x coverage for most of the samples. The bacterium occurs naturally in the gut, so I was expecting to find most of the genes. As for multi-mapping, you are right; that's why I think I should include multi-mapping reads, since many of the genes will be present in other species. The negative binomial distribution seems OK, at least judging from the DESeq control plots and estimates.

Any paper describing a similar procedure?

ADD REPLY • link 6.0 years ago by David ▴ 240

1

Did the samples have roughly the same number of reads? If some samples had 40 million reads while others had 10 million then you were not sampling at the same level. This is where @Wouter's comment below about normalization of the data becomes important. Was the alignment percentage more or less similar for all samples?

What I meant by multi-mapping was that reads from other bacteria will also map to the reference you are using since some of the genes would be conserved across bacteria. So you can't always be sure that you are counting reads that originated from the genome you are aligning to.
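The depth and alignment-percentage questions above can be checked with a few lines of Python. This is only a rough sketch: the sample names and read numbers are made up for illustration, and the 2-fold cutoff (`max_fold`) is an arbitrary assumption, not an established threshold.

```python
import statistics

# Hypothetical per-sample numbers; replace with the values from your own
# FASTQ files and aligner logs.
total_reads = {
    "ctrl_01": 40_000_000,
    "ctrl_02": 38_500_000,
    "dis_01": 10_000_000,
    "dis_02": 36_000_000,
}
aligned_reads = {
    "ctrl_01": 12_000_000,
    "ctrl_02": 11_000_000,
    "dis_01": 2_500_000,
    "dis_02": 10_500_000,
}


def depth_report(total_reads, aligned_reads, max_fold=2.0):
    """Return {sample: (percent_aligned, fold_vs_median_depth, flagged)}.

    A sample is flagged when its total read count is more than `max_fold`
    times above or below the median depth across samples.
    """
    median_depth = statistics.median(total_reads.values())
    report = {}
    for sample, total in total_reads.items():
        pct_aligned = 100.0 * aligned_reads[sample] / total
        fold = total / median_depth
        flagged = fold > max_fold or fold < 1.0 / max_fold
        report[sample] = (round(pct_aligned, 1), round(fold, 2), flagged)
    return report


report = depth_report(total_reads, aligned_reads)
for sample, (pct, fold, flagged) in report.items():
    note = "  <- very different depth" if flagged else ""
    print(f"{sample}: {pct}% aligned, {fold}x median depth{note}")
```

Samples that get flagged here, or whose alignment percentage is far from the rest, are the ones where normalization matters most.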

ADD REPLY • link 6.0 years ago by GenoMax 147k

score 0 · Answer 1 · 2018-12-10

DESeq(2) is intended for RNA-sequencing data and as such makes certain assumptions about the underlying data. For example, it assumes a negative binomial distribution, it assumes you have many features (genes), and it assumes that most of them are not altered between your conditions. If you are sure that your data fulfills these assumptions then this should be fine, but it is more likely that you will have to look in the literature to see which approaches are used by projects similar to yours.

score 0 · Answer 2 · 2018-12-11

Found an interesting paper comparing different normalization methods for WGS metagenomics data.

Comparison of normalization methods for the analysis of metagenomic gene abundance data

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5910605/

From the authors: "The methods trimmed mean of M-values (TMM) and relative log expression (RLE) had the overall highest performance and are therefore recommended for the analysis of gene abundance data. For larger sample sizes, CSS also showed satisfactory performance."

RLE is the normalization used by DESeq, and TMM is the one used by the edgeR package in R.
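For anyone wondering what RLE actually does, here is a simplified Python sketch of the median-of-ratios idea: each sample's size factor is the median, over genes, of its count divided by the gene's geometric mean across samples. This is an illustration only, not DESeq's own code; the function name and the toy counts are made up, and genes with a zero count in any sample are simply skipped.

```python
import math
import statistics


def rle_size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of gene counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    usable_genes = []   # genes with a nonzero count in every sample
    log_geo_means = []  # log of each usable gene's geometric mean
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            usable_genes.append(g)
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))

    if not usable_genes:
        raise ValueError("no gene has a nonzero count in every sample")

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - m
            for g, m in zip(usable_genes, log_geo_means)
        ]
        factors[s] = math.exp(statistics.median(log_ratios))
    return factors


# Toy data: sample B was sequenced about twice as deeply as sample A.
toy_counts = {
    "A": [10, 20, 30, 0],
    "B": [20, 40, 60, 5],
}
size_factors = rle_size_factors(toy_counts)
normalized = {
    s: [c / size_factors[s] for c in toy_counts[s]] for s in toy_counts
}
print(size_factors)
```

Because the median is taken over genes, the method relies on most genes not differing between samples, which is the same assumption mentioned in the other answer.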