Hello, I have run a metagenomics DNAseq WGS (2*150bp ) experiment with 20 feacal samples (10 disease and 10 control) from different patients. I have aligned the reads back to a reference bacterial genome.
I would like to compare the control group to the disease group and see if i can find differences on the bacterial genes. First approximation would be present/absent and later SNPs....
What i have done:
- ALign patients and controls reads to the reference bacterial genome
- Used featuresCount (-p with fragments) to get counts. Not using multimap (Is that correct ?)
- Used the output matrix from featuresCount as input to DESEq.
- Normalized data with Deseq and compared group with deseq using control vs disease as contrasts.
Is that correct ?? Note that this not RNA data but DNA data. Is the normalization approriate in such context
Thanks for advise.
If you have done all of above did you get anything useful that makes sense?
You have not said anything about the amount of data you had for all samples (was it equivalent)? What did the alignment percentages look like since you seem to be using a single bacterial genome (reads from other bacteria will multi-map)? Since these patients were likely not eating the same diet how are you accounting for those differences?
Not sure about what you mean by amount of data by featureCounts reports different counts for each of the genes. Also i computed the genome coverage for my reference bacteria and got more than 80x coverage fir most of the samples. The bacteria is naturally occuring in the gut so i was expecting to find most of the genes. As for multimapping you are right, that’s why i think i should include multimapping sinceany of the genes will be present in other species. Binomial distribution serms ok, at least from the deseq control plots and estimates.
Any paper describing a similar procedure?
Did the samples have roughly the same number of reads? If some samples had 40 million reads while others had 10 million then you were not sampling at the same level. This is where @Wouter's comment below about normalization of the data becomes important. Was the alignment percentage more or less similar for all samples?
What I meant by multi-mapping was that reads from other bacteria will also map to the reference you are using since some of the genes would be conserved across bacteria. So you can't always be sure that you are counting reads that originated from the genome you are aligning to.