Due to poor coverage, we have co-assembled within individuals (multiple samples per individual) across time points. We now want to look at the CAZyme and nitrogen-fixation profiles of each sample, for example GH family abundance per sample, so we need to recover our sample divisions. I came up with the following method for reintroducing sample divisions, where a contig can be assigned to multiple samples:
- A contig with at least 50% coverage and a depth of at least 5x in a sample is assigned to that sample
- A contig with between 25 and 50% coverage and a depth of at least 10x in a sample is assigned to that sample
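To make the rule concrete, here is a minimal sketch of how I would apply it. It assumes the thresholds are minimums, that "coverage" means the fraction of the contig covered by reads from that sample (breadth, 0 to 1), and that depth is the mean depth in that sample. The function names and the input layout are hypothetical, not taken from any tool.

```python
def is_assigned(breadth, depth):
    """Return True if a contig is assigned to a sample under the two rules.

    breadth: fraction of the contig covered by the sample's reads (0 to 1).
    depth:   mean read depth of the contig in that sample.
    Assumption: both thresholds are treated as minimums.
    """
    if breadth >= 0.50:
        return depth >= 5.0   # rule 1: at least 50% breadth and 5x depth
    if breadth >= 0.25:
        return depth >= 10.0  # rule 2: 25 to 50% breadth needs 10x depth
    return False              # below 25% breadth: never assigned


def samples_for_contig(per_sample):
    """List the samples a contig is assigned to.

    per_sample maps sample name -> (breadth, depth) for one contig.
    A contig can end up in several samples, or in none.
    """
    return sorted(
        sample
        for sample, (breadth, depth) in per_sample.items()
        if is_assigned(breadth, depth)
    )


# Hypothetical values for one contig across three samples.
example = {"S1": (0.80, 6.0), "S2": (0.30, 4.0), "S3": (0.40, 15.0)}
print(samples_for_contig(example))
```

With these made-up values the contig is assigned to S1 (rule 1) and S3 (rule 2), but not to S2.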
However, I have recently seen that people do not introduce the sample divisions so physically, but rather use coverage as a proxy for abundance: they compute contig abundance per sample and then extrapolate CAZyme gene abundance per sample from the abundance of the contig the CAZyme was found on.
Do you have any insights as to which method is best (if either)? If the second one is best, how would you approach it? I don’t fully see how contig coverage can be a proxy for abundance.
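For reference, this is how I currently read the second approach, as a rough sketch only. It assumes I already have the mean depth of every contig in every sample (from mapping each sample's reads back to the co-assembly) and a list of CAZyme hits with the contig each was found on. Each gene simply inherits the depth of its contig, and depths are summed per GH family. The function name, input layout, and example values are all hypothetical, and any normalisation between samples (for example by sequencing depth) is left out.

```python
from collections import defaultdict


def family_abundance(contig_depth, gene_hits):
    """Sum contig depth per CAZyme family and sample.

    contig_depth: contig name -> {sample name -> mean depth in that sample}.
    gene_hits:    list of (contig name, family) pairs, one per annotated gene.
    Assumption: a gene's abundance in a sample equals its contig's mean depth.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for contig, family in gene_hits:
        # Contigs with no depth record contribute nothing.
        for sample, depth in contig_depth.get(contig, {}).items():
            totals[family][sample] += depth
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {fam: dict(per_sample) for fam, per_sample in totals.items()}


# Hypothetical depths and annotations.
contig_depth = {
    "contig_1": {"S1": 12.0, "S2": 0.5},
    "contig_2": {"S1": 3.0, "S2": 8.0},
}
gene_hits = [("contig_1", "GH13"), ("contig_2", "GH13"), ("contig_2", "GH5")]

print(family_abundance(contig_depth, gene_hits))
```

Here no contig is ever discarded: a contig with low depth in a sample just contributes a small value to that sample instead of being cut off by a threshold.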