Hello.
I have 4 shotgun sequencing data of soil microbial communities. I have compiled a custom database of genes of polyciclic aromatics catabolism and mapped my raw shotgun reads to this database. So from each sample I obtained reads corresponding to PAH catabolic genes. For example:
sample 1 - 100 reads
sample 2 - 200 reads
sample 3 - 300 reads
sample 4 - 150 reads
And the question is: can I, based on the number of reads, make conclusions about the number of genes in the samples? I.e. can I say, that the sample 3 contains more PAH genes than sample 1? Also I have carried out the taxonomic analysis, but as for the genes, it's not so obvious.
Thanks.
Do you have multimapping reads, if you do then you really can't conclude anything. If the reads are mapping uniquely then you can maybe say that that subset of genes appears to be expressed (ones that have reads aligning to them) but you can't conclude anything about number of genes.
You are also aligning to a custom (reduced representation) database so aligners will try and align reads to genes in your database even if they may not have originated from those genes.
Assuming that you have biological replicates, the best way to answer your question in literally count how many PAH catabolic genes you have in your samples. For each sample, assemble your reads in contigs with a metagenomic assembler of your choice > run prodigal in metagenome mode to detect the protein coding genes > use the protein sequence in eggNOG to identify which proteins have a KO (KEGG Orthology) ID associated with the degradation of PAH > count how many PAH KO you have in each sample:
The numbers indicates the amound of CDS, e.g. K11943, identified in your samples