Question

identification of catabolic genes in soil metagenome shotgun data

0

Entering edit mode

4.1 years ago

mewgia • 0

Hello.

I have 4 shotgun sequencing data of soil microbial communities. I have compiled a custom database of genes of polyciclic aromatics catabolism and mapped my raw shotgun reads to this database. So from each sample I obtained reads corresponding to PAH catabolic genes. For example:

sample 1 - 100 reads

sample 2 - 200 reads

sample 3 - 300 reads

sample 4 - 150 reads

And the question is: can I, based on the number of reads, make conclusions about the number of genes in the samples? I.e. can I say, that the sample 3 contains more PAH genes than sample 1? Also I have carried out the taxonomic analysis, but as for the genes, it's not so obvious.

Thanks.

shotgun metagenome • 1.2k views

ADD COMMENT • link updated 4.1 years ago by andres.firrincieli 3.9k • written 4.1 years ago by mewgia • 0

0

Entering edit mode

based on the number of reads, make conclusions about the number of genes in the samples?

Do you have multimapping reads, if you do then you really can't conclude anything. If the reads are mapping uniquely then you can maybe say that that subset of genes appears to be expressed (ones that have reads aligning to them) but you can't conclude anything about number of genes.

You are also aligning to a custom (reduced representation) database so aligners will try and align reads to genes in your database even if they may not have originated from those genes.

ADD REPLY • link 4.1 years ago by GenoMax 151k

0

Entering edit mode

Assuming that you have biological replicates, the best way to answer your question in literally count how many PAH catabolic genes you have in your samples. For each sample, assemble your reads in contigs with a metagenomic assembler of your choice > run prodigal in metagenome mode to detect the protein coding genes > use the protein sequence in eggNOG to identify which proteins have a KO (KEGG Orthology) ID associated with the degradation of PAH > count how many PAH KO you have in each sample:

K14579: naphthalene 1,2-dioxygenase subunit alpha   7   8   9   15
K11943: PAH dioxygenase large subunit   32  8   13  8
...

The numbers indicates the amound of CDS, e.g. K11943, identified in your samples

ADD REPLY • link 4.1 years ago by andres.firrincieli 3.9k

score 0 · Answer 1 · 2021-04-24

0

Entering edit mode

4.1 years ago

lmlukoseviciute ▴ 60

I think If you knew for example how much colony forming units you have sequenced in each sample then you can scale these values for example genes per 100 CFU, or something like this. But since the numbers here are quite close together I don't think you would find statistical difference. I suppose for that would need replicates for each sample.