I'm quite new to bioinformatics and am currently running into a dilemma about how to approach my re-mapping.
I have successfully assembled my FASTQ files (reads) and now have my contigs in FASTA files.
From these FASTA files, I want to predict genes (with Prodigal) and find the depth (abundance) of the predicted genes. However, I am unsure whether I should re-map my contigs back to my reads first, or whether I should predict genes first and then re-map the predicted genes back to the reads. As I see it, by re-mapping the predicted genes I risk losing some of the reads that do not fall entirely within a gene. By re-mapping the contigs, on the other hand, I get a depth for all the reads on the contig, but will then have to correlate that data to the predicted genes somehow. Do you have any suggestions or preferences? And if you do, why would you choose one over the other?
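For reference, the re-mapping step discussed below goes reads-to-contigs. Here is a minimal sketch of that step, wrapping bwa and samtools from Python; the file names are placeholders and the flags are from memory, so check them against the tools' help output:

```python
# Map reads back to the assembled contigs and produce a sorted, indexed BAM.
# Placeholder file names; check `bwa mem` and `samtools sort/index` help for flags.
import subprocess

subprocess.run(["bwa", "index", "contigs.fa"], check=True)

# bwa mem writes SAM to stdout; pipe it straight into samtools sort
bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "4", "contigs.fa", "reads_1.fastq", "reads_2.fastq"],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-o", "aln.sorted.bam", "-"],
    stdin=bwa.stdout,
    check=True,
)
bwa.stdout.close()
if bwa.wait() != 0:
    raise RuntimeError("bwa mem failed")

subprocess.run(["samtools", "index", "aln.sorted.bam"], check=True)
```

The resulting aln.sorted.bam is what the per-gene depth sketches further down assume as input.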
Can you explain a bit further? There may be some terminology confusion here.
Firstly, what exactly do you mean by depth/abundance of genes? Is this the copy number of the genes, the number of reads corresponding to those sequences, etc.? "Depth" of gene coverage would only really make sense in the context of an RNA-seq/transcriptomic experiment.
Secondly, don't map your contigs to your reads. That'll take forever. You map reads to contigs, but there may not be any need to do this.
Thirdly, I would suggest using Prokka or something similar (a workflow) rather than 'bare' Prodigal.
Fourthly, I would suggest you predict genes first; then it's not super hard to work out which reads map within the coordinate boundaries of each gene. I'm not sure if there's a ready-made tool for this, but samtools and bedtools would probably get you most of the way there.
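To make that concrete, here is a minimal sketch of the "predict genes first, then summarise mapped reads per gene" idea. It assumes you already have a sorted, indexed BAM of reads mapped to the contigs (aln.sorted.bam, as above) plus Prodigal's GFF output (genes.gff), and it uses pysam (the Python wrapper around samtools/htslib) instead of the command-line tools; the file names are placeholders.

```python
# Per-gene mean depth from a reads-vs-contigs BAM and a Prodigal GFF.
# Assumes pysam is installed and the BAM is coordinate-sorted and indexed.
import pysam

def gene_mean_depths(bam_path, gff_path):
    depths = {}
    bam = pysam.AlignmentFile(bam_path, "rb")
    with open(gff_path) as gff:
        for line in gff:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "CDS":
                continue
            contig = fields[0]
            start, end = int(fields[3]) - 1, int(fields[4])   # GFF is 1-based, inclusive
            gene_id = fields[8].split(";")[0]                 # first attribute, e.g. "ID=1_1"
            # count_coverage returns four per-base arrays (A, C, G, T counts)
            acgt = bam.count_coverage(contig, start, end)
            per_base = [sum(base_counts) for base_counts in zip(*acgt)]
            depths[gene_id] = sum(per_base) / max(len(per_base), 1)
    bam.close()
    return depths

if __name__ == "__main__":
    for gene, depth in sorted(gene_mean_depths("aln.sorted.bam", "genes.gff").items()):
        print(gene, round(depth, 2))
```

(bedtools coverage over a BED file of the gene coordinates should get you much the same numbers without any Python.)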
Thanks for your answer!
What I have is a metagenomics sample. I want to look at the various genes (from bacteria) in the sample. I have assembled my contigs and now want to predict the genes from the contigs. But I also want to figure out the depth of the genes (how strongly each gene is represented in the metagenomics sample). By that, I guess I mean how many reads cover each position of the gene, and then some sort of mean of those per-position depths, which would then be my gene depth - correct me if I'm wrong. If there's a better way of looking at the quantity of each of the genes, I'd be happy to know!
And sorry, my bad, I meant map reads to contigs in that phrase. I will take a look at Prokka too, thanks.
If I predict genes first, will some of the reads not be "lost" if they do not lie entirely within a gene?
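On the "lost reads" worry: if you map the reads to the contigs (rather than to extracted gene sequences) and only summarise over the gene coordinates afterwards, reads that partially overlap a gene still count for the positions they do cover. A minimal sketch, again assuming pysam, a sorted/indexed reads-vs-contigs BAM, and placeholder contig name/coordinates:

```python
# Count reads overlapping one gene interval, including partial overlaps.
# "NODE_1" and the coordinates are placeholders; take them from your Prodigal output.
import pysam

bam = pysam.AlignmentFile("aln.sorted.bam", "rb")
contig, gene_start, gene_end = "NODE_1", 1000, 2500   # 0-based, half-open

total = partial = 0
for read in bam.fetch(contig, gene_start, gene_end):   # fetch returns any overlapping read
    if read.is_unmapped:
        continue
    total += 1
    # reads whose alignment sticks out past either gene boundary
    if read.reference_start < gene_start or read.reference_end > gene_end:
        partial += 1

print(f"{total} reads overlap the gene; {partial} of them only partially")
bam.close()
```

Only if you mapped directly against the extracted gene sequences could reads hanging over the gene ends be penalised or dropped.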
What you're describing is really a pangenome study, I think.
Are you really interested in the actual frequency of the gene, or just in how many organisms in the sample have it (and/or have it more than once)?
i.e. if there are a billion E. coli cells in your sample, do you want to know that there are 2 billion copies of that gene, or do you just want to know how many strains of E. coli have one or more copies?
I want to know the frequency of the genes, so that I can do some statistical analysis on the genes (gene distribution in percentages etc.). Actually, I'm not that interested in which bacteria the genes come from. So it doesn't really matter if the gene is present in different kinds of bacteria; it's just how strongly that gene is represented in the entire sample.
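If a percentage distribution is the end goal, one simple option is to turn per-gene mean depths into relative abundances within each sample. A tiny sketch (pure Python, assuming a dict of gene to mean depth like the one produced in the earlier sketch):

```python
# Convert per-gene mean depths into a percentage distribution for one sample.
# Mean depth is already length-normalised, so long genes are not over-weighted.
def depth_percentages(mean_depths):
    total = sum(mean_depths.values())
    if total == 0:
        return {gene: 0.0 for gene in mean_depths}
    return {gene: 100.0 * depth / total for gene, depth in mean_depths.items()}

example = {"geneA": 120.0, "geneB": 30.0, "geneC": 50.0}   # made-up depths
print(depth_percentages(example))   # {'geneA': 60.0, 'geneB': 15.0, 'geneC': 25.0}
```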
Are you interested in all variants separately, or are you intending to threshold some similarity scores?
i.e. how different can the sequence be in different mutants and strains before it's no longer 'the same gene'?
Can you also tell us how you assembled all the genomes? I think this is a more substantial task than you (or I) perhaps realised initially - it's quite a tricky problem with lots of potential pitfalls.
That's of course something I must consider. I will definitely have to have a threshold, as I am basically only interested in genes with roughly the same function.
The reads were assembled with metaSPAdes. I haven't binned the contigs yet, as I am not sure it would be relevant. Yes, I think so too.
"I want to know the frequency of the genes, so that I can make some statistical analysis on the genes (gene distribution in percentage etc.)"
Are you only interested in knowing how many copies of, say, GAPDH (as one example among the hundreds of other genes you will find) there are in the assembly you created and then predicted genes on? That should not need read mapping.
In a metagenomic study like this, similar reads from multiple organisms (and thus similar genes) will collapse into fewer assembled copies than are really present.
Well basically yes. The number of all the genes in the sample. How would you approach it?
I am not too worried about that, as I just want a rough estimate. I'm comparing different samples, which have had something done to them, so I guess that will even itself out somehow.
I think the assemble-and-remap approach is fine (this is how you'd do transcriptomics). However, since you're only getting one assembly per set of reads for any given genome, I think what you may need to try is a very strict assembly (few if any mismatches allowed). This is then treated as a reference genome for all the variants of that gene (up to some limit).
Are you binning the genomes first? Can you tell us more about how you did the assembly?
You will then have to do some fairly relaxed remapping, but potentially normalised to ensure complete coverage of the entire gene (i.e. if you have 300X coverage of the 3' end, but only 15X coverage of the 5' end, this could be artifactual - a quick check is sketched below).
You could also extract all genes from all of the genomes and build a gene database, which you could potentially align the reads back against. Though I'm mostly just thinking aloud at this point, so not 100% sure it'd work.
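Here is the quick coverage-evenness check mentioned above (the 300X 3' end vs 15X 5' end situation), as a minimal sketch that reuses the per-base depth list from the earlier pysam example; the toy numbers are made up:

```python
# Flag genes whose coverage is heavily skewed towards one end,
# which can point to an assembly or mapping artifact.
def coverage_skew(per_base_depths):
    """Ratio of mean depth in the deeper half of the gene to the shallower half."""
    mid = len(per_base_depths) // 2
    first_half, second_half = per_base_depths[:mid], per_base_depths[mid:]
    mean_first = sum(first_half) / max(len(first_half), 1)
    mean_second = sum(second_half) / max(len(second_half), 1)
    low, high = sorted([mean_first, mean_second])
    return high / low if low > 0 else float("inf")

toy_gene = [15] * 500 + [300] * 500    # ~15X at one end, ~300X at the other
print(coverage_skew(toy_gene))         # 20.0 - worth a closer look
```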
I think there may be some residual confusion, but if we're all on the same page, it does sound like a pangenome analysis problem.
If you use a metagenome-aware assembler to retrieve representatives of every genome in each of your samples, you can then run them through an annotator such as Prokka (if bacterial).
Take all of your annotated genomes and screen them with pangenome analysis tools such as JustOrthologs or Roary, and then you'll get a count of how many genes there are in a given orthologue group. I would suggest you allow for paralogues within orthologue groups to get a full count. Your count of genes is then simply how many entries there are in that orthologue group (a counting sketch follows below).
I'm not familiar with every pangenome/orthology tool, but Roary, for instance, is only useful for quite closely related things, which I suspect won't be the case in your metagenomes. You will most likely need to look for an orthology tool that can work on distantly related things. I remember some do exist, but the names all escape me now.
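Whichever orthology tool ends up doing the clustering, the counting step itself is trivial once you can export a table mapping each predicted gene to its orthologue group (e.g. something derived from Roary's gene_presence_absence.csv). A minimal, tool-agnostic sketch assuming a two-column, tab-separated file gene_to_group.tsv (gene_id, group_id) - the file name and layout are hypothetical:

```python
# Count how many gene entries fall into each orthologue group.
# Input: a hypothetical two-column TSV "gene_id<TAB>group_id" exported from
# whichever pangenome/orthology tool you end up using.
from collections import Counter

def group_sizes(tsv_path):
    counts = Counter()
    with open(tsv_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            gene_id, group_id = line.rstrip("\n").split("\t")[:2]
            counts[group_id] += 1
    return counts

if __name__ == "__main__":
    for group, n in group_sizes("gene_to_group.tsv").most_common(20):
        print(group, n)
```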
I think you might be right. What exactly do you mean by "representatives"?
So you would disregard the idea of re-mapping altogether?
And thanks for your time!
Representatives in this context meaning “one E. coli sequence that serves as a reference for all reads that are >n amount similar”.
You can still use the reads to find SNP variants or something (it might give you more resolution between specific gene sequences), but whether this is needed or not depends on how 'inclusive' you want the gene clustering to be (i.e. are you interested in every single variant of every single gene, or do you only care if they're more different than some threshold?). If you don't need to know every single possible variant, you won't need to remap all the reads.
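If you go the threshold route rather than tracking every single variant, one common way to collapse the predicted gene sequences at a chosen identity cutoff is CD-HIT-EST. A minimal sketch wrapping the command line from Python; the file names are placeholders and the flags are from memory, so check cd-hit-est -h before relying on them:

```python
# Cluster predicted gene sequences at ~95% nucleotide identity with CD-HIT-EST,
# so that near-identical variants are treated as "the same gene".
# Placeholder file names; flags from memory - verify with `cd-hit-est -h`.
import subprocess

subprocess.run(
    [
        "cd-hit-est",
        "-i", "genes.fna",    # nucleotide gene sequences, e.g. from `prodigal -d genes.fna`
        "-o", "genes_nr95",   # representative sequences; clusters go to genes_nr95.clstr
        "-c", "0.95",         # identity threshold: your "still the same gene" cutoff
        "-n", "10",           # word size suitable for thresholds around 0.95
        "-d", "0",            # keep full sequence names in the .clstr file
    ],
    check=True,
)
```

Cluster sizes from the .clstr file can then feed the counting step sketched above.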
Re: "I guess that will even itself out somehow" - possibly, assuming you are using roughly equal numbers of reads for the assemblies and the sample/library prep was done in an identical manner. Otherwise, no.