Hello everyone, I am currently working on my final master's project, which focuses on antibiotic resistance genes in oceanic metagenomic samples. I am facing a challenge because I am working on Amazon WorkSpaces, which has limited computational capacity. So far, using MEGAHIT for contig assembly, I have been able to assemble FASTA files from approximately 10 million reads without the instance crashing (I am testing MEGAHIT's limits on this WorkSpace, which has 2 CPU cores and 8 GB of RAM, so I might be able to assemble a few million more reads). However, the dataset I am working with is much larger: even a small sample consists of around 70 million reads.
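Since only a subset of each sample fits in memory, one thing worth noting is that a uniform random subset of the reads is usually more representative than just the first 10 million lines of the file. As a minimal sketch (not your actual pipeline; the file name is hypothetical), reservoir sampling lets you draw a fixed-size random subset from a FASTQ stream without ever holding the whole file in RAM:

```python
import random

def reservoir_sample(records, k, seed=0):
    """Keep a uniform random sample of k records from a stream
    (Algorithm R), so the whole file never has to fit in memory."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = rec
    return sample

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            break
        yield (header, handle.readline(), handle.readline(), handle.readline())

# Usage sketch (hypothetical file name):
# with open("sample.fastq") as fh:
#     subset = reservoir_sample(read_fastq(fh), 10_000_000)
```

In practice a dedicated tool (e.g. `seqtk sample`) does the same job faster, but the idea is the same: sample uniformly, then assemble the sample.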
The issue I am having is that, although I know I am only analyzing 10 million of the 70 million reads in this example (and other samples have up to 300 million), I haven't identified any resistance genes yet. I understand that to some extent, but it surprises me that I have found nothing at all in 10 million reads. Currently I am using AMRFinder to detect resistance genes, and I would like to ask for your recommendation on an alternative program that can take the FASTA file produced by MEGAHIT and identify resistance genes in metagenomic samples. I have been considering deepARG, which is the program used in the original study, and for plasmids I use PlasmidFinder. What are your thoughts on this? Thank you.
Understood. The data I'm using come from a study of samples from various areas of the ocean, in which the authors detected antibiotic resistance genes (ARGs), which is why I'm trying to find them too. The idea is to find bacteria carrying ARGs in samples near the coast, where human presence means much of our waste ends up in the sea; that is the hypothesis. The study from which I extracted the data found ARGs in many locations, but it's true that they had a vast amount of data and more resources than I do. I suppose I will reconsider the whole approach. Thank you very much.
If they already found ARGs, do you have the sequences of these genes? This may be a dumb question, but do you expect any of the reads in your data set to map to these genes (or to any known ARG)? I'm thinking of some kind of positive control. If you have the data set they used to find ARGs, there should be reads in that data set that map to ARGs at some rate. Given their full data set, you could estimate what that rate should imply for your 10 M-read subset. Simple read mapping doesn't take many resources.
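To make that rate estimate concrete: if the published data set has some number of ARG-mapping reads out of its total, a random 10 M subset should contain a proportional share, and a Poisson approximation tells you how likely "zero hits" is by chance. A small sketch, with purely illustrative numbers (the 1,000 ARG-mapping reads are an assumption, not a figure from the study):

```python
import math

def expected_hits(arg_reads_full, total_reads_full, subset_reads):
    """Expected ARG-mapping reads in a random subset, assuming the
    same per-read mapping rate as in the full data set."""
    rate = arg_reads_full / total_reads_full
    return rate * subset_reads

def prob_zero_hits(expected):
    """Poisson approximation: probability of seeing no ARG-mapping
    reads at all in the subset."""
    return math.exp(-expected)

# Illustrative only: suppose the full data set had 1,000 ARG-mapping
# reads out of 300 million, and you subsample 10 million reads.
exp_hits = expected_hits(1_000, 300_000_000, 10_000_000)
print(exp_hits)              # ~33 expected hits in the subset
print(prob_zero_hits(exp_hits))  # vanishingly small
```

If the expected count under their rate is well above zero but your subset yields nothing, that points at the pipeline (assembly parameters, gene-detection thresholds) rather than at the subsample being too small.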