Hello all,
Introduction
I am a recent graduate working for a Public Health Laboratory. I'm relatively new to bioinformatics, and most of what I know is based around NGS analysis. My lab director loves to challenge me. He wants to know different ways NGS (we have a MiSeq) could be implemented in our lab as an outbreak investigation tool.
I am aiming to do a study of Carbapenem-Resistant Enterobacteriaceae (CRE). The main goal is to be able to receive a CRE sample and use NGS to detect the genes (beta-lactamases) responsible for the resistance. The idea is to be able to run a quick analysis while also compiling genetic information that could be used to connect the dots in an outbreak investigation (phylogeny).
What I Already Know
- The genes that I am looking for can be found within the bacterial chromosome, or within its plasmids
- Primers need to be designed for each gene.
The Actual Questions
First: If I take the whole-genome-sequencing approach, how can I tell which parts of my output (FASTQ) come from plasmids and which come from the chromosome?
Second: If I didn't want to do whole-genome, would it be possible to only sequence the genes that I'm looking for (if they are there)? And if so, how would I do it?
Open for Discussion
If anyone has any suggestions, solutions, or wishes to point me in a direction where I can learn more, please let me know. It would be a huge help, and is deeply appreciated.
Edit: Solution Found
Thanks to those who commented before, I now have a better understanding of how this all works. It also put me on a path to find an example of how this type of experiment is done in a clinical laboratory. You can find the study here.
One of my co-workers tried using machine learning to distinguish between plasmids and main genome based on the genes present after annotation, with some degree of success. This is much easier on assembled contigs than raw reads, which are usually too short for annotation. It should also be theoretically possible to analyze the graph structure during assembly to determine which contigs are co-located and the size of the chromosome they are located on. This can also be done after the fact using a graph file that some assemblers produce.
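To make the graph idea a bit more concrete, here is a minimal sketch (not the unfinished tool mentioned above) that reads a GFA1 assembly graph, groups contigs that are linked to each other, and sums their lengths. The function names and the toy graph are made up for illustration. It assumes the assembler writes standard `S` (segment) and `L` (link) lines, optionally with an `LN:i:` length tag. Small components that are separate from the large one are only plasmid candidates; in real assemblies plasmids can stay tangled with the chromosome through shared repeats, so treat the output as a hint rather than a classification.

```python
from collections import defaultdict

def parse_gfa(lines):
    """Collect segment lengths and undirected links from GFA1 text lines."""
    lengths = {}
    adj = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            name, seq = fields[1], fields[2]
            length = 0 if seq == "*" else len(seq)
            # An LN:i: tag, when present, gives the length explicitly.
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    length = int(tag[5:])
            lengths[name] = length
        elif fields[0] == "L":
            a, b = fields[1], fields[3]
            adj[a].add(b)
            adj[b].add(a)
    return lengths, adj

def components(lengths, adj):
    """Return (total_length, contig_names) per connected component, largest first."""
    seen = set()
    result = []
    for start in lengths:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        total = sum(lengths.get(m, 0) for m in members)
        result.append((total, sorted(members)))
    return sorted(result, reverse=True)

# Toy graph (hypothetical contig names and lengths): two linked
# chromosome contigs and one small contig that links only to itself.
example_gfa = [
    "\t".join(["S", "c1", "*", "LN:i:4800000"]),
    "\t".join(["S", "c2", "*", "LN:i:200000"]),
    "\t".join(["S", "p1", "*", "LN:i:60000"]),
    "\t".join(["L", "c1", "+", "c2", "+", "0M"]),
    "\t".join(["L", "p1", "+", "p1", "+", "0M"]),
]

for total, names in components(*parse_gfa(example_gfa)):
    print(total, names)
```

On the toy graph this reports one component of 5,000,000 bp (`c1`, `c2`) and one of 60,000 bp (`p1`). The self-link on `p1` is how a circular contig typically shows up in a graph file.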
You can certainly try selectively amplifying the genes in question with the correct primers, but I think WGS is probably simpler and more robust. You can assemble the reads and then compare the contigs to your genes of interest, or simply map the raw reads to those genes; either works. The MiSeq has sufficient capacity to sequence 30+ bacterial isolates per run at 40x coverage, depending on the run mode (that's in 24 hours at 2x150 bp).
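The capacity estimate is simple arithmetic, and it is worth redoing for your own kit and organisms. A rough sketch follows; the 5 Mb genome size and the 6 Gb run yield are assumptions chosen for illustration, not MiSeq specifications, so substitute the yield quoted for your run mode. The function names are made up.

```python
def bases_needed(genome_size_bp, target_coverage):
    """Sequenced bases required to cover one genome at the target depth."""
    return genome_size_bp * target_coverage

def isolates_per_run(run_yield_bp, genome_size_bp, target_coverage):
    """Whole number of isolates that fit in one run, ignoring index imbalance and QC loss."""
    return run_yield_bp // bases_needed(genome_size_bp, target_coverage)

GENOME_SIZE = 5_000_000      # assumed ~5 Mb Enterobacteriaceae genome
COVERAGE = 40                # target depth from the post above
RUN_YIELD = 6_000_000_000    # assumed usable yield; check your kit's figure

print(bases_needed(GENOME_SIZE, COVERAGE))                 # 200000000
print(isolates_per_run(RUN_YIELD, GENOME_SIZE, COVERAGE))  # 30
```

In practice libraries never pool perfectly evenly, so it is sensible to plan for somewhat fewer isolates than this number.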
I'm starting my master's program soon, and machine learning is something I am really interested in learning. Are there any resources you could recommend on the subject? Also, based on your response (as well as Harold's), it seems like WGS is what will make the most sense. Selectively amplifying genes is something I might try later down the line, but I know I'm just not there yet.
Thanks for the response; it has put me on a path to keep learning.
@Brian do you have a link to the tool? I'm interested in attempting a similar classification problem, so I would like to see the approach.
No, the tool was never finished or made public, sorry. I will ask my co-worker about the status and results, though, and report back if there's anything interesting to note.