I am new to bioinformatics and have a question about using the Sequence Read Archive (SRA) for taxonomic classification, to increase the fraction of classified reads obtained from a sample sequenced with WGS. I am not sure whether this is a good idea, or even possible, since the SRA is more of an archive after all; it seems most people download single sequences from the SRA to use in their experiments.
Here is a general explanation of the class project I am working on for an intro bioinformatics class: I have metagenomic data from one seabird gut microbiome sample that was sequenced on an Illumina MiSeq in paired-end mode (300+300 bp reads with ~100-200 bp overlap, so approx. 400-500 bp after merging). It was already analyzed using this workflow: https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-Standard-Operating-Procedure-v3. That workflow uses Kraken2 + Bracken with the NCBI Complete RefSeq database. Only around 5% of the reads came back classified, which is extremely low!
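For reference, this is roughly how the classified percentage can be recomputed from Kraken2's standard per-read output, where each line is tab-separated and the first field is `C` (classified) or `U` (unclassified). This is only a minimal sketch: the function name `classified_fraction` and the toy records are made up for illustration and are not part of the Microbiome Helper workflow.

```python
def classified_fraction(lines):
    """Return the fraction of reads marked 'C' in Kraken2 standard output.

    Each line is expected to be tab-separated, with the first field being
    'C' (classified) or 'U' (unclassified). Other lines are skipped.
    """
    classified = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        status = line.split("\t", 1)[0]
        if status not in ("C", "U"):
            continue  # ignore anything that is not a per-read record
        total += 1
        if status == "C":
            classified += 1
    return classified / total if total else 0.0


# Hypothetical toy records (read IDs, taxids and k-mer fields are invented).
toy_records = [
    "C\tread1\t562\t150|150\t562:116 |:| 562:116",
    "U\tread2\t0\t150|150\t0:116 |:| 0:116",
    "U\tread3\t0\t150|150\t0:116 |:| 0:116",
    "U\tread4\t0\t150|150\t0:116 |:| 0:116",
]

fraction = classified_fraction(toy_records)
print(f"{fraction:.1%} of reads classified")
```

On a real run, the same function could be fed an open file handle for the file written by Kraken2's `--output` option instead of the toy list.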
So for the class project I want to try to increase the number of classified reads for this sample. On the advice of my instructor, I was going to try using the Sequence Read Archive as an alternative database to classify the reads taxonomically, since that archive contains far more data and would probably give a higher classification rate. Is this even possible to do?
Note: the GitHub page linked above recommends using their VirtualBox VM (Microbiome Helper VBox) to run small-scale metagenomic analyses on a laptop, so I started my analysis by running through the code provided there. Now that I've reached the step where I have to classify the reads against a new reference database, I'm stuck.
Seeing as the SRA is such a large database, I was trying to learn how to download its data through AWS, but even if that is possible, I would have to do it from within their virtual machine in order to run their code. Since I am so new to this, I am having trouble understanding how to proceed. If anyone has any advice, I would really appreciate it. Thank you!
Oops, sorry - it is Illumina NextSeq, not MiSeq. The Kraken2 Complete RefSeq database was the one originally used, so it was the larger database, not MiniKraken.
Thank you for your answer!