I was recently given 4 paired-end .fastq files (each read is about 150 bases) extracted from dolphin stomach, each suspected to be a different strain of bacteria. I am being asked to analyze these datasets. The researchers expect these 4 samples to be from the same genus (Helicobacter). One sample maps fairly consistently to H. cetorum MIT 00-7128 across several tools (BLAST, sourmash, Kraken2, etc.). However, the other three samples do not map as well.
I recently wanted to use DIAMOND to further investigate these data, as was also suggested to me in a previous post. I am unable to run DIAMOND locally due to space constraints. However, I recently attempted to do so using Galaxy. I did the following (both "Diamond makedb" and "Diamond" were under the "Metagenomics analysis" tab):
1) Ran "Diamond makedb" on the raw reads (Sample1_R1.fasta)
2) Ran "Diamond" with the following fields:
a) What do you want to align? (I selected "Align DNA query sequences (blastx)")
b) Input query file in FASTA or FASTQ format (I input Sample1_R1.fasta)
c) Will you select a reference genome from your history or use a built-in index? (I selected "Use one from the history" and input the makedb output from Step 1)
I did not change any of the defaults. Most notably, this meant I used the "Standard code". Unless I am reading the output incorrectly, I did not have any hits. There was no error in "stderr". And in "stdout", it read: "Reported 0 pairwise alignments, 0 HSPs. 0 queries aligned."
I am a bit surprised about this, especially for the first sample which had mapped reasonably well in other software. I am not too experienced working with bacterial data and/or metagenomic data and wanted to seek advice from people with more experience: Is there an aspect of my pipeline that could be causing the zero alignment rate that you may recommend changing to analyze the data more efficiently with DIAMOND? Thank you for sharing your ideas.
Aligning your own reads against a database built from those same reads with DIAMOND is not going to be beneficial. In blastx mode, DIAMOND translates the DNA queries and aligns them against a protein database, so a database made from nucleotide reads will not produce meaningful hits. You will need to use one of the NCBI sequence databases, preferably nr, since you are interested in identifying what is in your data. That means you are aligning against all sequences. I doubt Galaxy will allow you to do this, so you will need to find local hardware. The nr database is large, and it will take a while to create a DIAMOND index for it (you will need to make your own) and then align your data. Think on the order of hours for each operation, depending on the hardware you have access to.

As a test, you could upload the Helicobacter cetorum proteome to Galaxy, make a DIAMOND index from it, and then align your own fastq data to it.
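
If you do get access to a machine with DIAMOND installed, the same test could be run from the command line. Below is a minimal sketch in Python that only assembles the two commands (index the proteome, then align the reads with blastx) and prints them by default. The file names `H_cetorum_proteome.faa`, `Sample1_R1.fastq`, the database prefix `hcetorum`, and the output name are placeholders I made up for illustration, and the helper functions are hypothetical, not part of DIAMOND.

```python
import shlex
import subprocess


def diamond_commands(proteome_faa, reads_fastq, db_prefix, out_tsv):
    """Return the makedb and blastx command lines as argument lists."""
    # Build a DIAMOND database from a *protein* FASTA file.
    makedb = ["diamond", "makedb", "--in", proteome_faa, "--db", db_prefix]
    # Translate the DNA reads and align them to the protein database,
    # writing BLAST-style tabular output (format 6).
    blastx = [
        "diamond", "blastx",
        "--db", db_prefix,
        "--query", reads_fastq,
        "--out", out_tsv,
        "--outfmt", "6",
    ]
    return makedb, blastx


def run_test_alignment(proteome_faa, reads_fastq,
                       db_prefix="hcetorum",
                       out_tsv="sample1_vs_hcetorum.tsv",
                       dry_run=True):
    """Print the commands; execute them only when dry_run is False."""
    commands = diamond_commands(proteome_faa, reads_fastq, db_prefix, out_tsv)
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # Requires the diamond executable on PATH.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder file names; replace with your own paths.
    run_test_alignment("H_cetorum_proteome.faa", "Sample1_R1.fastq")
```

With `dry_run=True` nothing is executed, so you can check the commands first. The same two steps against nr would only differ in the protein FASTA given to `makedb`, but expect them to take far longer.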