Hello! I am sorry if this question seems silly, but I wanted to ask if there is a way to limit the results that I get from running the BLAST command on my samples.
I am trying to determine the microbial composition of environmental samples by sequencing 16S amplicons.
After trimming for the 16S region and doing QC, I converted the FASTQ files to FASTA and ran blastn against NCBI's 16S Ribosomal RNA database.
After that, I import the blastn files into MEGAN and start my analysis. This is so that I can then inspect/extract the reads associated with the species/genus later.
My question comes from the fact that running thousands of reads through the blastn program leads to VERY large files: a run of about 100,000 reads returns files that are more than 250 GB.
The command that I used is as follows:
blastn -db ~/NCBIdb/16S_ribosomal_RNA -query query.fasta -num_threads 12 -out query.fasta.blastn
I tried the -max_target_seqs option with a value of 100 and compared it to the default of 500, and I noticed very big changes in the bacterial composition of my sample.
This led me down the rabbit hole, with Shah et al. and the NCBI team, and a whole lot of other searching, but I still could not find out whether using the option is advisable or not.
Thus, I was wondering if anyone had tried doing the same thing; is it better to stick to the default 500 or go for a different value? I assumed that the -max_target_seqs option would give me the best hit out of the whole database, but it seems to not be the case. Or is there another way to reduce the computational load and file size of the result? Because I have about 130 samples, all with more than 50,000 reads each.
Thank you in advance,
Adham
Edit: Added some information in an attempt to make it clearer.
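On the file-size question: the command above writes blastn's default pairwise report, which prints full alignments and is far bulkier than the tabular format (-outfmt 6). Below is a minimal sketch, assuming a tabular file is acceptable downstream (worth checking that your MEGAN version imports it) and that blastn lists each query's hits best-first. The function names, the .tsv suffix and the row cap are illustrative, not from the original post.

```shell
# Hypothetical helper: same search as in the post, but written as
# tab-separated rows and limited to one HSP per query/subject pair.
run_blastn_tabular() {
    # $1 = BLAST database path, $2 = query FASTA
    blastn -db "$1" -query "$2" -num_threads 12 \
        -outfmt 6 -max_hsps 1 \
        -out "$2.blastn.tsv"
}

# Hypothetical helper: keep only the first N rows for each query ID
# (column 1 of the tabular output). This trims the file after the
# search; it does not change what blastn itself searches or reports.
top_hits_per_query() {
    # $1 = N, $2 = tabular BLAST file
    awk -v n="$1" -F'\t' '++seen[$1] <= n' "$2"
}

# Example usage (paths as in the post):
#   run_blastn_tabular ~/NCBIdb/16S_ribosomal_RNA query.fasta
#   top_hits_per_query 10 query.fasta.blastn.tsv > query.fasta.top10.tsv
```

Trimming rows after the search keeps the -max_target_seqs setting out of the comparison, so the effect of the cap on the MEGAN result can be checked separately.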
Sorry, but why are you even using BLAST for this? I presume you're trying to align short reads? You should be using something like Kraken2 (against its 16S database) instead. You should be able to find its 16S DBs here.
Thank you very much for the reply!
I'm sorry for the lack of information. I've sequenced 16S rRNA amplicons from environmental samples and have the FASTQ files as a result of the sequencing. After QCing the FASTQ files, I'm trying to determine what kind of bacteria there are in the sample, so I converted the FASTQs to FASTAs and I ran them against the 16S database to determine the bacterial species.
I'll check Kraken2 out too.
Thank you for clarifying that, and no worries!! I think Kraken2 would be the right tool here. Good luck!!
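For anyone landing here later, a Kraken2 run along the lines suggested above might look roughly like this. It is only a sketch: the wrapper name and the output file suffixes are made up for illustration, and the database path is whatever directory you unpacked a 16S Kraken2 database into. Kraken2 reads FASTQ directly, so the FASTA conversion step is not needed for it.

```shell
# Hypothetical wrapper around a single-sample Kraken2 classification.
classify_with_kraken2() {
    # $1 = path to a Kraken2 16S database directory (placeholder)
    # $2 = QC'd reads (FASTQ or FASTA)
    kraken2 --db "$1" --threads 12 \
        --report "$2.kreport" \
        --output "$2.kraken" \
        "$2"
}

# Example usage (database path is a placeholder):
#   classify_with_kraken2 /path/to/16S_db sample01.fastq
```

The --report file gives a per-taxon summary for the sample, while the --output file lists the assignment for each read, which is what you would use to pull out the reads for a given genus afterwards.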