Hello, I think I have a quick question:
I received a demo dataset of FASTQ RNA-seq data. My assignment is to find out which organism the reads come from. I thought of using BLAST for this task. I followed the Biostar Handbook, first converting the FASTQ sequences to FASTA, and then using the BLAST web tool.
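For anyone following along, the FASTQ to FASTA conversion step can be sketched roughly like this. It is a minimal illustration that assumes plain four-line FASTQ records (no wrapped sequences); the function name `fastq_to_fasta` is made up for this example and is not the handbook's own command.

```python
import io

def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write each 4-line FASTQ record as a 2-line FASTA record.

    Assumes unwrapped FASTQ: header, sequence, '+', quality.
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header starting with '@'")
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        count += 1
    return count

# Small in-memory demo; with real data, pass open file handles instead.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
out = io.StringIO()
n_written = fastq_to_fasta(demo, out)
print(n_written)
print(out.getvalue(), end="")
```

The quality lines are simply dropped, since BLAST only needs the sequences.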
When I submit my file with 100k sequences, it returns an error. When I feed it only my first 5 sequences, I get the results for the first one, and I can then choose the results for the other sequences from the drop-down menu.
So I understand it will present the results for just a couple of sequences. But is there a functionality that will evaluate the results of all 100k sequences?
I don't really know if I should make an assembly, or whether I should invest in running BLAST locally and then look at the organism of the #1 result for each sequence...?
I'd agree with using a tool like Kraken/Centrifuge. Blasting that many sequences will take a long time (I've done it), and the data that comes back is so enormous that it is almost impossible to do anything with it. I once BLASTed a set of sequencing reads to figure out what was in it. Even with some very stringent BLAST parameters, it took a week to complete, and the output file was 19 GiB. Too unwieldy to do anything with, but I was curious...
I don't know all that much about blasting and the resources it takes. Do you think I could figure out something that returns the organism of the first match for each sequence? Then I'd be able to just make a report of those...
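To make the "organism of the first match per sequence" idea concrete, here is a rough sketch of such a report. It assumes a local BLAST+ run that wrote tab-separated output with the organism name in the sixth column, for example via `-outfmt "6 qseqid sseqid pident evalue bitscore sscinames"` (scientific names need the BLAST taxonomy database to be installed), and that the hits for each query are listed best first. The function name `tally_top_hits` and the column layout are assumptions for illustration only.

```python
from collections import Counter

def tally_top_hits(lines, name_column=5):
    """Count the organism of the first hit listed for each query.

    `lines` is any iterable of tab-separated BLAST rows; column 0 is
    the query ID and `name_column` holds the organism name (assumed).
    """
    seen_queries = set()
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query = fields[0]
        if query in seen_queries:
            continue  # only the first (assumed best) hit per query counts
        seen_queries.add(query)
        counts[fields[name_column]] += 1
    return counts

# Made-up rows standing in for a real BLAST output file.
example_rows = [
    "read1\tsubjA\t99.0\t1e-50\t200\tHomo sapiens",
    "read1\tsubjB\t97.0\t1e-45\t180\tPan troglodytes",
    "read2\tsubjA\t98.5\t1e-48\t190\tHomo sapiens",
    "read3\tsubjC\t95.0\t1e-30\t120\tMus musculus",
]
for organism, n in tally_top_hits(example_rows).most_common():
    print(organism, n)
```

With a real file you would pass the open file handle as `lines`. Reads with no hit do not appear in tabular output at all, so they are silently missing from the tally.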
It would be good to clarify if you have any reason to believe (after the initial blasts) that there is more than one organism represented in your data.
Is it assumed that they're all from the same organism? If yes, maybe you should just subsample e.g. 10 reads and blast those?
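A subsample like that could be drawn along these lines. This is only a sketch: it assumes four-line FASTQ records and uses reservoir sampling so the whole file never has to sit in memory. The helper names `read_fastq` and `subsample` are invented for this example.

```python
import io
import random

def read_fastq(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return  # end of file
        yield record

def subsample(records, k, seed=0):
    """Pick k records uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)  # fixed seed so the demo is repeatable
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record  # replace with probability k / (i + 1)
    return sample

# In-memory demo with 100 fake reads; use an open file handle for real data.
fake = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(100)))
picked = subsample(read_fastq(fake), k=10)
print(len(picked))
print("".join("".join(rec) for rec in picked), end="")
```

The picked records can then be converted to FASTA and pasted into the BLAST web form, which handles a handful of sequences without trouble.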