Is there a way to search raw reads in the SRA? I don't mean an individual entry but rather all raw Illumina reads submitted to NCBI. I could limit it by year, but I'm looking to scan all of the raw unassembled reads in public repositories. Yes, I know this is a significant task.
I'm doing metagenomic / viral analysis and looking for evidence that might exist in the raw sequences but gets thrown out in the consensus / assembly data.
While not quite the same thing, you can do a sequence search against EBI's metagenomic dataset, MGnify (LINK).
Could you take the statistical sampling approach by considering all entries and simply picking at random? A queue of randomly drawn samples could be put through a pipeline, and if you expect to find something 1 in 100, or 1 in 1000, or whatever, you could get some sense of how long you'd have to run your pipeline before having some confidence that you are, or are not, likely to find what you're looking for.
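To put a rough number on "how long you'd have to run your pipeline", here is a minimal sketch. It assumes each randomly drawn record is an independent trial with the same hit rate, which is a reasonable approximation only while you are sampling a small fraction of the archive. The function name `draws_for_confidence` and the 95% default are illustrative choices, not anything from an existing tool.

```python
import math


def draws_for_confidence(hit_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of random draws n with P(at least one hit) >= confidence.

    Assumes independent draws, so P(at least one hit) = 1 - (1 - hit_rate)**n.
    """
    if not 0.0 < hit_rate < 1.0:
        raise ValueError("hit_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - hit_rate)**n <= 1 - confidence for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))


if __name__ == "__main__":
    for rate in (1 / 100, 1 / 1000):
        n = draws_for_confidence(rate)
        print(f"hit rate {rate:g}: about {n} draws for 95% confidence")
```

Read the other way around: if you run that many draws and see nothing, the true rate is probably lower than the one you assumed.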
How do you randomly sample the SRA? I can scan the query sample for unique sequences and simplify that way, but I'm not sure how to subsample the SRA.
As described, your project involves finding some type of sequence that might exist in public SRA submissions, and you'd like to scan them all to see, but this is hard. Let's say that Martians exist, I have some Martian DNA sequence, and I want to know whether Martian DNA exists in any SRA submissions. I have reason to believe that Martians visited 10 labs and may have contaminated 100 samples present in the SRA. If there are 800,000 SRA samples (I have no idea how many there are), I can simply put all SRA IDs in a bucket from which to draw IDs at random, use the SRA Toolkit (e.g. fastq-dump) to grab FASTQ data one entry at a time, and examine, say, the first million reads for my sequence of interest. If I don't find it, I select another SRA record at random and repeat. Using SRA in the cloud, one could set up a script to churn through records, and if something exists at the rate listed above (100 within 800,000, i.e. 1 in 8,000), I should expect a hit after roughly 8,000 trials on average. I could also parse all the SRA records up front to put those I suspect have the highest likelihood of containing my sequence towards the front of the pack.
That's all I mean: if your sequence exists in x records, then by sampling records in random order you may eventually find it. The slow part, of course, is that you have to keep downloading data until you find a positive match (made easier by the SRA-in-the-cloud approach). But like I said, I have no idea what you're really trying to do, or what your limitations are. I doubt you're searching for Martians.
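The loop described above could be sketched roughly as follows. This is only an illustration under several assumptions: you already have a list of run accessions from somewhere, the SRA Toolkit's `fastq-dump` is on your PATH, and an exact substring match is good enough (a real screen would also need reverse complements and mismatch tolerance). The names `fetch_reads`, `contains_query`, and `scan_random_runs` are made up for this example.

```python
import random
import subprocess
from typing import Callable, Iterable, Iterator, Optional, Tuple


def fetch_reads(accession: str, max_spots: int = 1_000_000) -> Iterator[str]:
    """Stream read sequences for one run using fastq-dump.

    -X caps the number of spots and -Z writes FASTQ to stdout.
    Assumes the default 4-line FASTQ record layout.
    """
    cmd = ["fastq-dump", "-X", str(max_spots), "-Z", accession]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for i, line in enumerate(proc.stdout):
            if i % 4 == 1:  # the sequence is the second line of each record
                yield line.strip()


def contains_query(reads: Iterable[str], query: str) -> bool:
    """Exact substring search; a deliberate simplification."""
    return any(query in read for read in reads)


def scan_random_runs(
    accessions: Iterable[str],
    query: str,
    fetch: Callable[[str], Iterable[str]] = fetch_reads,
    max_trials: Optional[int] = None,
    seed: Optional[int] = None,
) -> Optional[Tuple[str, int]]:
    """Visit accessions in random order; return (accession, trial number) of the first hit."""
    order = list(accessions)
    random.Random(seed).shuffle(order)  # the "bucket" to draw IDs from
    for trial, accession in enumerate(order[:max_trials], start=1):
        if contains_query(fetch(accession), query):
            return accession, trial
    return None
```

Passing `fetch` as a parameter lets you try the loop on a small in-memory dictionary of fake reads before pointing it at real downloads, and `max_trials` gives you a stopping point if nothing turns up.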