Question

Search all SRA data for sequence? Raw reads only

4.4 years ago

poppersrules ▴ 10

Is there a way to search raw reads in the SRA? I don't mean an individual entry, but rather all raw Illumina reads submitted to NCBI. I could limit it by year, but I'm looking to scan all of the raw, unassembled reads in public repositories. Yes, I know this is a significant task.

I'm doing metagenomic / viral analysis and looking for evidence that might exist in the raw sequences but gets thrown out in the consensus / assembly data.

SRA NCBI Raw reads fastq • 2.8k views

ADD COMMENT • link updated 7 weeks ago by Wayne ★ 2.1k • written 4.4 years ago by poppersrules ▴ 10

While not quite the same thing, you can run a sequence search against EBI's metagenomic dataset, MGnify (LINK).

ADD REPLY • link 4.4 years ago by GenoMax 152k

Could you take the statistical sampling approach by considering all entries, and simply pick randomly? A queue of randomly drawn samples could be put through a pipeline, and if you expect to find something 1 in 100, or 1 in 1000, or whatever, you could get some sense for how long you'd have to run your pipeline before having some confidence that you are, or are not, likely to find what you're looking for.
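The "how long would I have to run it" question can be put into numbers with a short calculation. This is a minimal sketch, assuming each randomly drawn record is an independent trial with the same hit rate (effectively sampling with replacement, which is a close approximation when the SRA is far larger than the number of draws). The function names are made up for illustration.

```python
import math


def prob_at_least_one_hit(hit_rate: float, n_draws: int) -> float:
    """Chance of seeing at least one positive record in n_draws random draws."""
    return 1.0 - (1.0 - hit_rate) ** n_draws


def draws_for_confidence(hit_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of draws giving at least `confidence` chance of a hit.

    Equivalently: if this many draws all come back negative, a hit rate of
    `hit_rate` or higher is unlikely at that confidence level.
    """
    if not 0.0 < hit_rate < 1.0:
        raise ValueError("hit_rate must be between 0 and 1 (exclusive)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))


if __name__ == "__main__":
    # The two example rates from the comment above: 1 in 100 and 1 in 1000.
    for rate in (1 / 100, 1 / 1000):
        n = draws_for_confidence(rate)
        print(f"hit rate {rate}: {n} draws for a 95% chance of at least one hit")
```

The same functions work in reverse: after a run of negative draws, `prob_at_least_one_hit` tells you how likely a hit would have been by now at a given assumed rate.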

ADD REPLY • link 4.4 years ago by seidel 11k

How do you randomly sample the SRA? I can scan the query sample for unique sequences and simplify that way, but I'm not sure how to subsample the SRA.

ADD REPLY • link 4.4 years ago by poppersrules ▴ 10

As described, your project involves finding some type of sequence that might exist in public SRA submissions, and you'd like to scan them all to see, but this is hard. Let's say that Martians exist, I have some Martian DNA sequence, and I want to know if Martian DNA exists in any SRA submissions. I have reason to believe that Martians visited 10 labs and may have contaminated 100 samples present in the SRA. If there are 800,000 SRA samples (I have no idea how many there are), I can simply put all SRA IDs in a bucket from which to draw IDs at random, and use the SRA Toolkit (e.g. fastq-dump) to grab FASTQ data one entry at a time and examine, say, the first million reads for my sequence of interest. If I don't find it, I select another SRA record at random and repeat.

Using SRA in the cloud, one could set up a script to churn through records, and if something exists at the rate listed above (100 within 800,000), I should expect a hit after roughly 8,000 trials on average. I could also parse all the SRA records up front to put those I suspect have the highest likelihood of containing my sequence towards the front of the pack. That's all I mean: if your sequence exists in x records, then by sampling records in random order you may eventually find it. The slow part, of course, is that you have to continually download data until you find a positive match (made easier by the SRA in the cloud approach). But like I said, I have no idea what you're really trying to do, or what your limitations are. I doubt you're searching for Martians.
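A rough sketch of that loop in Python might look like the following. It assumes the SRA Toolkit's `fastq-dump` is on the PATH and uses its `-X` (maximum spot count) and `-Z` (write to stdout) options to stream only the first reads of a run. The helper names, the accession list, and the exact-substring matching are illustrative assumptions, not a tested pipeline; a real search would likely need an aligner or k-mer matching to tolerate mismatches.

```python
import random
import subprocess
from typing import Iterable, Iterator, List, Optional

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def shuffled_accessions(accessions: Iterable[str], seed: Optional[int] = None) -> List[str]:
    """Return the accession IDs in random order (the 'bucket' to draw from)."""
    ids = list(accessions)
    random.Random(seed).shuffle(ids)
    return ids


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def contains_query(lines: Iterable[str], query: str) -> bool:
    """True if any read contains the query or its reverse complement exactly."""
    fwd = query.upper()
    rev = reverse_complement(fwd)
    return any(fwd in s.upper() or rev in s.upper() for s in fastq_sequences(lines))


def stream_first_reads(accession: str, max_reads: int = 1_000_000) -> Iterator[str]:
    """Stream FASTQ lines for the first `max_reads` spots of one run.

    Assumes `fastq-dump` from the SRA Toolkit is installed and on the PATH.
    """
    proc = subprocess.Popen(
        ["fastq-dump", "-X", str(max_reads), "-Z", accession],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        yield from proc.stdout
    finally:
        # Stop the download as soon as the caller has what it needs.
        proc.stdout.close()
        proc.terminate()
        proc.wait()


def first_hit(accessions: Iterable[str], query: str, seed: Optional[int] = None) -> Optional[str]:
    """Scan runs in random order; return the first accession with a match."""
    for acc in shuffled_accessions(accessions, seed):
        if contains_query(stream_first_reads(acc), query):
            return acc
    return None
```

Calling `first_hit(my_accession_ids, my_query_sequence)` would then churn through the list until a match turns up, where `my_accession_ids` is whatever list of run accessions you have assembled. Putting likely candidates first, as suggested above, just means ordering that list yourself instead of shuffling it.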

ADD REPLY • link 4.4 years ago by seidel 11k

score 0 · Answer 1 · 2021-02-27

4.4 years ago

Mensur Dlakic ★ 29k

Have you tried BLASTn from the NCBI web site? Simply select SRA as a target database from the drop-down menu.

ADD COMMENT • link 4.4 years ago by Mensur Dlakic ★ 29k

score 0 · Answer 2 · 2021-02-27

4.4 years ago

GenoMax 152k

"I don't mean an individual entry but rather all raw Illumina reads submitted to NCBI."

There is no realistic way to do that. You can BLAST against selected SRA accessions via the BLAST web page, as shown by @Mensur.

[Image: current size of SRA as of Feb 2021]

ADD COMMENT • link 4.4 years ago by GenoMax 152k

score 0 · Answer 3 · 2025-06-04

There's now Logan Search (it currently covers data through the end of 2023).

"Given a DNA sequence, the service replies in a few minutes in which SRA accession(s) it is likely to occur. ... In more technical depth, the search engine uses kmindex, a k-mer based sequence search tool that uses Bloom filters. It was applied to construct an index over all genome assemblies of all of SRA, more specifically over the unitigs of Logan."

There's a talk about it available here.
The associated GitHub repo: Logan.
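To give a feel for the idea in the quoted description, here is a toy sketch of a k-mer lookup backed by a Bloom filter. It is not kmindex's or Logan's implementation; the class, function names, and parameters (`k`, filter size, number of hashes) are all assumptions chosen for illustration, and it ignores reverse-complement strands. A Bloom filter never misses a k-mer that was added, but it can occasionally report one that was not, which is why the quote says "likely to occur".

```python
import hashlib
from typing import Iterator


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Fixed-size bit array with several salted hashes per item."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 3) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def index_sequence(seq: str, k: int = 5) -> BloomFilter:
    """Build a filter holding all k-mers of one sequence (e.g. one sample's unitigs)."""
    bf = BloomFilter()
    for km in kmers(seq, k):
        bf.add(km)
    return bf


def fraction_shared(query: str, bf: BloomFilter, k: int = 5) -> float:
    """Fraction of the query's k-mers that the filter reports as present."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return 0.0
    return sum(km in bf for km in query_kmers) / len(query_kmers)
```

In this toy version, one filter would be built per sample, and a query is scored against each filter by the fraction of its k-mers found. A high fraction suggests the sequence is present in that sample and is worth checking against the actual reads.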

I put together a set of Jupyter Notebooks that demonstrate using Python to help analyze the results Logan Search returns. You can run these notebooks in a MyBinder-provided remote session, without installing anything on your machine or signing up for anything, by going here and clicking on a 'launch binder' badge. Save anything useful back to your local machine promptly, as the remote session is temporary.