Hey Georgina. I'm a dev on Serratus.io and by chance have been working on a Serratus Virome Tutorial
which is related to the type of search you want to do. I'm still a few days away from finishing the analysis in R, but how I would approach here. If you open an issue on our github I can update you when it's done.
E. Virome Analaysis - _Toxoplasma gondii Case Study_
[~1 day] A Virome is a set of viruses which share a common host, environmental, or ecological niche. For example one can describe the "Human Virome", a "Antarctic Soil Virome" or the "Rainforest Virome". As such it is useful to query for all Serratus database viruses which associate with common factor. This tutorial explains how to generate and execute such a search query efficiently.
As an illustrative example, this tutorial will aim to describe the Serratus-associated virome of the unicellular parasite Toxoplasma gondii(NCBI taxid: 5811).
Prerequisites
- Internet web browser
- Unix-based command-line
- pgAdmin (postgreSQL client)
- Google Cloud Account [Optional]
X Query RNA-dependent RNA Polymerase (RdRP) sequences
X Serratus microassemblies: rdrp1.mu.fa
(3.6 GB)
X DIAMOND
(v2.0.9+)
X muscle
: multiple sequence alignment
1. Defining "Virome Search Criteria"
There is not a perfect means to define a Virome Inclusion
and Exclusion
criteria. What is important is to define the criteria in a systematic, unbiased and reproducible manner.
A) SRA Taxonomy Search
Every SRA sequencing run is associated with a limited set of "run-specific meta-data" called the SRARunInfo
. This contains a column called scientific_name
which is a human-written label for the sequencing run which captures what taxonomic "organism" the dataset is from. To search for Toxoplasma gondii
(txid: 5811) we will select its taxonomic parent, the genus Toxoplasma
(txid: 5810) to include Toxoplasma sp. YC-2010a, a Toxo which is not categorized under T. gondii.
To search the SRA for such organisms, we can search the SRA:
[`txid5810[Organism:exp]`](https://www.ncbi.nlm.nih.gov/sra/?term=txid5810[Organism:exp])
which on 23/06/28
returned 2,813
datasets. Download this SraRunInfo
file (txid5810_SraRunInfo.csv
in tutorial package). Downloaded file has 1,856
matching 'runs'.
B) SRA Meta-data Search
Sometimes datasets are not labelled based on the parasite T. gondii
, instead the host organism is choosen (Homo sapiens
, Mus musculus
, ...). To search if T. gondii appears in any meta-data we can search for different spellings/synonyms. Search is exact here.
[`"Toxoplasma gondii" OR "Toxoplasma" OR "T. gondii" OR "T gondii"`](https://www.ncbi.nlm.nih.gov/sra/?term=%22Toxoplasma+gondii%22+OR+%22Toxoplasma%22+OR+%22T.+gondii%22)
which on 23/06/28
returned 6,394
datasets. Download this SraRunInfo
file (toxo_SraRunInfo.csv
). Downloaded file has 23,531
matching 'runs'.
NOTE: This type of search is more likely to yield False Positive
matches for T. gondii
, or places where your search returns synonym results. For example this Neospora caninum sample mentions "T. gondii".
C) Sequence Content Search (optional)
What if the data-generators were unaware that their sample contained T. gondii
? It's possible to search for known-organism genomes using a precomputed hashed-kmer database called NCBI STAT
. To use STAT
, you must register for a GCP
account and follow these instructions.
STAT BigQuery for libraries containing great than 50 Toxoplasma-mapping reads:
SELECT
acc,
tax_id,
name,
total_count,
self_count
FROM
nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
WHERE
tax_id in (5810)
AND total_count>50
which on 23/06/28
returned 20,198
datasets. Download this statbigquery
file (txid5810_statbigquery.csv
).
The number of T.gondii
+ reads detected by STAT in each of the 20,198
returned runs. There are 12,863
runs which contain >1000 reads.
Set Comparison
- Metadata Only Dataset: SRR11160759 : Described as
10T1/2 cells treated with parasite-free lysate (mock) 2 hpi
. A control dataset from a T.gondii
infection study.
- BigQuery Only Dataset: SRR21616415 : Described as HFFs (human cells) infected with T.gondii in
Sample
description only. Meta-data not captured by SRA metadata search
- BigQuery Only Dataset: SRR870621 : Only described as
infected with type II (Pru A7) for 8 hrs
, which is a strain of T. gondii. Meta-data false negative, big-query true postive.
To err on the side of inclusion for now, we will take the Union of these sets, so 32,601
runs as representing the total T. gondii sequencing set.
D) Set Union and Virome-adjacent controls (optional)
We, unfortunately, now have two different csv
tables with slightly varying meta-data. For simplicity we will take the Union
of these sets and generate a uniform collection of meta-data so that the data can be processed uniformly.
Depending on your experimental design it may be beneficial to include a Virome adjacent
or Virome control
datasets in downstream analysis. Briefly, every SRA 'run' belongs to a greater bioproject
. One bioproject
can group together say 20 runs
, and based on meta-data our queries could return only a sub-set of a given bioproject. For example by searching for T. gondii
with STAT
, in a study containing 20 samples of cat muscle biopsies, only 2/20 may be positive for T gondii
, and therefore we would like to include the remaining 18/20 cat biopsies of that study. This will prove useful as a negative control
to deplete for non-specific viruses with respect to our query of T. gondii
.
#!/bin/bash
# Generate list of all UNIQUE SRA run id from all query sets above
cut -f1 -d',' toxo_SraRunInfo.csv > tg_run.list.tmp
cut -f1 -d',' txid5810_SraRunInfo.csv >> tg_run.list.tmp
cut -f1 -d',' txid5810_statbigquery.csv >> tg_run.list.tmp
# Remove header lines
grep -v "^Run" tg_run.list.tmp \
| grep -v "^acc" - \
| sort -du - > tg.runlist
wc -l tg.runlist
# 32601 tg.runlist
This list can now be used as an input to query for all SRA meta-data in big query
#GCP Big Query
SELECT
acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent
FROM
`nih-sra-datastore.sra.metadata`
WHERE
acc in ('DRR001705', 'DRR001706', 'DRR002461', ..., SRR9997107', 'SRR9997108')
Saved as tg.SraMetadata.csv
returned 32,474
"tg-positive" SRA runs (99.6%) of the search query. From this we will create a list of all 2,586
unique bioproject
identifiers (column 4) and re-query for all related runs.
#!/bin/bash
cut -f4 -d', tg_virome.SraMetadata.csv | sort -u - > tg_bioproject.list
#GCP Big Query
SELECT
acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent
FROM
`nih-sra-datastore.sra.metadata`
WHERE
bioproject in ('PRJDA33425', 'PRJDA33429', ..., 'PRJNA985929', 'PRJNA988114')
Saved as tg_adjacent.SraMetadata.csv
Which yields 668,195
total, or 635,721
"tg-negative" SRA runs. This will serve as a "negative control" for non-specific virus-associations.
2. Generating the "Target Virome" and "Off-target Virome"
Using the "Tg-postive" and "Tg-negative" runs, we will query the Serratus
SQL tables to retrieve sOTU
which were found within those runs. This molecular barcode approximates RNA virus species and can be used to meaningfully group related sequences when looking for an association.
# Serratus postgres SQL Query
SELECT * FROM public.palm_sra2
WHERE qc_pass = 'true' AND
run_id in ('DRR001701',
'DRR001702',
'DRR001703',
'DRR001704',
...
'SRR9997116',
'SRR9997117',
'SRR9997118')
ORDER BY run_id ASC
Saved as tg_adj_U_palmsra.csv
which returned 121,907
virus-run observations in 15,256
SRA runs, containing 30,828
distinct virus sOTU.
3. Virome summary statistics (to be continued)
Since you are asking about finding data I will point you to https://sra-explorer.info/. If you type in "spider" as a search term you can get hundreds of hits. These are all datasets (it looks like there is a 1000 spider transcriptome project which may or may not be of interest to you). Way you do this search is detailed in this thread : sra-explorer : find SRA and FastQ download URLs in a couple of clicks
If you have no prior experience with next-gen sequencing data then this is a complex project. Are you working on this as a thesis project? Using public sequence data comes with several caveats since the aim of the public experiment may not exactly align with your interest. You would benefit from discussing with whoever assigned this project to you about strategy and execution.