BLAST hits on viruses relate to a different host than the one used
1
0
12 months ago
DonPhager • 0

Hey guys, I recently (9 days ago) posted about metagenomic analysis of phages. I hope it is okay to post again, since the issue is different this time. I am struggling to decide which BLAST parameters I should consider and which changes I could make. I did not find much by going through papers.

I first ran BLAST on my assembled contigs with this command:

blastn -db nt_viruses -negative_gilist neg_list.gi -evalue 0.000001 -outfmt "6 qseqid saccver sscinames slen length qcovs qcovhsp pident evalue bitscore"

I also ran it without the -negative_gilist parameter.

I created the .gi list by searching NCBI Nucleotide (GenBank) for "environmental" AND "uncultured" and narrowing the results to Viruses and Genomic DNA. The goal was to exclude environmental and uncultured samples, since I had a lot of hits like "Caudovirales sp.", which do not help me much. However, I found that this also excludes some valuable hits that do not appear to be uncultured when I look up their accession numbers. Is this the right way to mimic the option "exclude Uncultured/environmental sample sequences" that online BLAST offers?

I am also a little baffled by hits that appear to be phages of Escherichia, Klebsiella, Salmonella, Yersinia etc., although I used P. aeruginosa strains as hosts in the sample preparation. I also have hits to Stenotrophomonas phages, which I can understand since these bacteria are related. But I do not know whether I should remove the hits to other bacterial genera and keep only the Pseudomonas phages, or do something else. Normally there should not be any contamination, and I do not get these results in all samples: some samples contain only Pseudomonas phages.

Thank you!

Phage BLAST Metagenomics • 1.2k views
ADD COMMENT
1

Just getting a hit on a phage is not informative enough by itself. The query coverages and identities matter a lot. Case in point: I am investigating an E. coli phage contamination, and the phage that I found is only about 50% similar to known phages, and I get lots of other hits to different species as well. Once assembled, half the phage is practically unknown.

ADD REPLY
0

Is there some sort of generally used or accepted cutoff? In the literature, many papers use values like at least 50% identity and 90% query coverage.

ADD REPLY
1

Those numbers 50% and 90% feel very "human" oriented rather than fact-based.

They seem like numbers that are easy to remember and strong enough if you can get them. The problem is that your phage might not match anything at those criteria.

The novel phage genomes I am finding are the other way around: more like 50% query coverage and 90% identity over those regions :-)
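
To see how much the choice of cutoff changes what survives, one can filter the tabular output under both settings and compare. A minimal sketch, assuming the column order from the -outfmt string in the original question (qcovs in column 6, pident in column 8) and tab-separated output; the function name filter_hits and the file name hits.tsv are made up for illustration:

```shell
# filter_hits MIN_QCOV MIN_PIDENT < hits.tsv
# Keeps rows whose query coverage (col 6, qcovs) and percent identity
# (col 8, pident) both reach the given minimums.
# Assumed columns:
#   qseqid saccver sscinames slen length qcovs qcovhsp pident evalue bitscore
filter_hits() {
    awk -F'\t' -v min_cov="$1" -v min_id="$2" \
        '($6 + 0) >= (min_cov + 0) && ($8 + 0) >= (min_id + 0)'
}

# Usage (hits.tsv is a hypothetical file name):
#   filter_hits 90 50 < hits.tsv | cut -f1 | sort -u | wc -l   # high coverage, moderate identity
#   filter_hits 50 90 < hits.tsv | cut -f1 | sort -u | wc -l   # partial coverage, high identity
```

Counting the distinct contigs that pass each setting gives a rough feel for how sensitive the result is to the cutoff, rather than an answer to which cutoff is right.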

Long story short, I think phage diversity is far larger than anticipated, and assigning a phage to a species based on a partial match is less reliable than one would expect. Out of curiosity, I looked at the RefSeq representative viral database:

blastdbcmd -db ~/refs/blastdb/ref_viruses_rep_genomes -entry all -outfmt "%t" | wc -l

shows

18584

filtering for phages:

blastdbcmd -db ~/refs/blastdb/ref_viruses_rep_genomes -entry all -outfmt "%t" | grep phage | wc -l

produces

6084

Thus about a third of all viruses in that BLAST database are some sort of phage.
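
The "about a third" figure follows directly from the two counts above. A quick check of the arithmetic (the helper name pct is only for illustration):

```shell
# pct PART TOTAL -> prints PART as a percentage of TOTAL, one decimal place
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f%%\n", 100 * part / total }'
}

pct 6084 18584
```

This prints 32.7%. Note that grep phage is case-sensitive, so a title containing only "Phage" would not be counted; grep -i would include such titles, and the count could then differ somewhat.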

But then there are almost a million prokaryotes in RefSeq alone, so knowing about just 6 thousand phages seems like a substantial underestimation.

ADD REPLY
0

The bacteria you mention are common contaminants in labs and reagents...

ADD REPLY
2
12 months ago
Mensur Dlakic ★ 28k

Pretty much what Istvan Albert already told you, with a couple of concrete suggestions.

I don't think it is a sound strategy to exclude environmental and uncultured samples. By doing that you are artificially removing entries that may be most similar to your sample, and any conclusion you derive from that analysis is not likely to be rigorous. Filtering should not be a matter of convenience, where you remove entries because they are not informative. Instead, I think you should take the information as is and try to make the most sense of it. If the closest entry to your viral genome is some poorly characterized Caudovirales, that's what it is, and it indirectly shows that the viral genome in question is likely to be novel. If, on the other hand, you remove the inconvenient entries and the top match ends up being a virus that you can place better but that is not all that related to the query, that could lead to a wrong conclusion. The same reasoning also explains why you are getting Escherichia, Klebsiella, Salmonella and Yersinia phages as your top hits. If the sequence space of Pseudomonas phages is poorly characterized, that would argue that the genomes you have are novel and will have top hits outside of their immediate relatives, because those relatives don't yet exist in the known viral sequence space.
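
One way to keep those entries while still seeing at a glance which hits are poorly characterized is to label them after the search instead of excluding them beforehand. A rough sketch, assuming tab-separated output with the scientific name (sscinames) in column 3 as in the question's -outfmt string; the function name tag_uncharacterized and the name-based rule (contains "uncultured" or "environmental", or ends in " sp.") are illustrative assumptions, not an NCBI definition:

```shell
# tag_uncharacterized < hits.tsv
# Appends one column to every row: "uncharacterized" when the subject's
# scientific name (col 3) looks like an uncultured/environmental or
# unnamed "sp." entry, otherwise "named". No rows are dropped.
tag_uncharacterized() {
    awk -F'\t' -v OFS='\t' '{
        name = tolower($3)
        tag = (name ~ /uncultured|environmental| sp\.$/) ? "uncharacterized" : "named"
        print $0, tag
    }'
}

# Usage (hits.tsv is a hypothetical file name):
#   tag_uncharacterized < hits.tsv > hits.tagged.tsv
```

Because the rule only looks at names, it will mislabel some entries in both directions, so the extra column is a reading aid and not a filter.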

I would say that viruses that are 70+% identical over large sequence regions are pretty much guaranteed to be related, and that cutoff may need to go lower as viruses typically diverge much faster than cellular organisms. If you were to compare your entries to the full viral database and can't find anything with at least 50% coverage and 50% sequence identity to them, I don't think many people would argue if you claimed those to be novel. These cutoffs may need to be adjusted, though. Finally, there is a recent program that will automatically classify most viral contigs, though that doesn't necessarily mean that their host will be immediately obvious.

https://portal.nersc.gov/genomad/
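
The 50% coverage and 50% identity rule of thumb above can be turned into a quick list of contigs worth a closer look. A minimal sketch, assuming a tab-separated hits table with qseqid in column 1, qcovs in column 6 and pident in column 8 (the column order from the question), plus a plain list of all contig IDs, one per line; the function name novel_candidates and both file names are hypothetical:

```shell
# novel_candidates IDS_FILE HITS_FILE [MIN_QCOV] [MIN_PIDENT]
# Prints the contig IDs from IDS_FILE that have no hit in HITS_FILE
# reaching both minimums (defaults: 50 and 50).
novel_candidates() {
    awk -F'\t' -v min_cov="${3:-50}" -v min_id="${4:-50}" '
        # First file operand: the hits table. Remember contigs with a qualifying hit.
        FILENAME == ARGV[1] {
            if (($6 + 0) >= (min_cov + 0) && ($8 + 0) >= (min_id + 0)) matched[$1] = 1
            next
        }
        # Second file operand: all contig IDs. Print those that never qualified.
        !($1 in matched)
    ' "$2" "$1"
}

# Usage (contig_ids.txt and hits.tsv are hypothetical file names):
#   novel_candidates contig_ids.txt hits.tsv
#   novel_candidates contig_ids.txt hits.tsv 50 70   # stricter identity
```

Contigs with no hits at all are reported too, since they never appear in the hits table. As noted above, the cutoffs may need adjusting, so the list is a starting point for inspection and not a classification.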

ADD COMMENT
0

Thank you very much for giving such in-depth answers, Istvan Albert and Mensur Dlakic! I had actually also looked for numbers on how many phage sequences are in the db and was wondering about those small numbers. I am also using the new "nt_viruses" database. As I have only recently started working in this field, I thought this would be a reasonable choice. What is your opinion on this db?

This is the description of the db: "The Viruses nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences and sequences longer than 100Mb. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry."

I have also decided not to filter out environmental and uncultured samples, as I realized that this also filters out some interesting hits that, after some research in GenBank, turned out not to be uncultured or environmental samples at all. So I would have lost some information here. I did also have some hits on Caudovirales sp., as you mentioned @mensur. As you said, this does not give much information, but I will have to see what I make of it. They are not the only hits with high qcov and %identity for my sequences.

But you made me realize that these cutoffs I read about so often should not simply be copied; rather, I should work out which parameters make sense for my individual case!

ADD REPLY
