Question

BLAST hits on viruses relate to different host than used

0

21 months ago

DonPhager • 0

Hey guys, I recently (9 days ago) posted about metagenomic analysis of phages. I hope it is okay to post again so soon; the issue is different this time. I am struggling to decide which BLAST parameters I should use and which changes I could make. I did not find much guidance in the papers I went through.

I first ran BLAST on my assembled contigs with this command:

blastn -db nt_viruses -negative_gilist neg_list.gi -evalue 0.000001 -outfmt "6 qseqid saccver sscinames slen length qcovs qcovhsp pident evalue bitscore"

and then again without the -negative_gilist parameter.

I created the .gi list by searching for "environmental" AND "uncultured" in NCBI Nucleotide (GenBank) and narrowing the results to Viruses and Genomic DNA. The goal was to exclude environmental and uncultured samples, since I had a lot of hits like "Caudovirales sp.", which do not help me much. I realized, however, that this also excludes some valuable hits that do not appear to be uncultured when I look up their accession numbers. Is this the right way to mimic the "exclude Uncultured/environmental sample sequences" option of online BLAST?

I am also a little baffled by hits to phages that presumably infect Escherichia, Klebsiella, Salmonella, Yersinia etc., although I used P. aeruginosa strains as the host during sample preparation. I also get phages of Stenotrophomonas, which I can understand since that genus is related. But I do not know whether I should remove the hits to phages of other bacterial genera and keep only the Pseudomonas phages, or do something else. There should not normally be any contamination, and I do not see these results in all samples: some samples contain only Pseudomonas phages.

Thank you!

Phage BLAST Metagenomics • 1.9k views

ADD COMMENT • link 21 months ago by DonPhager • 0

1

Just getting a hit on a phage is not informative enough by itself. The query coverages and identities matter a lot. Case in point: I am investigating an E. coli phage contamination, and the phage that I found is only about 50% similar to known phages, and I get lots of other hits to different species as well. Once assembled, half the phage is practically unknown.

ADD REPLY • link 21 months ago by Istvan Albert 103k

0

Is there some sort of generally used or accepted cut-off? In the literature, many papers use values like at least 50% identity and 90% query coverage.

ADD REPLY • link 21 months ago by DonPhager • 0

1

Those numbers, 50% and 90%, feel very "human-oriented" rather than fact-based.

They seem like numbers that are easy to remember and strong enough if you can get them. The problem is that your phage might not match anything at those criteria.

The novel phage genomes I am finding are the other way around: more like 50% query coverage and 90% identity over those regions :-)

Long story short, I think the phage diversity is far larger than anticipated and assigning a phage to a species based on a partial match is less reliable than one would expect. Out of curiosity, I looked at the RefSeq representative viral database

blastdbcmd -db ~/refs/blastdb/ref_viruses_rep_genomes -entry all -outfmt "%t" | wc -l

shows

filtering for phages:

blastdbcmd -db ~/refs/blastdb/ref_viruses_rep_genomes -entry all -outfmt "%t" | grep phage | wc -l

produces

Thus about a third of all viruses that BLAST knows about are some sort of phage.

But then there are almost a million prokaryotes in RefSeq alone, so knowing about just 6 thousand phages seems like a substantial underestimation.

ADD REPLY • link 21 months ago by Istvan Albert 103k

0

The bacteria you mention are common contaminants in labs and reagents...

ADD REPLY • link 21 months ago by Brian Bushnell 20k

score 2 · Accepted Answer · 2023-11-24

Pretty much what Istvan Albert already told you, with a couple of concrete suggestions.

I don't think it is a sound strategy to exclude environmental and uncultured samples. By doing that you are artificially removing entries that may be the most similar to your sample, and any conclusion you derive from that analysis is not likely to be rigorous. The choice should not be based on convenience, where you remove entries simply because they are not informative. Instead, I think you should take the information as it is and try to make the most sense of it. If the closest entry to your viral genome is some poorly characterized Caudovirales, that's what it is, and it indirectly shows that the viral genome in question is likely to be novel. If, on the other hand, you remove the inconvenient entries and the top match ends up being a virus you can place better but that is not all that related to the query, that could lead to a wrong conclusion.

This type of reasoning also explains why you are getting Escherichia, Klebsiella, Salmonella and Yersinia phages as your top hits. If the sequence space of Pseudomonas phages is poorly characterized, that would argue that the genomes you have are novel and will have top hits outside of their immediate relatives, because those relatives don't yet exist in the known viral sequence space.
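To see what the unfiltered search actually says per contig, something like the following Python sketch could help. It assumes the tab-separated output produced by the -outfmt string in the question (same column order) and keeps the single highest-bitscore hit per contig, uncultured entries included. The function names, the example rows and the accessions in them are made up for illustration, and taking the first word of sscinames is only a crude stand-in for "host genus".

```python
import csv
import io
from collections import Counter

# Column order copied from the -outfmt string in the question.
COLUMNS = ["qseqid", "saccver", "sscinames", "slen", "length",
           "qcovs", "qcovhsp", "pident", "evalue", "bitscore"]


def best_hits(handle):
    """Return a dict mapping each contig to its highest-bitscore row."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


def name_prefix_tally(best):
    """Count the first word of each best hit's subject name.

    For names like 'Pseudomonas phage X' this is the host genus; for
    'uncultured ...' entries it is simply the word 'uncultured'.
    """
    return Counter((row["sscinames"].split() or ["unknown"])[0]
                   for row in best.values())


# Hypothetical rows for illustration only; these are not real results.
EXAMPLE = (
    "contig_1\tACC0001.1\tPseudomonas phage X\t45000\t30000\t70\t65\t92.5\t0.0\t5000\n"
    "contig_1\tACC0002.1\tEscherichia phage Y\t40000\t2000\t10\t5\t80.0\t1e-50\t800\n"
    "contig_2\tACC0003.1\tuncultured Caudovirales phage\t38000\t9000\t40\t30\t85.0\t0.0\t1200\n"
    "contig_2\tACC0004.1\tKlebsiella phage Z\t41000\t7000\t30\t25\t78.0\t0.0\t900\n"
    "contig_3\tACC0002.1\tEscherichia phage Y\t40000\t1500\t8\t8\t75.0\t1e-30\t300\n"
)

if __name__ == "__main__":
    tally = name_prefix_tally(best_hits(io.StringIO(EXAMPLE)))
    for prefix, count in tally.most_common():
        print(prefix, count)
```

A best hit to an Escherichia phage at low coverage, like contig_3 in the made-up data, is exactly the kind of result that says more about gaps in the database than about the host.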

I would say that viruses that are 70% or more identical over large sequence regions are pretty much guaranteed to be related, and that cutoff may need to go lower, as viruses typically diverge much faster than cellular organisms. If you were to compare your entries to the full viral database and can't find anything with at least 50% coverage and 50% sequence identity to them, I don't think many people would argue if you claimed those to be novel. These cutoffs may need to be adjusted, though. Finally, there is a recent program, geNomad, that will automatically classify most viral contigs, though that doesn't necessarily mean that their host will be immediately obvious.

https://portal.nersc.gov/genomad/
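As a rough illustration of the 50% coverage / 50% identity rule of thumb above, here is a minimal Python sketch that lists contigs with no hit reaching both thresholds. It assumes the same tabular columns as the blastn command in the question. The function name, the threshold defaults and the example rows are hypothetical. Note that qcovs is reported per subject while pident is per HSP, so this is only an approximation of "identity over the covered region". Contigs with no hits at all never appear in BLAST output, which is why the full list of contig names is passed in separately.

```python
import csv
import io

# Column order copied from the -outfmt string in the question.
COLUMNS = ["qseqid", "saccver", "sscinames", "slen", "length",
           "qcovs", "qcovhsp", "pident", "evalue", "bitscore"]


def candidate_novel(handle, all_contigs, min_qcov=50.0, min_pident=50.0):
    """Return contigs that have no hit with qcovs and pident at or above the cutoffs."""
    supported = set()
    reader = csv.DictReader(handle, fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if float(row["qcovs"]) >= min_qcov and float(row["pident"]) >= min_pident:
            supported.add(row["qseqid"])
    # Anything not supported by at least one hit is a candidate, including
    # contigs that are missing from the BLAST output entirely.
    return sorted(set(all_contigs) - supported)


# Hypothetical rows for illustration only; these are not real results.
HITS = (
    "contig_1\tACC0001.1\tPseudomonas phage X\t45000\t30000\t70\t65\t92.5\t0.0\t5000\n"
    "contig_2\tACC0003.1\tuncultured Caudovirales phage\t38000\t9000\t20\t20\t95.0\t0.0\t1200\n"
)

if __name__ == "__main__":
    contigs = ["contig_1", "contig_2", "contig_3"]
    print(candidate_novel(io.StringIO(HITS), contigs))
```

The cutoffs are function arguments on purpose, since the thread makes clear there is no agreed-upon value and they may need adjusting for a given dataset.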