Hey guys, I recently (9 days ago) posted about metagenomic analysis of phages. I hope it is okay to post again so soon; the issue is different this time. I am struggling a bit with deciding which BLAST parameters I should consider and which changes I could make. I did not find much by going through papers.
I first ran BLAST on my assembled contigs with this command: blastn -db nt_viruses -negative_gilist neg_list.gi -evalue 0.000001 -outfmt "6 qseqid saccver sscinames slen length qcovs qcovhsp pident evalue bitscore", and then again without the -negative_gilist parameter.
I created the .gi list by searching for "environmental" AND "uncultured" in NCBI Nucleotide (GenBank) and narrowing the results to Viruses and Genomic DNA. The goal was to exclude environmental and uncultured samples, since I had a lot of hits like "Caudovirales sp.", which do not help me much. However, I noticed that this also excludes some valuable hits that do not appear to be uncultured when I look up their accession numbers. Is this the right way to mimic the option "exclude Uncultured/environmental sample sequences" that online BLAST offers?
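One alternative to a negative GI list would be to run the search without it and filter the tabular output afterwards on the sscinames column. The following is only a minimal sketch under assumptions: the column order is taken from the -outfmt string above, the keyword list is a guess at what the web option excludes, and the two example rows (contig names, accessions, phage names) are made up for illustration.

```python
import csv
import io

# Column order copied from the -outfmt "6 ..." string in the command above.
COLUMNS = ["qseqid", "saccver", "sscinames", "slen", "length",
           "qcovs", "qcovhsp", "pident", "evalue", "bitscore"]

# Assumption: these keywords approximate "uncultured/environmental" subjects.
EXCLUDE_WORDS = ("uncultured", "environmental")


def read_hits(handle):
    """Parse tab-separated BLAST output into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"))


def drop_uncultured(hits, words=EXCLUDE_WORDS):
    """Keep only hits whose scientific name contains none of the keywords."""
    kept = []
    for hit in hits:
        name = hit["sscinames"].lower()
        if not any(word in name for word in words):
            kept.append(hit)
    return kept


# Made-up example rows, not real BLAST results.
example = (
    "contig_1\tAB000001.1\tPseudomonas phage example\t45000\t3000\t80\t60\t92.5\t1e-50\t500\n"
    "contig_1\tAB000002.1\tuncultured Caudovirales phage\t40000\t2500\t70\t50\t88.0\t1e-40\t400\n"
)
hits = read_hits(io.StringIO(example))
filtered = drop_uncultured(hits)
for hit in filtered:
    print(hit["qseqid"], hit["saccver"], hit["sscinames"])
```

Filtering after the search keeps the excluded hits available for inspection, so it is easier to check whether a valuable hit was dropped by mistake.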
I am also a little baffled by hits that appear to be phages of Escherichia, Klebsiella, Salmonella, Yersinia etc., although I used P. aeruginosa strains as the host in the sample preparation. I also get phages of Stenotrophomonas, which I can understand since these are related. But I do not know whether I should simply remove the hits to other bacterial genera and keep only the Pseudomonas phages, or do something else. Normally there should not be any contamination, and I do not see these results in all samples. Some samples contain only Pseudomonas phages.
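Before removing anything, it may help to tally which host genus the best hit of each contig points to, per sample. The sketch below is only an illustration: it assumes the genus is the first word of sscinames (which will not hold for every entry), picks the best hit by bitscore, and uses made-up hit tuples.

```python
from collections import Counter


def best_hit_per_contig(hits):
    """Return {qseqid: (sscinames, bitscore)} keeping the highest bitscore per contig."""
    best = {}
    for qseqid, sscinames, bitscore in hits:
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (sscinames, bitscore)
    return best


def genus_counts(hits):
    """Count contigs per host genus, taking the first word of the name as the genus."""
    best = best_hit_per_contig(hits)
    return Counter(name.split()[0] for name, _ in best.values())


# Made-up (qseqid, sscinames, bitscore) tuples for one sample.
hits = [
    ("contig_1", "Pseudomonas phage X (made up)", 900.0),
    ("contig_1", "Escherichia phage Y (made up)", 300.0),
    ("contig_2", "Klebsiella phage Z (made up)", 450.0),
    ("contig_3", "Pseudomonas phage X (made up)", 800.0),
]
counts = genus_counts(hits)
for genus, n in counts.most_common():
    print(genus, n)
```

Here the Escherichia hit disappears because contig_1 has a stronger Pseudomonas hit, while contig_2 still has Klebsiella as its best match and would deserve a closer look.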
Thank you!
Just getting a hit on a phage is not informative enough by itself. The query coverage and identity matter a lot. Case in point: I am investigating an E. coli phage contamination, and the phage that I found is only about 50% similar to known phages, and I get lots of other hits to different species as well. Once assembled, half of the phage is practically unknown.
Is there some sort of generally used/accepted cut-off? In the literature, many papers use values like at least 50% identity over 90% query coverage.
Those numbers 50% and 90% feel very "human" oriented rather than fact-based.
They seem like numbers that are easy to remember and strong enough if you can get them. The problem is that your phage might not match anything at those criteria.
The novel phage genomes I am finding are the other way around: more like 50% query coverage and 90% identity over those regions :-)
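To see how much the choice of cut-off changes the picture, the same hits can be filtered with both sets of thresholds. This is a rough sketch, assuming qcovs and pident are read from the tabular BLAST output; the Hit type, the passes helper and the three example hits are hypothetical and only there for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    qseqid: str
    sscinames: str
    qcovs: float   # query coverage per subject, in percent
    pident: float  # percent identity of the alignment


def passes(hit, min_pident, min_qcovs):
    """True if the hit meets both the identity and the query coverage threshold."""
    return hit.pident >= min_pident and hit.qcovs >= min_qcovs


# Made-up hits, not real results.
hits = [
    Hit("contig_1", "Pseudomonas phage A (made up)", 95.0, 55.0),
    Hit("contig_2", "Escherichia phage B (made up)", 50.0, 91.0),
    Hit("contig_3", "Klebsiella phage C (made up)", 12.0, 78.0),
]

# Thresholds mentioned above: >= 50% identity over >= 90% query coverage.
literature_style = [h.qseqid for h in hits if passes(h, 50.0, 90.0)]
# The reverse situation: >= 90% identity over >= 50% query coverage.
novel_style = [h.qseqid for h in hits if passes(h, 90.0, 50.0)]

print("50% identity / 90% coverage:", literature_style)
print("90% identity / 50% coverage:", novel_style)
```

With these made-up numbers each rule keeps a different contig, which illustrates why a single fixed cut-off can miss partially novel phages.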
Long story short, I think the phage diversity is far larger than anticipated and assigning a phage to a species based on a partial match is less reliable than one would expect. Out of curiosity, I looked at the RefSeq representative viral database
shows
filtering for phages:
produces
Thus about a third of all viruses BLAST knows about are some sort of phage.
But then there are almost a million prokaryotes in RefSeq alone, so knowing about just 6 thousand phages seems like a substantial underestimation.
The bacteria you mention are common contaminants in labs and reagents...