Hi,
First of all, the goal of the analysis is to identify nucleotide sequences exclusive to Pseudomonas. That covers a wide range of organisms, and I'm aware of that. Unfortunately, the person who asked me to do this did not give me proper instructions or define a specific organism.
Since exclusive nt sequences could be found anywhere in the genome and not only in CDSs, I'm trying to figure out how to analyze the whole genome sequence for this purpose. I've done this before, but only with CDSs and proteins. My idea is to use local blastn to search the whole genome against the Pseudomonas nt database. First, the blastn search would find the hits and I would discard the non-hits; second, a file containing the hits would be searched against the bacteria nr/nt database, and this time I would discard the hits (since they imply sequence similarity with other organisms). Correct me if this method is wrong.
If the above method is okay, then the problem is that I can't figure out how to inspect the nucleotide sequences that aligned with sequences in the BLAST database. Remember, this is a complete record with no annotation, since I want to analyze the whole genome and not only the coding regions. I'm currently experimenting with the -outfmt parameter, but this will take a while since each analysis takes quite some time. Web BLAST is out of the question since the analysis is too CPU intensive (it runs into the CPU usage limit on web BLAST).
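One possible way to see the aligned nucleotides themselves, sketched under assumptions: the tabular BLAST+ output accepts custom columns, and the `qseq` and `sseq` specifiers print the aligned part of the query and of the subject. The Python below only builds a blastn argument list and parses such a table. The file names, the database name, the helper names, and the example line are hypothetical placeholders.

```python
import csv
import io

# Columns requested from blastn; qseq/sseq are the aligned query/subject strings.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "evalue", "bitscore", "qseq", "sseq"]


def build_blastn_command(query_fasta, database, out_path, threads=4):
    """Return an argument list for blastn (hypothetical paths; not executed here)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "6 " + " ".join(COLUMNS),
        "-num_threads", str(threads),
        "-out", out_path,
    ]


def parse_hits(handle):
    """Parse tab-separated BLAST output into dicts with typed numeric fields."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in ("length", "qstart", "qend", "sstart", "send"):
            hit[key] = int(hit[key])
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up example line, only to show the shape of the table.
example = ("PAO1\tsubject_1\t91.67\t12\t101\t112\t5\t16\t0.01\t18.3"
           "\tACGTACGTACGT\tACGTACGAACGT\n")
for hit in parse_hits(io.StringIO(example)):
    print(hit["qstart"], hit["qend"], hit["qseq"], hit["sseq"])
```

The argument list can be passed to `subprocess.run(..., check=True)` once the paths point at real files; because each row carries `qstart`/`qend`, the aligned region can be located on the unannotated genome without any feature table.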
If anyone knows how to solve this, I ask you kindly to show me the way. Thanks.
Just to be clear: you are trying to find sequences exclusive to Pseudomonas as compared to the rest of GenBank?
But then you say this
What is the query here?
The query is Pseudomonas aeruginosa PAO1 (reference genome). The query will be blasted against the nt database. The idea is that running P. aeruginosa against the Pseudomonas genus will identify sequences shared within the genus, and running against bacteria (excluding Pseudomonas) will ensure the sequences are exclusive to Pseudomonas. I thought about running a core genome analysis, but I don't think I have the equipment for it (too CPU intensive). The nr/nt database includes sequences from GenBank+EMBL+DDBJ+PDB+RefSeq. I'm also using BLAST DBv5 to make it easier to include or exclude taxids.
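The two searches described here could be set up roughly as follows. This is a sketch, assuming a BLAST+ release with v5 databases in which blastn accepts `-taxids` and `-negative_taxids`; 286 is the NCBI Taxonomy ID of the genus Pseudomonas. Older releases expect species-level or lower taxids for these options (BLAST+ ships `get_species_taxids.sh` to expand a genus), and the two options cannot be combined in one run, so the second search below simply excludes Pseudomonas from nt rather than also restricting to Bacteria. File names and the function name are hypothetical.

```python
PSEUDOMONAS_TAXID = "286"  # NCBI Taxonomy ID of the genus Pseudomonas


def taxid_search(query_fasta, out_path, exclude_pseudomonas, database="nt"):
    """Build a blastn argument list limited to, or excluding, Pseudomonas."""
    taxid_flag = "-negative_taxids" if exclude_pseudomonas else "-taxids"
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        taxid_flag, PSEUDOMONAS_TAXID,
        "-outfmt", "6",
        "-out", out_path,
    ]


# Step 1: PAO1 against Pseudomonas only.
within_genus = taxid_search("PAO1.fasta", "vs_pseudomonas.tsv",
                            exclude_pseudomonas=False)
# Step 2: candidate regions against everything except Pseudomonas.
outside_genus = taxid_search("candidates.fasta", "vs_others.tsv",
                             exclude_pseudomonas=True)
print(" ".join(within_genus))
print(" ".join(outside_genus))
```

Nothing is executed here; the printed commands can be checked by eye first and then run with `subprocess.run(within_genus, check=True)`.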
I am having trouble understanding how you think this is going to work. How are you deciding thresholds for sequence similarity?
nt contains diverse sequences, so most of the PAO1 genome would hit something in the database. Excluding Pseudomonas genomes from the other bacterial genomes would still generate plenty of hits to genomes that are taxonomically similar.
Assuming you did identify sequences that are Pseudomonas-specific, what do you want to use them for?
Have you considered using sourmash (https://sourmash.readthedocs.io/en/latest/command-line.html) to manage the genomes that could go into these comparisons?
The idea is to use E-value and bit-score to evaluate sequence similarity and then filter the sequences in the output file. By running BLAST against nt, we could select the non-hit sequences and predict them to be Pseudomonas-specific. Before that, as I said, we would BLAST PAO1 against the genus Pseudomonas (and these sequences would then be analyzed against nt). I'm not sure this will work since I've never done something at this scale. The goal is to use Pseudomonas-specific regions for primer design. Again, the person who asked me to do this did not give me enough information. This person's idea is based on an article that analyzes the presence of bacterial sequences in a variety of cancers. According to this article, many bacterial reads were found in cancer samples (could just be contamination), including Pseudomonas-like, Acinetobacter-like, Rodentia spp., H. pylori, and others. Since the article is so vague, I was asked to identify Pseudomonas-specific sequences. It would be much easier if it were just one species. Basically, the final objective is to design primers for Pseudomonas.
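The filtering step could look something like the sketch below: keep hits that pass E-value and bit-score cutoffs, merge their query coordinates, and report the stretches of the query that no passing hit covers. The cutoffs (1e-5 and 50), the function names, and the example hits are illustrative assumptions, not recommended values; the regions reported depend entirely on the cutoffs chosen.

```python
def covered_intervals(hits, max_evalue=1e-5, min_bitscore=50.0):
    """Merge query intervals of hits passing both cutoffs (1-based, inclusive)."""
    spans = sorted(
        (min(qstart, qend), max(qstart, qend))
        for qstart, qend, evalue, bitscore in hits
        if evalue <= max_evalue and bitscore >= min_bitscore
    )
    merged = []
    for start, end in spans:
        # Overlapping or directly adjacent spans are fused into one interval.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_intervals(query_length, covered):
    """Return the query regions not covered by any merged interval."""
    gaps = []
    position = 1
    for start, end in covered:
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= query_length:
        gaps.append((position, query_length))
    return gaps


# Made-up hits as (qstart, qend, evalue, bitscore) on a 1000 bp query.
hits = [(1, 200, 1e-40, 300.0), (150, 400, 1e-30, 250.0),
        (600, 700, 0.5, 20.0), (800, 900, 1e-20, 120.0)]
covered = covered_intervals(hits)
print(covered)                             # [(1, 400), (800, 900)]
print(uncovered_intervals(1000, covered))  # [(401, 799), (901, 1000)]
```

In the example, the weak hit at 600-700 fails both cutoffs, so that stretch is reported as uncovered; loosening the cutoffs would remove it from the candidate list, which is why the thresholds matter so much for this approach.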
Start there and see what you get. I am afraid you are going to get hits to something until you start adjusting the search stringency. Even then, the sequences with "no hits" may not necessarily be Pseudomonas-specific. They could just be an artifact of your search settings.
If you suspect your cancer data contains bacterial sequences, it may be good to start there. Use something like bbsplit from the BBMap suite to bin host reads away, and then use the remaining reads to search against nt or nr bacteria. If you consistently see something, then it may be one avenue to go forward.
Thank you. I think the analysis I proposed is extremely intensive, both for a normal computer and for one person. I'm trying another strategy. After reading more about primers (I'm not that familiar with primer design, I just know the basics), I think the person who asked me to do this wants a universal primer for Pseudomonas. Since this is going a little off-topic, I'm going to open a new post.