Is it possible to use local blastn to obtain the aligned sequences of complete record files?
5.5 years ago

Hi,

First of all, the goal of the analysis is to identify nucleotide sequences exclusive to Pseudomonas. I'm aware that this covers a wide range of organisms. Unfortunately, the person who asked me to do this did not give me proper instructions or define a specific organism.

Since exclusive nt sequences could be found in the whole genome and not only on CDSs, I'm trying to figure out how to analyze the whole genome sequence for this purpose. I've done this before but only working with CDSs and proteins. I came up with an idea to use local blastn to blast search the whole genome against the Pseudomonas nt database. Firstly, the blastn search would find the hits and I would filter out the non-hits; secondly, a file containing the hits would be blasted against the bacteria nr/nt database to find the hits and non-hits and then I would filter out the hits (since they imply sequence similarity with other organisms). Correct me if this method is wrong.

If the above method is okay, then the problem is that I can't figure out how to retrieve the nucleotide sequences that aligned with sequences in the BLAST database. Remember, this is a complete record with no annotation, since I want to analyze the whole genome and not only the coding regions. I'm currently experimenting with the -outfmt parameter, but this will take a while since each analysis takes a long time. Web BLAST is out of the question since the analysis is too CPU intensive (it hits the CPU usage limit on web BLAST).
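For reference, the tabular format (-outfmt 6) accepts a custom column list, and the qseq and sseq specifiers print the aligned part of the query and of the subject for each hit. Below is a minimal Python sketch of requesting and parsing those columns. The file and database names passed to `blastn_command` and the helper names `blastn_command` and `parse_hits` are placeholders made up for illustration, and the example line is invented data that only shows the shape of the output.

```python
import csv
import io

# Columns requested from blastn; qseq and sseq are the aligned query and
# subject strings (gaps included) for each HSP.
FIELDS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
          "sstart", "send", "evalue", "bitscore", "qseq", "sseq"]


def blastn_command(query, db, out):
    """Build a blastn argument list (query, db and out are placeholders)."""
    return ["blastn", "-query", query, "-db", db,
            "-outfmt", "6 " + " ".join(FIELDS), "-out", out]


def parse_hits(handle):
    """Yield one dict per HSP from tabular output written with FIELDS."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        for key in ("length", "qstart", "qend", "sstart", "send"):
            hit[key] = int(hit[key])
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        yield hit


# Invented example line, only to show the shape of the data.
example = ("PAO1\tsubj1\t95.0\t20\t101\t120\t5\t24\t1e-05\t30.1\t"
           "ACGTACGTACGTACGTACGT\tACGTACGTACGAACGTACGT\n")
hits = list(parse_hits(io.StringIO(example)))
cmd = blastn_command("PAO1.fasta", "nt", "hits.tsv")
```

The command list can be handed to `subprocess.run` once the names point at real files. Since each run is slow, it may be worth testing the column list on a small slice of the genome first.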

If anyone knows how to solve this, I ask you kindly to show me the way. Thanks.

blastn DNA alignment • 1.5k views

goal of the analysis is to identify nucleotide sequences exclusive to Pseudomonas.

Just to be clear: you are trying to find sequences exclusive to Pseudomonas as compared to the rest of GenBank?

But then you say this:

I came up with an idea to use local blastn to blast search the whole genome against the Pseudomonas nt database.

What is the query here?


The query is Pseudomonas aeruginosa PAO1 (reference genome). The query will be blasted against the nt database. The idea is that running P. aeruginosa against the Pseudomonas genus will identify sequences shared within the genus, and running against bacteria (excluding Pseudomonas) will ensure the sequences are exclusive to Pseudomonas. I thought about running a core genome analysis, but I don't think I have the equipment for it (too CPU intensive). Analyses against the nr/nt database include sequences from GenBank+EMBL+DDBJ+PDB+RefSeq. I'm also using BLAST DBv5 to facilitate the inclusion or exclusion of taxids.
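As a rough illustration of the two taxid-restricted searches described above, here is a Python sketch that only builds the two command lines. It assumes a version 5 nt database, the blastn options -taxids and -negative_taxids, and NCBI taxid 286 for the genus Pseudomonas. Whether a genus-level taxid is expanded to its species automatically depends on the BLAST+ release, so that should be checked against the installed version. The file names and the helper `two_step_commands` are hypothetical.

```python
PSEUDOMONAS_TAXID = "286"  # NCBI Taxonomy ID for the genus Pseudomonas


def two_step_commands(query, step1_hits, db="nt"):
    """Return (within_genus, outside_genus) blastn argument lists.

    query      : FASTA file with the PAO1 genome (placeholder name)
    step1_hits : FASTA file with the regions kept after step 1 (placeholder)
    """
    columns = "6 qseqid sseqid pident length qstart qend evalue bitscore"
    # Step 1: search only Pseudomonas entries of the database.
    within = ["blastn", "-query", query, "-db", db,
              "-taxids", PSEUDOMONAS_TAXID,
              "-outfmt", columns, "-out", "step1_within_genus.tsv"]
    # Step 2: search everything except Pseudomonas. Note that this is all of
    # nt minus the genus, not only bacteria; restricting to bacteria as well
    # would need a prepared taxid list file.
    outside = ["blastn", "-query", step1_hits, "-db", db,
               "-negative_taxids", PSEUDOMONAS_TAXID,
               "-outfmt", columns, "-out", "step2_outside_genus.tsv"]
    return within, outside


within, outside = two_step_commands("PAO1.fasta", "step1_regions.fasta")
```

Regions that hit in step 1 and produce no accepted hit in step 2 would be the candidates, subject to the threshold choices.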


I am having trouble understanding how you think this is going to work. How are you deciding thresholds for sequence similarity? nt contains diverse sequences, so most of the PAO1 genome would hit something in the database. Excluding Pseudomonas genomes from other bacterial genomes would still generate plenty of hits to genomes that are taxonomically similar.

Assuming you did identify sequences that are Pseudomonas specific, what do you want to use them for?

Have you considered using sourmash (https://sourmash.readthedocs.io/en/latest/command-line.html) to manage the genomes that could go into these comparisons?


The idea is to use E-value and bit-score to evaluate sequence similarity and then filter out the sequences from the output file. By running BLAST against nt, we could select the non-hit sequences and predict them as Pseudomonas-specific. Before that, as I said, we would BLAST PAO1 against the genus Pseudomonas (and these sequences will be analyzed against nt). I'm not sure this will work since I've never done something on this scale. The goal is to use Pseudomonas-specific regions for primer design. Again, the person who asked me this did not give me enough information. This person's idea is based on an article that analyzes the presence of bacterial sequences in a diversity of cancers. According to this article, many bacterial reads were found in cancer samples (could just be contamination), including Pseudomonas-like, Acinetobacter-like, Rodentia spp., H. pylori, and others. Since the article is so vague, I was asked to identify Pseudomonas-specific sequences. It would be much easier if it was just one species. Basically, the final objective is to design primers for Pseudomonas.


By running BLAST against nt, we could select the non-hit sequences and predict them as Pseudomonas-specific.

Start there and see what you get. I am afraid you are going to get hits to something until you start playing with search stringency. At that point, the sequences with "no hits" may not necessarily be Pseudomonas specific. They could just be an artifact of your search settings.
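To make the stringency point concrete, here is a small Python sketch that takes hit coordinates on the query genome, keeps only hits passing an E-value and bit-score cutoff, and reports the query regions left uncovered. The cutoffs (1e-5 and 50), the function name `uncovered_regions`, and the hit tuples are all invented for illustration, not recommended values.

```python
def uncovered_regions(genome_length, hits, max_evalue=1e-5, min_bitscore=50.0):
    """Return 1-based inclusive (start, end) query regions with no accepted hit.

    hits is an iterable of (qstart, qend, evalue, bitscore) tuples.
    """
    # Keep hits that pass both cutoffs; normalise so start <= end.
    intervals = sorted(
        (min(qs, qe), max(qs, qe))
        for qs, qe, evalue, bitscore in hits
        if evalue <= max_evalue and bitscore >= min_bitscore
    )
    gaps = []
    next_free = 1  # first position not yet covered by an accepted hit
    for start, end in intervals:
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= genome_length:
        gaps.append((next_free, genome_length))
    return gaps


# Invented hits on a 1000 bp query: two strong overlapping hits, one weak hit.
hits = [(1, 100, 1e-30, 180.0), (90, 250, 1e-50, 290.0), (400, 450, 0.5, 22.0)]
strict = uncovered_regions(1000, hits)
loose = uncovered_regions(1000, hits, max_evalue=10.0, min_bitscore=0.0)
```

With the strict cutoffs the weak hit is ignored and positions 251-1000 come out as one "no hit" region; with the loose cutoffs the same data gives 251-399 and 451-1000. The set of "specific" regions therefore depends directly on the thresholds chosen.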

Since the article is so vague, I was asked to identify Pseudomonas-specific sequences.

If you suspect your cancer data contain bacterial sequences, it may be good to start there. Use something like bbsplit from the BBMap suite to bin host reads away, and then use the remaining reads to search against nt or nr bacteria. If you consistently see something, then that may be one avenue to go forward.


Thank you. I think this analysis I proposed is extremely intensive both for normal computers and one person. I'm trying another strategy. After reading more about primers (I'm not that familiar with primer design, just know the basics), I think the person who asked me to do this wants a universal primer for Pseudomonas. Since this is going a little off-topic, I'm going to open a new post.
