Why does tblastn return a different number of hits when I use the -parse_seqids option in makeblastdb?
5.2 years ago
carina2817 ▴ 20

Hello,

I am running tblastn and getting a different number of hits depending on whether I add the -parse_seqids option to makeblastdb. I initially ran the following command to build the database:

makeblastdb -in GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna -dbtype nucl

Then I ran tblastn with the following command:

tblastn -db GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna -query S_cerevisiae_all_prot_uniq_join.fa -out S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test_original.out -outfmt '6 qseqid qgi qlen sseqid bitscore length pident qcovs evalue qstart qend sstart send qseq sseq' -num_threads 20

The output has 194 hits.

I then tried to retrieve the DNA sequences of the tblastn hits to run a reciprocal BLAST test and got an error, which I solved by adding the -parse_seqids option to makeblastdb [https://github.com/lindenb/jvarkit/issues/134]:

makeblastdb -in GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna -dbtype nucl -out blastdb_E_rowelli -parse_seqids

and I ran tblastn again with this new database:

tblastn -db blastdb_E_rowelli -query S_cerevisiae_all_prot_uniq_join.fa -out S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test.out -outfmt '6 qseqid qgi qlen sseqid bitscore length pident qcovs evalue qstart qend sstart send qseq sseq' -num_threads 20

This time I get 195 hits. I am working with many genomes and have this problem with some of them (the biggest difference is 20 hits). Do you have any idea how to correct this, or which output I should use?
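To see which queries account for the difference between two result files, one option is to count hits per query in each file and list only the queries whose counts disagree. The sketch below is only an illustration: the function names hits_per_query and diff_hit_counts are made up for this example, and it assumes tab-separated -outfmt 6 output with qseqid in the first column. It compares on the query ID alone because the sseqid column may be written differently by databases built with and without -parse_seqids.

```shell
# Count hits per query (column 1 of the tabular output), sorted by query ID.
hits_per_query() {
    awk -F'\t' '{ n[$1]++ } END { for (q in n) print q "\t" n[q] }' "$1" | LC_ALL=C sort
}

# Print "query<TAB>count_in_file1<TAB>count_in_file2" for every query whose
# hit count differs between the two files (a missing query counts as 0).
diff_hit_counts() {
    a=$(mktemp)
    b=$(mktemp)
    hits_per_query "$1" > "$a"
    hits_per_query "$2" > "$b"
    # -a 1 -a 2 keeps queries present in only one file; -e 0 fills the gap.
    LC_ALL=C join -t "$(printf '\t')" -a 1 -a 2 -e 0 -o 0,1.2,2.2 "$a" "$b" |
        awk -F'\t' '$2 != $3'
    rm -f "$a" "$b"
}

# Example call with the two output files from this post:
# diff_hit_counts S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test_original.out \
#                 S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test.out
```

An empty result means both files have the same number of hits for every query, although individual alignments could still differ.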

Thanks.

Paola

Blast makeblastdb tblastn • 4.2k views

Can you confirm that makeblastdb -parse_seqids does not give a warning such as "duplicate accessions found"?
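Independently of any makeblastdb warning, repeated sequence IDs can be looked for directly in the FASTA file. This is a rough sketch, not part of BLAST: the function name duplicate_fasta_ids is made up here, and it assumes the ID is the text between ">" and the first whitespace on each header line.

```shell
# Print every sequence ID that appears on more than one header line.
duplicate_fasta_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | LC_ALL=C sort | uniq -d
}

# Example call with the genome file from the question:
# duplicate_fasta_ids GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna
```

No output means no ID is repeated under that assumption.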


I don't get any warning message.

5.1 years ago
carina2817 ▴ 20

It turns out the problem was not the -parse_seqids option. I discovered that, with some of the genomes I am using, the results file had a different number of hits every time I ran BLAST, and that this happens (with some genomes) in all BLAST versions after 2.3.0. I sent an e-mail to BLAST support and they told me it is a new bug and that they would file a report. The problem occurs when using many processors: using 1 processor does not produce it, and the results should be consistent at some number of processors above 1.
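A simple way to check whether a given setup is affected is to run the same search twice and compare the sorted outputs. The helper below is a sketch under assumptions: is_reproducible is a made-up name, and the command passed to it must write its results to standard output (tblastn does this when -out is omitted).

```shell
# Run the given command twice and succeed only if the sorted outputs match.
is_reproducible() {
    first=$(mktemp)
    second=$(mktemp)
    "$@" | LC_ALL=C sort > "$first"
    "$@" | LC_ALL=C sort > "$second"
    if cmp -s "$first" "$second"; then rc=0; else rc=1; fi
    rm -f "$first" "$second"
    return "$rc"
}

# Example calls with the database and query from the question,
# first with a single thread, then with 20 threads:
# is_reproducible tblastn -db blastdb_E_rowelli -query S_cerevisiae_all_prot_uniq_join.fa \
#     -outfmt 6 -num_threads 1 && echo "consistent with 1 thread"
# is_reproducible tblastn -db blastdb_E_rowelli -query S_cerevisiae_all_prot_uniq_join.fa \
#     -outfmt 6 -num_threads 20 || echo "results differ between runs with 20 threads"
```

Two matching runs do not prove the results are always stable, so several repetitions may be needed for genomes where the difference shows up only occasionally.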


Oh! Thanks for updating your post.
