Hello,
I am running tblastn and I am getting different numbers of hits depending on whether I add the option -parse_seqids to makeblastdb. I initially ran the following command to make the database:
makeblastdb -in GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna -dbtype nucl
Then I ran tblastn with the following command:
tblastn -db GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna -query S_cerevisiae_all_prot_uniq_join.fa -out S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test_original.out -outfmt '6 qseqid qgi qlen sseqid bitscore length pident qcovs evalue qstart qend sstart send qseq sseq' -num_threads 20
The output has 194 hits.
But then I tried to retrieve the DNA sequences of the tblastn hits to run the reciprocal BLAST test, and I got an error, which I solved by adding the option -parse_seqids to makeblastdb [https://github.com/lindenb/jvarkit/issues/134]:
makeblastdb -in GCA_003024985.1_Erow_1.0_genomic_Euperipatoides_rowelli.fna -dbtype nucl -out blastdb_E_rowelli -parse_seqids
and I ran tblastn again with this new database:
tblastn -db blastdb_E_rowelli -query S_cerevisiae_all_prot_uniq_join.fa -out S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test.out -outfmt '6 qseqid qgi qlen sseqid bitscore length pident qcovs evalue qstart qend sstart send qseq sseq' -num_threads 20
This time I am getting 195 hits. I am working with many genomes and I have this problem with some of them (the biggest difference is 20 hits). Do you have any idea how to correct this or which output I should select?
Thanks.
Paola
Can you confirm that
makeblastdb -parse_seqids
does not give a warning like "duplicate accessions found" or something like that?
I don't get any warning message.
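To see which hit accounts for the 194 vs 195 difference, one option is to reduce both tabular outputs to a key per hit and compare them. The sketch below is only a rough illustration: the helper name hit_keys and the .keys file names are made up here, and it assumes the column order from the -outfmt string above (qseqid is column 1, sseqid column 4, sstart column 12, send column 13). It also assumes sseqid is written the same way by both databases; if the identifiers are formatted differently, the keys would need normalising first.

```shell
# Hypothetical helper: one sorted, unique key per hit
# (qseqid, sseqid, sstart, send = columns 1, 4, 12, 13 of the outfmt above).
hit_keys() {
    cut -f1,4,12,13 "$1" | LC_ALL=C sort -u
}

orig=S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test_original.out
parsed=S_cerevisiae_all_prot_E_rowelli_tblastn_ful_test.out

# Only run the comparison when both result files are present.
if [ -f "$orig" ] && [ -f "$parsed" ]; then
    hit_keys "$orig" > orig.keys
    hit_keys "$parsed" > parsed.keys
    echo "Hits only in the database built without -parse_seqids:"
    LC_ALL=C comm -23 orig.keys parsed.keys
    echo "Hits only in the database built with -parse_seqids:"
    LC_ALL=C comm -13 orig.keys parsed.keys
fi
```

Looking up the e-value and bit score of whatever lines this prints in the full output may show whether the extra hit sits close to a reporting threshold.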