Question

Different BLAST results between NR and a local genome

23 months ago

pablo ▴ 310

Hello,

I assembled a yeast strain. The number of contigs and their sizes look fine.

I am now trying to assign each contig to its corresponding chromosome.

I extract the first 10,000 nt of each contig with bedtools getfasta, then run blastn on that subset against either the NR database or the S288C reference I downloaded from NCBI; the latter is much faster.

I use:

blastn -query contig1.fasta -db NR -out contig1.blastresults.1-10000 -evalue 1e-10 -outfmt '6 qseqid sseqid evalue pident length salltitles' -num_threads 32 -max_target_seqs 1

I then parse the output to map each contig to a chromosome. When I compare the results from the two databases, the assigned chromosome differs for certain contigs, although most of the time the result is the same.
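The parsing and comparison step described above could look roughly like the following minimal sketch. It assumes the tab-separated column order given in the -outfmt string (qseqid, sseqid, evalue, pident, length, salltitles). The function names and the regular expression that pulls a chromosome label out of salltitles are hypothetical and would need adjusting to how the subject titles are actually written in each database.

```python
import re

# Column order matches: -outfmt '6 qseqid sseqid evalue pident length salltitles'
COLUMNS = ["qseqid", "sseqid", "evalue", "pident", "length", "salltitles"]

# Hypothetical pattern: assumes subject titles contain text like "chromosome XII".
CHROM_RE = re.compile(r"chromosome\s+([IVX]+|\d+)", re.IGNORECASE)


def contig_to_chromosome(lines):
    """Map each query ID to the chromosome label of its first reported hit."""
    assignment = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        match = CHROM_RE.search(hit["salltitles"])
        # Fall back to the raw subject ID when no chromosome label is found.
        label = match.group(1).upper() if match else hit["sseqid"]
        # Keep only the first hit seen for each query.
        assignment.setdefault(hit["qseqid"], label)
    return assignment


def disagreements(first, second):
    """Return queries present in both mappings whose labels differ."""
    return {
        query: (first[query], second[query])
        for query in first
        if query in second and first[query] != second[query]
    }
```

Calling contig_to_chromosome on the open result file from each database and passing both dictionaries to disagreements would list only the contigs whose assignment changes, which is usually the quickest way to see whether the conflicting hits share a pattern.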

My question is more "biological", I guess. Do you know why the assigned chromosome would differ for such a well-known species? Is there a difference between the yeast sequences in NR and the S288C reference genome?

Best

blast yeast • 965 views

updated 23 months ago by SequenceServer ▴ 20 • written 23 months ago by pablo ▴ 310

score 2 · Answer 1 · 2022-12-14

23 months ago

GenoMax 147k

Since BLAST performs local alignments, you may get these kinds of results. Always use the smallest database you logically need. There is no need to use a large database if you already know you have S288C data.

Also use an aligner that looks at global alignments (if you are searching with 10 kb queries). minimap2, lastz, or even blat may all be better than BLAST.
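As a rough illustration of the aligner route, here is a minimal sketch that assigns each contig from minimap2's PAF output. It assumes the alignment was produced with something like `minimap2 -x asm5 reference.fasta contigs.fasta > contigs.paf` (the file names are placeholders). The function name and the rule of picking the target with the most matching bases are assumptions made for illustration, not a recommendation from the posters.

```python
from collections import defaultdict


def assign_from_paf(lines, min_mapq=0):
    """For each query, pick the target sequence with the most matching bases.

    PAF columns used (1-based): 1 query name, 6 target name,
    10 number of matching bases, 12 mapping quality.
    """
    matches = defaultdict(lambda: defaultdict(int))
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        query, target = fields[0], fields[5]
        n_match, mapq = int(fields[9]), int(fields[11])
        if mapq < min_mapq:
            continue  # drop low-confidence alignments
        matches[query][target] += n_match
    # Choose the target that accumulated the most matching bases.
    return {q: max(t, key=t.get) for q, t in matches.items()}
```

Summing matches over all alignments of a contig, instead of trusting a single top hit, makes the assignment less sensitive to one repeat-driven local match.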

23 months ago by GenoMax 147k

I agree with GenoMax. Running MUMmer (your assembly vs. the reference) would be much faster and more accurate than BLAST.

23 months ago by Buffo ★ 2.4k

Thanks for your feedback. I actually have 16 assemblies, which is why I wanted to go with BLAST: its results are pretty easy to parse. I will have a look at the aligners.

23 months ago by pablo ▴ 310

You could also simply be much stricter with your parameters (e.g., an e-value cutoff of 1e-100) and filter out the repetitive parts of the query or database sequences, since sequences that occur multiple times in the genome can distort your results.
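One way to approximate this after the search, sketched under stated assumptions: keep only hits that pass a strict e-value and a minimum alignment length, then flag queries whose surviving hits land on more than one chromosome. This is a post-hoc proxy, not actual repeat masking, and it only helps if more than one hit per query is reported. The function name, the tuple layout, and the 5,000 nt length threshold are all hypothetical.

```python
def strict_assignments(rows, max_evalue=1e-100, min_length=5000):
    """Split queries into uniquely assigned and ambiguous ones.

    rows: iterable of (qseqid, chromosome, evalue, length) tuples,
    for example built from parsed BLAST tabular output.
    min_length is an illustrative threshold, not a recommended value.
    """
    kept = {}
    for qseqid, chromosome, evalue, length in rows:
        # Keep only strong, long hits.
        if float(evalue) <= max_evalue and int(length) >= min_length:
            kept.setdefault(qseqid, set()).add(chromosome)
    # Queries with strong hits on a single chromosome.
    unique = {q: next(iter(c)) for q, c in kept.items() if len(c) == 1}
    # Queries with strong hits on several chromosomes (likely repeats).
    ambiguous = {q: sorted(c) for q, c in kept.items() if len(c) > 1}
    return unique, ambiguous
```

Queries that end up in the ambiguous group are the ones worth inspecting by hand or re-querying with a different region of the contig.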

23 months ago by SequenceServer ▴ 20