Different blast results between NR or local genome
1
0
Entering edit mode
23 months ago
pablo ▴ 310

Hello,

I did an assembly of a yeast strain. The number of contigs and their size is fine.

I now try to assign each contig to the corresponding chromosome.

I create a subset of the first 10,000nts of each contig, with bedtools getfasta , that I blastn either against NR database or against the reference S288C I downloaded from NCBI , what is much faster.

I use :

blastn -query  contig1.fasta -db NR  -out contig1.blastresults.1-10000 -evalue 1e-10 -outfmt '6 qseqid sseqid evalue pident length salltitles' -num_threads 32 -max_target_seqs 1

I then parse the output to do the correspondance contig <-> chromosome. When I compare the results for both database, for certain contigs, the affiliated chromosome can be different. Most of the time, it is the same results.

My question is more "biological" I guess. Do you know why there is a difference of chromosome for this well-known species? Is there a difference between the NR yeast database and the reference genome S288C?

Best

blast yeast • 965 views
ADD COMMENT
2
Entering edit mode
23 months ago
GenoMax 147k

Since blast is doing local alignments you may be getting those kind of results. Always use the smallest database you logically need. There is no need to use a large database if you already know you have S288C data.

Also use an aligner that is looking at global alignments (if you are searching with 10kb reads). minimap2, lastz or even blat may all be better than blast.

ADD COMMENT
0
Entering edit mode

I agree with genomax. Running Mummer (your assembly vs the reference) would be much faster and more accurate than blast.

ADD REPLY
0
Entering edit mode

Thanks for your feedback. In reality, I have 16 assemblies : that's why I wanted to go with BLAST which is pretty easy to parse the results. I will have a look at aligners.

ADD REPLY
0
Entering edit mode

You could also simply become much more strict with your parameters (e.g., e-100), and do some kind of filtering of the repetitive parts of the query or db sequence (as their being present multiple times in the genome can falsify your results).

ADD REPLY

Login before adding your answer.

Traffic: 2111 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6