Hello,
I did an assembly of a yeast strain. The number of contigs and their size is fine.
I now try to assign each contig to the corresponding chromosome.
I create a subset of the first 10,000nts of each contig, with bedtools getfasta
, that I blastn
either against NR database or against the reference S288C I downloaded from NCBI , what is much faster.
I use :
blastn -query contig1.fasta -db NR -out contig1.blastresults.1-10000 -evalue 1e-10 -outfmt '6 qseqid sseqid evalue pident length salltitles' -num_threads 32 -max_target_seqs 1
I then parse the output to do the correspondance contig <-> chromosome. When I compare the results for both database, for certain contigs, the affiliated chromosome can be different. Most of the time, it is the same results.
My question is more "biological" I guess. Do you know why there is a difference of chromosome for this well-known species? Is there a difference between the NR yeast database and the reference genome S288C?
Best
I agree with genomax. Running Mummer (your assembly vs the reference) would be much faster and more accurate than blast.
Thanks for your feedback. In reality, I have 16 assemblies : that's why I wanted to go with BLAST which is pretty easy to parse the results. I will have a look at aligners.
You could also simply become much more strict with your parameters (e.g., e-100), and do some kind of filtering of the repetitive parts of the query or db sequence (as their being present multiple times in the genome can falsify your results).