Hello,
I am writing a workflow for Neisseria meningitidis. After the assembly (with SPAdes), I want to confirm that I really have this species before continuing.
So I added a blastn step: I picked references from NCBI and ran BLAST against them. When I run BLAST on the complete genome, I get really good hits to Neisseria meningitidis (max bit score 50000), but also a few hits to another bacterium (Salmonella enterica, max bit score 800).
On the other hand, when I cut my genome down to only 500000 nucleotides, I don't get any hits to the other species.
Is this normal?
Is it better to run BLAST on only part of the genome?
How can I tell whether a bit score is good or not?
Thank you
Using a local aligner like BLAST for sequence similarity searches on whole genomes does not seem like a good idea. You are going to see hits to similar organisms (as you do above).
Perhaps you need to think about using an alternative like bbsketch: BBSketch - A Tool for Rapid Sequence Comparison
So you are not sure if the starting sample is pure Neisseria or if it contains other organisms?
Hi, thank you for your help and sorry if I wasn't clear.
I'm almost 100% sure that it is Neisseria. But my boss still wants to add a verification step, in case there was contamination or something.
So I picked reference assemblies of different species (including N. meningitidis) from NCBI RefSeq, created a database, and ran BLAST against it with my samples as input.
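To put a handful of stray hits in perspective, it can help to total the BLAST results per reference rather than look at individual hits. Below is a minimal Python sketch, assuming the BLAST output was saved in the default 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `summarize_blast_hits` and the subject IDs in the usage example are made up for illustration.

```python
from collections import defaultdict


def summarize_blast_hits(lines):
    """Summarize tabular BLAST hits (default -outfmt 6 columns) per subject.

    Returns a dict: subject id -> number of hits, summed alignment length,
    and best bit score. Overlapping hits are NOT merged, so the summed
    length is only a rough indicator, not an exact covered length.
    """
    totals = defaultdict(lambda: {"hits": 0, "aligned_bp": 0, "best_bitscore": 0.0})
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        sseqid = fields[1]            # column 2: subject sequence id
        length = int(fields[3])       # column 4: alignment length
        bitscore = float(fields[11])  # column 12: bit score
        entry = totals[sseqid]
        entry["hits"] += 1
        entry["aligned_bp"] += length
        entry["best_bitscore"] = max(entry["best_bitscore"], bitscore)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical example rows, not real results.
    example = [
        "contig1\tNmen_ref\t99.5\t50000\t250\t0\t1\t50000\t1\t50000\t0.0\t50000",
        "contig2\tNmen_ref\t99.0\t30000\t300\t0\t1\t30000\t60000\t90000\t0.0\t30000",
        "contig3\tSent_ref\t85.0\t900\t135\t0\t1\t900\t1000\t1900\t0.0\t800",
    ]
    for subject, stats in summarize_blast_hits(example).items():
        print(subject, stats)
```

If nearly all of the aligned bases go to the expected reference and only a few hundred bases go to another species, the short hits are more likely conserved regions than contamination, although that remains a judgment call.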
How about flipping this test around? You could create a simulated read dataset (Illumina reads, PE, 100 bp) using the Neisseria genome from RefSeq, then align those reads against your assembly. Look at the alignment % and the coverage across the genome (depth). The former should be very high (depending on how similar your strain is to the reference).
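Once the simulated reads are aligned, the depth check could be summarized along these lines. This is only a sketch, assuming a three-column, tab-separated per-position depth table (contig, position, depth) that includes zero-depth positions, such as the one `samtools depth -a` writes. The function name `summarize_depth` and the `min_depth` threshold are illustrative choices, not something from the thread.

```python
def summarize_depth(lines, min_depth=1):
    """Summarize a per-position depth table (contig<TAB>position<TAB>depth).

    Returns the number of positions seen, the breadth of coverage (fraction
    of positions with depth >= min_depth), and the mean depth. Breadth is
    only meaningful if zero-depth positions are present in the input.
    """
    positions = 0
    covered = 0
    total_depth = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _contig, _pos, depth = line.split("\t")
        depth = int(depth)
        positions += 1
        total_depth += depth
        if depth >= min_depth:
            covered += 1
    if positions == 0:
        raise ValueError("no depth records found")
    return {
        "positions": positions,
        "breadth": covered / positions,
        "mean_depth": total_depth / positions,
    }


if __name__ == "__main__":
    # Hypothetical example rows, not real results.
    example = ["contig1\t1\t10", "contig1\t2\t0", "contig1\t3\t20", "contig1\t4\t10"]
    print(summarize_depth(example))
```

A breadth close to 1 with fairly even depth would support the assembly matching the reference, while large uncovered stretches would point to contigs that the reference reads cannot explain.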
It seems like a great idea. Thanks for your help!