I ran into an unexpected feature of blastn. After extracting some gene sequences from a genome, creating a blast database and blasting back to the reference, lot of extracted genes are not found in the blast result, while they are certainly there in the genome (as they were extracted from there) What can be the cause? Some details (I cannot upload the whole files):
My command is:
blastn -query GCF_000005845.2_ASM584v2_genomic.fna -db MG1655_genes -outfmt 6
The fasta file is downloaded from https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521.
The database is gained from extracting the feature table belonging to the assembly above.
A missing gene from the blast is e.g. aaaD
. However, if blasting only this gene, it is naturally found.
Think you should blast the genes against the genome. So you index the genome first and then blast your genes in fasta format against it. This makes more sense and you can get the location of the gene.
Using a genome as search query against a list of genes is probably not a great idea (unless you have a specific reason for it). Have you considered doing the search in reverse?
Also trying adding
-task blastn
to your command line to see if it makes a difference. Default ismegablast
.I thought of this approach because the goal is to find the same genes in a lot of genomes, but you're right to do it in the other way - I'll use the set of genes as query and the specific genomes as subjects.
If adding
-task blastn
there was a difference, but still not all genes occurred.