Question

Blastn - unexpected behaviour

5.8 years ago

bvm ▴ 20

I ran into unexpected behaviour in blastn. After extracting gene sequences from a genome, creating a BLAST database from them, and blasting the genome against that database, many of the extracted genes are missing from the BLAST results, although they are certainly in the genome (they were extracted from it). What could be the cause? Some details (I cannot upload the whole files):

My command is:

blastn -query GCF_000005845.2_ASM584v2_genomic.fna -db MG1655_genes -outfmt 6

The FASTA file was downloaded from https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521.

The database was built from the gene sequences extracted using the feature table of the assembly above.

One gene missing from the BLAST output is, for example, aaaD. However, when I blast only this gene, it is found as expected.
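
To see exactly which genes are missing rather than spotting them by eye, one option is to compare the IDs in the gene FASTA with the subject IDs (column 2) of the tabular output. The following is only a sketch: the function name missing_genes and the demo file contents are made up for illustration, and it assumes the FASTA header IDs match the sseqid values that blastn reports.

```shell
# List sequence IDs that occur in a FASTA file but not in column 2 (sseqid)
# of BLAST tabular output (-outfmt 6).
# Usage: missing_genes genes.fna blast_result.tsv
missing_genes() {
    ids=$(mktemp)
    hits=$(mktemp)
    # FASTA IDs: first word of each header line, without the leading '>'
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u > "$ids"
    # Subject IDs that received at least one hit
    cut -f2 "$2" | sort -u > "$hits"
    # Lines found only in the first file = genes without any hit
    comm -23 "$ids" "$hits"
    rm -f "$ids" "$hits"
}

# Tiny made-up demo: two genes in the FASTA, only thrA has a hit.
demo=$(mktemp -d)
printf '>aaaD\nATGC\n>thrA\nGGCC\n' > "$demo/genes.fna"
printf 'NC_000913.3\tthrA\t100.0\t4\t0\t0\t1\t4\t1\t4\t1e-5\t8.0\n' > "$demo/hits.tsv"
missing_genes "$demo/genes.fna" "$demo/hits.tsv"   # prints: aaaD
```

If the search is run the other way round (genes as query), the gene IDs are in column 1 instead, so the cut -f2 would become cut -f1.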

I think you should blast the genes against the genome: index the genome first, then blast your genes (in FASTA format) against it. This makes more sense, and you also get the location of each gene.

Reply by gb ★ 2.2k, 5.8 years ago

Using a genome as search query against a list of genes is probably not a great idea (unless you have a specific reason for it). Have you considered doing the search in reverse?

Also try adding -task blastn to your command line to see if it makes a difference. The default task is megablast.

Reply by GenoMax 151k, 5.8 years ago

I chose this approach because the goal is to find the same genes in a lot of genomes, but you're right that it is better to do it the other way round: I'll use the set of genes as the query and the individual genomes as subjects.

Adding -task blastn made a difference, but still not all genes appeared.
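
The reversed search over several genomes could look roughly like the sketch below. It relies on blastn's -subject option, which compares the query against a FASTA file directly without building a database first. The file name MG1655_genes.fna, the genomes directory, and the result_name helper are assumptions made for illustration, not names from this thread.

```shell
GENES=MG1655_genes.fna   # assumed: FASTA file with the extracted genes
GENOME_DIR=genomes       # hypothetical directory with one .fna file per genome

# Derive a result file name from a genome path:
# genomes/foo.fna -> foo.genes.tsv
result_name() {
    printf '%s.genes.tsv\n' "$(basename "$1" .fna)"
}

# Run the searches only if BLAST+ and the gene file are actually available.
if command -v blastn >/dev/null 2>&1 && [ -f "$GENES" ]; then
    for genome in "$GENOME_DIR"/*.fna; do
        [ -f "$genome" ] || continue   # skip if the glob matched nothing
        blastn -query "$GENES" -subject "$genome" -outfmt 6 \
            -out "$(result_name "$genome")"
    done
fi
```

With the genes as the query, each gene gets its own set of hits, and the subject coordinates in the tabular output give the gene's location in the genome.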

Reply by bvm ▴ 20, 5.8 years ago

score 2 · Accepted Answer · 2019-08-14

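After doing some research, I found the answer to my question. The default value of max_target_seqs is 500. With the genome as the query, every gene in the database is a separate target sequence, so at most 500 genes are reported for it. After raising max_target_seqs to an arbitrarily high value, all genes are shown.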

Hence I used

blastn -query GCF_000005845.2_ASM584v2_genomic.fna -db MG1655_genes -outfmt 6 -max_target_seqs 100000000

to obtain all genes.
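
Instead of an arbitrarily large number, the limit could also be set to the number of sequences in the database, which is enough to keep any target from being cut off. This is a sketch under assumptions: MG1655_genes.fna is assumed to be the FASTA file the MG1655_genes database was built from, and count_seqs and the output file name are illustrative.

```shell
# Count the sequences in a FASTA file (one header line per sequence).
count_seqs() {
    grep -c '^>' "$1"
}

GENES_FASTA=MG1655_genes.fna   # assumed: FASTA used to build the MG1655_genes db

# Run the search only if BLAST+ and the gene FASTA are actually available.
if command -v blastn >/dev/null 2>&1 && [ -f "$GENES_FASTA" ]; then
    n=$(count_seqs "$GENES_FASTA")
    blastn -query GCF_000005845.2_ASM584v2_genomic.fna -db MG1655_genes \
        -outfmt 6 -max_target_seqs "$n" -out genome_vs_genes.tsv
fi
```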