I'm running blastn from the command line and keep getting weird results -- about half of the sequences have terrible alignments to the query, but have the same scores and E values as perfect matches. Using the -perc_identity
option doesn't seem to change anything. I've tried changing just about everything except for the blast database.
If anyone has any advice I'd be very grateful.
ETA: This is using BLASTN 2.2.30+.
Command line used: blastn -db ../i20.fasta -query query1.txt -out test.txt -perc_identity 100
Example of results:
>HS3:591:C5VMVACXX:4:1101:10795:94073 1:N:0:CTAAGGTC
Length=100
Score = 102 bits (55), Expect = 2e-20
Identities = 12/55 (22%), Gaps = 0/55 (0%)
Strand=Plus/Plus
Query 1 GTATTTTTCAATTCTATTTACGCGTATAATATATCTTCGTCAACTATTGTGGAGT 55
| | | | || || | | | |
Sbjct 33 GGTGGTAAGATTCAATAAATACAATACAAGACACACATTATCTAATGGCTCTTTT 87
>HS3:591:C5VMVACXX:4:2316:8876:76137 1:N:0:CTAAGGTC
Length=100
Score = 102 bits (55), Expect = 2e-20
Identities = 55/55 (100%), Gaps = 0/55 (0%)
Strand=Plus/Plus
Query 1 GTATTTTTCAATTCTATTTACGCGTATAATATATCTTCGTCAACTATTGTGGAGT 55
|||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 41 GTATTTTTCAATTCTATTTACGCGTATAATATATCTTCGTCAACTATTGTGGAGT 95
If you blast against very big contigs, then blast will return all the matches that fit your threshold. For instance your query could match many times against chromosome 1; so the score isn't for any particular match but for chromosome 1 in total.
Can you provide the blast version and the exact command line used and we can provide more help.
Thanks for the reply. I added the version and command line I used to the post.
The database I'm searching is a bunch of sequencing reads, so they're not huge contigs. The thing that's really confusing me is that one sequence that only matches in 12 positions has the same score as another sequence one that matches all the way across.
Can you post the read sequences that were the queries in your example output? And the species of interest (I don't have your database). I wanted to recreate this locally.
Thanks. The query is just the 55 bp sequence shown in the example—the sequencing reads are in the database. This is in an Arabidopsis mutant.
(I hope I'm being clear enough. I'm a newbie to bioinformatics!)
Can you pull the sequencing reads from the database?