Question

How To Use Blast To Find Exact Matches Of Short Sequences?

2


11.4 years ago

Free Man ▴ 180

Hi, I am using tblastn (BLAST 2.2.25+) for exact peptide mapping (no gaps).
I want to map a few peptides (about 6 to 50 AAs in length) to a genome.
However, when I tested a known peptide of 6 AAs, tblastn failed to map it.
I have read the BLAST documentation but could not find a solution. What did I miss?
Thank you!
PS. I have also tried PGM (ProteogenomicMapping). This tool maps the known peptide tested above correctly, but it is slow on my computer, which makes large-scale mapping impractical.

blast • 15k views

ADD COMMENT • link updated 11.4 years ago by SRKR ▴ 180 • written 11.4 years ago by Free Man ▴ 180

score 3 · Answer 1 · 2013-08-15

3


11.4 years ago

SRKR ▴ 180

Command-line BLAST has an option, -perc_identity. Set it to 100 and run the search; with that setting you will get hits only where identity is 100%. The command would look like this:

blastn -db dbname -query input_file -out output_file -perc_identity 100

You can try this and I believe it will work. You can also use -word_size to get hits for even shorter peptides, as in your case. Its value must be at least 2 for tblastn:

-word_size 3

You can always list the available options by typing -h (brief) or -help (detailed) after the BLAST program name:

tblastn -help

Hope this helps.

ADD COMMENT • link 11.4 years ago by SRKR ▴ 180

1


Hi, I got an error: "Error: (CArgException::eInvalidArg) Unknown argument: "perc_identity"".
I did not find anything like 'perc_identity' in the help for tblastn. It seems to be available only for blastn. Which version are you using?

ADD REPLY • link 11.4 years ago by Free Man ▴ 180

1


Yeah, I am sorry, I only just noticed that -perc_identity is not available with tblastn. The best option seems to be -ungapped, which avoids gaps, although the hits may still contain mismatches.
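Since -ungapped can still return hits with mismatches, one way to keep only exact, full-length matches is to filter the tabular output afterwards. This is a minimal sketch, assuming tblastn was run with -outfmt "6 qseqid sseqid pident length qlen mismatch sstart send"; the column order and the function name exact_hits are my own choices, not something from this thread.

```python
def exact_hits(lines):
    """Keep hits that cover the whole peptide with 100% identity.

    Assumes each line has the tab-separated columns:
    qseqid sseqid pident length qlen mismatch sstart send
    (an assumed custom -outfmt 6 layout, see the note above).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, length, qlen, mismatch, sstart, send = line.split("\t")
        full_length = int(length) == int(qlen)  # alignment spans the whole peptide
        identical = float(pident) == 100.0 and int(mismatch) == 0
        if full_length and identical:
            kept.append((qseqid, sseqid, int(sstart), int(send)))
    return kept


# Hypothetical example rows, not real BLAST output.
sample = [
    "pep1\tchr1\t100.000\t6\t6\t0\t101\t118",  # exact, full length: kept
    "pep1\tchr2\t100.000\t5\t6\t0\t40\t54",    # identical but partial: dropped
    "pep2\tchr1\t83.333\t6\t6\t1\t900\t883",   # one mismatch: dropped
]
print(exact_hits(sample))
```

With real data you would pass the open output file instead of the sample list, for example exact_hits(open("hits.tsv")), where hits.tsv is a placeholder name.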

ADD REPLY • link 11.4 years ago by SRKR ▴ 180

0


What is your genome size? If it isn't too big, a script could get you the positions. You just have to six-frame translate the genome and search for your amino acid sequences in the translations. That will give you the positions of every exact match throughout the genome.
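A rough sketch of that idea in plain Python, assuming the standard genetic code and a genome small enough to hold in memory as a single string. The function names (reverse_complement, translate, find_peptide) are only illustrative, and codons containing ambiguous bases are translated as X.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    # Unknown or ambiguous codons become "X"; a trailing partial codon is ignored.
    return "".join(CODONS.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def find_peptide(genome, peptide):
    """Return (strand, frame, start, end) for every exact match.

    frame is 1 to 3; start and end are 1-based, inclusive,
    and always given on the forward strand.
    """
    genome = genome.upper()
    peptide = peptide.upper()
    n = len(genome)
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            i = protein.find(peptide)
            while i != -1:
                s = frame + 3 * i            # 0-based start on this strand
                e = s + 3 * len(peptide)     # exclusive end on this strand
                if strand == "+":
                    hits.append((strand, frame + 1, s + 1, e))
                else:
                    hits.append((strand, frame + 1, n - e + 1, n - s))
                i = protein.find(peptide, i + 1)  # allow overlapping matches
    return hits


# Toy genome: the peptide MKPG is encoded by ATGAAACCCGGG at positions 3 to 14.
print(find_peptide("CCATGAAACCCGGGTT", "MKPG"))
```

For a multi-chromosome genome you would call find_peptide once per sequence. Each call retranslates the genome, so for many peptides it would be cheaper to translate the six frames once and search all peptides against them.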

ADD REPLY • link 11.4 years ago by SRKR ▴ 180

0


Thanks for your suggestion! After tedious attempts with various parameters, I found the solution for my project:
Key parameters: -comp_based_stats 0 -ungapped -matrix PAM30 -seg no
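For anyone scripting this, here is a small sketch of how those parameters might be assembled into a full tblastn call from Python. The file and database names are placeholders, and build_tblastn_cmd and run_tblastn are just illustrative helpers; only the four key parameters come from the post above.

```python
import shutil
import subprocess


def build_tblastn_cmd(query, db, out):
    """Assemble a tblastn command line using the key parameters above."""
    return [
        "tblastn",
        "-query", query,            # peptide FASTA (placeholder path)
        "-db", db,                  # nucleotide BLAST database (placeholder name)
        "-out", out,
        "-comp_based_stats", "0",   # no composition-based statistics
        "-ungapped",                # ungapped alignments only
        "-matrix", "PAM30",         # scoring matrix, written as one word
        "-seg", "no",               # no SEG low-complexity filtering of the query
    ]


def run_tblastn(query, db, out):
    # Only attempt to run if tblastn is actually on the PATH.
    if shutil.which("tblastn") is None:
        raise RuntimeError("tblastn not found on PATH")
    subprocess.run(build_tblastn_cmd(query, db, out), check=True)


print(" ".join(build_tblastn_cmd("peptides.fasta", "genome_db", "hits.txt")))
```

Passing the arguments as a list avoids shell quoting problems; calling run_tblastn("peptides.fasta", "genome_db", "hits.txt") would execute the search if BLAST+ is installed.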

ADD REPLY • link 11.4 years ago by Free Man ▴ 180