Question

blast only specific sequences from a large file

2

Entering edit mode

10.4 years ago

Kate ▴ 20

I have 600,000 + unique protein sequences that I want to blast against themselves. For example, I want to run sequence 1 against sequence 1, sequence 2 against sequence 2, ... sequence 3453 against sequence 3453 etc.

I have local blast installed on my computer and blastdb is prepared for the protein sequences. Problem is, each sequence has to be blast against the whole database when I will only retain the hit from a unique sequence against itself. Needless to say, I need a much more efficient method! Any idea?

alignment blast • 3.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.4 years ago by Kate ▴ 20

1

Entering edit mode

I'm curios of what kind of results you're expecting, as clearly the best hit in every case will have 100% similarity covering the entire query sequence length.

ADD REPLY • link 10.4 years ago by 5heikki 11k

2

Entering edit mode

10.4 years ago

mikhail.shugay 3.5k

Just iterate over your sequences and sum the diagonal element of substitution matrix (http://homepages.ulb.ac.be/~dgonze/TEACHING/stat_scores.pdf) for each amino acid, apply some basic math and you'll get the result (I suspect you are looking for a way to compute E-values).

ADD COMMENT • link 10.4 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Thanks, that would be a good idea except that I wouldn't know how to start computing this... I want to get the normalized score out of the sequences to get an idea of the highest score they can generate.

ADD REPLY • link 10.4 years ago by Kate ▴ 20

0

Entering edit mode

Pretty sure you can calculate the scores without doing a single blast utilizing only the information of sequence lengths, although there's of course database size parameter which will affect at least e-values, but maybe not bitscores (or at least raw scores)

ADD REPLY • link 10.4 years ago by 5heikki 11k

0

Entering edit mode

Thanks, that is my goal, so doing a blast is not something I look for as I already know my sequence are a perfect match and I don't need the evalue. I manage to find the different scoring matrices from NCBI, and I made a script to calculate raw score as suggested Mikhail. Now yes, I'll put in the length parameter to normalize it and I just have to find out how I can get the K and lambda parameters right.

ADD REPLY • link 10.4 years ago by Kate ▴ 20

1

Entering edit mode

10.4 years ago

mxs ▴ 530

Hi,

If I understood the problem correctly what I would do is to first create the database and blast one sequence against it using the default output format option. In that output file (if I remember correctly) there should be information about the effective database length (EFSIZE). Next, I would extract that number and then create a script like this one:

perl -e 'for(1..600000){system("blastp -query in.fa.$_ -subject in.fa.$_ -dbsize EFSIZE  -out bout.$_ -evalue ....")}'

where:

in.fa.1 - single sequence

Of course this means that the initial database should be divided accordingly.

I must admit that I haven't done any real blasting in a while so there is a chance I might be missing something here.

cheers

mxs

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by mxs ▴ 530

0

Entering edit mode

Thanks, this works too. As I only need the normalized score, running blastp without any other information works well. The speed is comparable to what I wrote on Pierre comment. thanks!

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.4 years ago by Kate ▴ 20

0

Entering edit mode

10.4 years ago

Jean-Karim Heriche 27k

Use a script to wrap a call to blastp for each protein using the two sequences form of blast: blastp -query ... -subject ...

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

But that would mean creating 600,000 database files (makeblastdb) prior to running blast... It would be quite heavy.

ADD REPLY • link 10.4 years ago by Kate ▴ 20

0

Entering edit mode

No, create one database with all your sequences, then blast the reads against that. This is very common. You would then filter out self hits and parse out whatever you are interested in from the report (it's not clear exactly what that is though).

ADD REPLY • link 10.4 years ago by SES 8.6k

0

Entering edit mode

That is what I did and I had to give up as each sequence is compared to hundreds of thousands which is a loss of time (it took 18 hours @ 24 cpu).

ADD REPLY • link 10.4 years ago by Kate ▴ 20

0

Entering edit mode

For the record, this way of using blast doesn't require a database. See Pierre Lindenbaum's post for a development of what I meant.

ADD REPLY • link 10.4 years ago by Jean-Karim Heriche 27k

Ram · Accepted Answer · 2015-03-30

3

Entering edit mode

10.4 years ago

Pierre Lindenbaum 166k

curl  -ks "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" | gunzip -c |\
awk '/^>/{printf("\n%s\n",$0);next;} {printf("%s",$0);} END{printf("\n");}' | grep -v '^$' |\
while read T
do
    echo "$T" > prot.fa 
    read S
    echo "$S" >> prot.fa
    blastp -query prot.fa -subject prot.fa
done

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

All right this works! On a test file, my previous method (blasting against all database) took 17 minutes on 24 cpu, yours takes half of the time (8 minutes). I thought it would go much much quicker, but well...

ADD REPLY • link 10.4 years ago by Kate ▴ 20