How To Find 50 Homologous Sequences That Are Not Too Closely Related?
5
0
11.0 years ago
onpelikan • 0

Hi, I'm searching for about 50 sequences in the non-redundant (nr) BLAST database. I want to test a program for protein mutation prediction; the program tries to estimate whether a mutation is deleterious or neutral.

An example of an analyzed sequence is the well-known LacI repressor. BLAST finds a lot of sequences, but they are too similar. The first 50 sequences are almost identical, so the prediction program has no heterogeneity for its prediction model.

How can I find homologous sequences that are not identical (I want orthologs)? E.g. sequences from other species that are a little different from the query LacI protein.

I tried classic blastp. I also tried another way: first run blastp to collect 2000 sequences, then align them and pass the alignment to psiblast (-in_msa parameter) so that it builds the PSSM from it. Is there another automatic way, or a parameter setting in the BLAST+ package, to find more distant sequences?

EDIT: Constraint: the search process has to be automatic. It is one component of a bigger tool.

blast • 3.8k views
0

I would guess you need to define some constraints, e.g. (1) bitscore thresholds, (2) a species subset (or a taxonomic distance) and (3) conserved domain(s), and then see which BLAST hits satisfy them.

2
11.0 years ago
5heikki 11k

You could filter tabular BLAST output with e.g. awk to keep only hits whose percent identity (column 3) falls within a chosen range:

awk '$3 >= 85 && $3 <= 95' tabularBlastOutputFile > hitsBetween85And95SimilarityPercentage
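
Since the search has to run unattended, the same idea can be wrapped in a small function. This is only a sketch: it assumes the default 12-column tabular format (-outfmt 6), where column 2 is the subject ID and column 3 the percent identity. The function name filter_hits and the thresholds in the usage line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLAST+): keep hits whose percent
# identity (column 3) lies within [min, max], print each subject ID
# (column 2) only once, and stop after `limit` distinct subjects.
filter_hits() {
    min=$1; max=$2; limit=$3; file=$4
    awk -v min="$min" -v max="$max" -v limit="$limit" '
        $3 >= min && $3 <= max && !seen[$2]++ {
            print $2
            if (++n >= limit) exit   # enough subjects collected
        }' "$file"
}

# Usage (file names and thresholds are illustrative):
# filter_hits 30 90 50 tabularBlastOutputFile > selectedSubjectIds
```

A subject with several HSPs is reported once, as soon as any of its HSPs falls inside the identity window.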

2
11.0 years ago

You're looking for a search with improved sensitivity. Try a profile-based search, e.g. HMMER with Pfam.
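
For a command-line pipeline, one possible shape is to run hmmsearch (or phmmer) with --tblout and then pull the target names out of the table. The sketch below assumes the standard --tblout layout, in which column 1 is the target name and column 5 the full-sequence E-value. The function name hmmer_targets, the file names and the E-value cutoff are illustrative only.

```shell
#!/bin/sh
# Searches that would produce the table (shown as comments; not run here):
#   hmmsearch --tblout hits.tbl profile.hmm sequences.fasta
#   phmmer    --tblout hits.tbl query.fasta sequences.fasta

# Hypothetical helper: list target names from a --tblout file whose
# full-sequence E-value (column 5) is at or below `max_evalue`.
hmmer_targets() {
    max_evalue=$1; tblout=$2
    awk -v max="$max_evalue" '
        /^#/ { next }                          # skip header/footer comments
        ($5 + 0) <= (max + 0) { print $1 }     # force numeric comparison
    ' "$tblout"
}

# Usage (illustrative):
# hmmer_targets 1e-5 hits.tbl > targetIds
```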

0

HMMER returns a lot of sequences, so I clustered them with cd-hit, and this process gave the best results for mutation analysis with the MAPP program.

1
11.0 years ago
jackuser1979 ▴ 890

You can do this with BLASTO, a BLAST variant designed for orthologue searches. You could also try searching the eggNOG database or the DRSC tool.

0

Thank you. These are really interesting projects/tools, but I need a command-line program (such as the BLAST+ programs).

0

Is there any way to download all the sequences in FASTA format, please? I can't find one.

1
11.0 years ago
Asaf 10k

You can run PSI-BLAST and choose the proteins you get in the second or third iteration.
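
To make this automatic, the hits that first show up in a later round can be picked out of the output by a script. The sketch below assumes psiblast was run with -outfmt 7 and that each round in that output is introduced by a "# Iteration: N" comment line. It also assumes the default 12 columns, with the subject ID in column 2. The function name new_hits_from_iteration and the file names are hypothetical.

```shell
#!/bin/sh
# Search that would produce the table (shown as a comment; not run here):
#   psiblast -query query.fasta -db nr -num_iterations 3 \
#            -outfmt 7 -out psiblast.tsv

# Hypothetical helper: print subject IDs that appear for the first time
# in iteration `from_iter` or later.
new_hits_from_iteration() {
    from_iter=$1; file=$2
    awk -v from="$from_iter" '
        /^# Iteration:/  { iter = $3; next }  # remember the current round
        /^#/ || NF < 12  { next }             # comments, blanks, "CONVERGED" line
        !seen[$2]++ && iter >= from { print $2 }
    ' "$file"
}

# Usage (illustrative):
# new_hits_from_iteration 2 psiblast.tsv > distantHitIds
```

Hits already reported in the first round are recorded but not printed, so the output contains only the sequences that the profile pulled in later.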

1

And by the way, your question reminds me of the construction of BLOSUM; maybe you'll find interesting insights in the original paper.

0

This is another good piece of advice.

0

1) I need the BLAST search to be an automatic process, without manual work.

2) I will check the original paper. Thank you.

1
11.0 years ago
Spitshine ▴ 660

If you do not want to rely on an orthologous groups database, cluster your input set with cd-hit (http://weizhong-lab.ucsd.edu/cd-hit/) so that it keeps only diverse sequences.

This is how protein families were built in the olden days of biocomputing.

0

This is probably one of the best solutions. One possibility is to let blastp retrieve e.g. 3000 sequences and then obtain 50 representative sequences from cd-hit clustering.
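
A rough sketch of how that could be scripted follows. The three commands in the comments use options I believe exist in BLAST+ and cd-hit, but the file names and the 0.7 identity threshold are placeholders. The helper top_representatives is hypothetical. It reads the .clstr file that cd-hit writes next to its output and assumes the usual layout: a ">Cluster N" header, then one line per member, with the representative marked by a trailing "*".

```shell
#!/bin/sh
# Pipeline that would produce the cluster file (comments only; not run here):
#   blastp -query query.fasta -db nr -max_target_seqs 3000 \
#          -outfmt "6 sseqid" | sort -u > ids.txt
#   blastdbcmd -db nr -entry_batch ids.txt -out hits.fasta
#   cd-hit -i hits.fasta -o reps.fasta -c 0.7 -n 5 -d 0   # -d 0: untruncated IDs

# Hypothetical helper: print the representative IDs of the `n` largest
# clusters listed in a cd-hit .clstr file.
top_representatives() {
    n=$1; clstr=$2
    awk '
        /^>Cluster/ { cluster++; next }        # a new cluster starts
        {
            size[cluster]++                    # count members per cluster
            if ($NF == "*") {                  # representative line
                id = $3
                sub(/^>/, "", id)              # drop the leading ">"
                sub(/\.\.\.$/, "", id)         # drop the trailing "..."
                rep[cluster] = id
            }
        }
        END { for (c in rep) print size[c] "\t" rep[c] }
    ' "$clstr" | sort -k1,1nr | head -n "$n" | cut -f2
}

# Usage (illustrative):
# top_representatives 50 reps.fasta.clstr > representativeIds
```

If cd-hit returns far more or far fewer than 50 clusters, adjusting the -c threshold and re-running is another way to get close to the target number.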
