Question

Identifying closest homologue of a protein sequence

11.3 years ago

nathanielsaxe ▴ 10

Hi,

I have a list of proteins from a new genome project, so it's pretty much unannotated. However, the organism is closely related to C. elegans, so I was thinking of trying to identify the closest C. elegans homologues.

What I've been doing so far is running a protein BLAST at NCBI with each protein sequence and taking the top C. elegans hit. However, there are far too many sequences to do this one at a time, so I was wondering if there's a faster, automated way, or a program that does it for me.

Thanks!

blast • 3.8k views

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 11.3 years ago by nathanielsaxe ▴ 10

11.3 years ago

5heikki 11k

Use standalone BLAST (the NCBI BLAST+ command-line tools).

In brief:

# With -subject, BLAST runs single-threaded. To multithread, make a db first
# (makeblastdb -in CelegansSeqs.fasta -dbtype prot -out CelegansDB) and replace
# -subject with: -db CelegansDB -num_threads X   (X = threads your CPU supports)
blastp -query yourSeqs.fasta \
  -subject CelegansSeqs.fasta \
  -seg yes \
  -soft_masking true \
  -use_sw_tback \
  -out seqs-vs-Celegangs.tsv -outfmt 6

Output only best hits:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

sort -k1,1 -k12,12gr -k11,11g -k3,3gr seqs-vs-Celegangs.tsv | sort -u -k1,1 --merge > bestHits

There's a manual in the link too. The blastp flags are meant to improve best-homolog detection. They come from a publication, although I can't remember which one.
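If the double `sort` is hard to follow, the same selection can be sketched in Python. This is only an illustrative alternative, not part of the original answer: the function name `best_hits` is made up here, and it assumes the default 12-column `-outfmt 6` layout (query ID in column 1, percent identity in column 3, E-value in column 11, bit score in column 12). It keeps one row per query, preferring the highest bit score, then the lowest E-value, then the highest percent identity, which mirrors the sort keys above.

```python
def best_hits(lines):
    """Return {query_id: fields} with the best-ranked hit for each query.

    `lines` is any iterable of tab-separated BLAST outfmt 6 lines
    (for example an open file handle). Assumes the default 12 columns.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query = fields[0]
        # Smaller tuple = better hit: high bit score, low E-value, high identity.
        rank = (-float(fields[11]), float(fields[10]), -float(fields[2]))
        if query not in best or rank < best[query][0]:
            best[query] = (rank, fields)
    return {query: fields for query, (rank, fields) in best.items()}
```

Calling it on an open handle for `seqs-vs-Celegangs.tsv` gives a dictionary whose values are the full rows, so `fields[1]` is the C. elegans subject ID of the best hit.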

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 11.3 years ago by 5heikki 11k

11.3 years ago

David Fredman ★ 1.1k

I would suggest calling orthologs and paralogs between your species and C. elegans using the offline version of Inparanoid (by the Sonnhammer lab), which essentially performs a bi-directional BLAST and calls orthologs with sensible cutoffs. It's very easy to run, and you can obtain it (by request) here.

Other alternatives would include

or mapping your proteins to the pre-calculated orthologous groups in eggNOG
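To make the bi-directional BLAST idea concrete, here is a minimal sketch of a plain reciprocal-best-hit check. It is a simplification and not Inparanoid's actual algorithm, which also applies score cutoffs and groups in-paralogs. The function name and the two input dictionaries are hypothetical: `forward` maps each of your proteins to its best C. elegans hit, and `reverse` maps each C. elegans protein to its best hit among your proteins.

```python
def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit.

    forward: {your_protein_id: best_celegans_id}
    reverse: {celegans_id: best_your_protein_id}
    """
    pairs = []
    for query, subject in forward.items():
        # Keep the pair only if the reverse search points back to the query.
        if reverse.get(subject) == query:
            pairs.append((query, subject))
    return sorted(pairs)
```

Pairs that pass this check are candidate orthologs. Proteins whose best hit is not reciprocated may still be homologs, for example paralogs.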

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 11.3 years ago by David Fredman ★ 1.1k

11.3 years ago

Prakki Rama ★ 2.7k

You can also take a look at reciprocal smallest distance (RSD).

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 11.3 years ago by Prakki Rama ★ 2.7k

Ram · Accepted Answer · 2014-07-25

Try using HMMER.

The manual is available here.

In brief: for each C. elegans protein sequence, you build an HMM with the hmmbuild command. Concatenate all the HMMs into a single file to make a database, then run hmmpress on it to create the additional index files needed to search it. Now you can use either hmmscan, if you want to scan your sequences against the database you have just created, or hmmsearch, to search individual models against the sequences you have. The documentation describes the commands you need very well, but note the subtle difference between searching a model against a set of sequences and scanning a set of sequences against a database of models.
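The build, concatenate, press, scan sequence can be sketched as a small Python driver. Treat it as an illustration under assumptions rather than a tested pipeline: it assumes HMMER 3 is installed and on your PATH, that each C. elegans protein sits in its own single-sequence FASTA file, and that your hmmbuild version accepts single-sequence input. All file and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def hmmer_commands(protein_fastas, db_path, query_fasta, tblout):
    """Return (hmm_files, commands) for the build -> press -> scan workflow."""
    hmm_files, commands = [], []
    for fasta in protein_fastas:
        hmm = str(Path(fasta).with_suffix(".hmm"))
        # hmmbuild <hmmfile_out> <msafile>: one model per input file
        commands.append(["hmmbuild", hmm, str(fasta)])
        hmm_files.append(hmm)
    # hmmpress indexes the concatenated model database
    commands.append(["hmmpress", db_path])
    # hmmscan <hmmdb> <seqfile>: sequences against the model database
    commands.append(["hmmscan", "--tblout", tblout, db_path, query_fasta])
    return hmm_files, commands


def run_workflow(protein_fastas, db_path, query_fasta, tblout):
    """Run the workflow; requires the HMMER binaries to be installed."""
    hmm_files, commands = hmmer_commands(protein_fastas, db_path, query_fasta, tblout)
    for cmd in commands[:-2]:  # all hmmbuild calls
        subprocess.run(cmd, check=True)
    with open(db_path, "wb") as db:  # concatenate the models into one file
        for hmm in hmm_files:
            db.write(Path(hmm).read_bytes())
    for cmd in commands[-2:]:  # hmmpress, then hmmscan
        subprocess.run(cmd, check=True)
```

Separating command construction from execution lets you print and inspect the commands before running anything. The `--tblout` file gives one line per hit, which is easier to parse than the default report.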

If you have access to a parallel environment such as MPI (OpenMPI can usually be installed even on local machines to take full advantage of multiple cores), then you can build HMMER with MPI support to increase throughput.

As a rough idea of timing in our use: building and pressing a database of ~10k models takes about 10 minutes, and scanning a coding sequence against a database of ~10k models takes 2-3 seconds. This is only a very rough guide based on how we have used it, and it will undoubtedly differ for your use case.
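Those per-sequence figures can be turned into a back-of-the-envelope estimate for a whole proteome. The sketch below is plain arithmetic. The function name, the example count of 20,000 query proteins, and the assumption that work splits evenly across workers are all illustrative and not taken from the thread.

```python
def estimate_scan_hours(n_sequences, seconds_per_sequence, workers=1):
    """Estimate wall-clock hours to scan n_sequences, split evenly over workers."""
    total_seconds = n_sequences * seconds_per_sequence / workers
    return total_seconds / 3600


# Hypothetical example: 20,000 proteins at 2-3 seconds each on one core.
low = estimate_scan_hours(20_000, 2)
high = estimate_scan_hours(20_000, 3)
print(f"roughly {low:.1f} to {high:.1f} hours on a single core")
```

At that scale a single-core run would take most of a day, which is where the multi-core or MPI build mentioned above starts to matter.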