Question

Retrieving Conserved Domains For Transcripts From Cdd

12.2 years ago

Arun 2.4k

Hi there, I have a set of transcripts from tomato for which I'd like to predict protein function by looking for conserved domains within them, preferably using the NCBI CDD database (or any alternative you know of). Ideally I would get, for each hit, the type of domain, the region of the transcript it spans, and its score (E-value). Is there a way I could code this, in, say, Perl, to communicate with the CDD database and automate the search for all of my transcripts?

Thank you very much!

updated 12.2 years ago by Malcolm.Cook ★ 1.5k • written 12.2 years ago by Arun 2.4k

Are you looking to automate the steps that one would do via the CD Search Service?

12.2 years ago by Istvan Albert 102k

Istvan, yes, that's basically it.

12.2 years ago by Arun 2.4k

score 1 · Answer 1 · 2012-09-07

You want to do a translated rps-blast search of your transcripts against the CDD (or other) database at NCBI. That's what the CD-Search web page does.

Given:

Your transcripts are in a FASTA file, myTranscripts.fa (in this example it contains a single sequence, the Solanum lycopersicum glutamate dehydrogenase (gdh1) mRNA).
blast+, from NCBI, is installed on your computer.
Your computer is on the internet, and so will be able to use the -remote option of blast+.
You want the output in tabular format.

Then call:

rpstblastn -remote -db cdd  -outfmt 7  -evalue 0.1 < myTranscripts.fa > myTranscriptsCDD.tab

Output will look like:

# RPSTBLASTN 2.2.25+
# Query: gi|350540019|ref|NM_001246921.1| Solanum lycopersicum glutamate dehydrogenase (gdh1), mRNA
# RID: 4K0GGR16016
# Database: cdd
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 18 hits found
gi|350540019|ref|NM_001246921.1|    gnl|CDD|178095    82.97    411    69    1    70    1302    1    410    0.0     822
gi|350540019|ref|NM_001246921.1|    gnl|CDD|30682    50.39    381    187    2    160    1299    31    410    2e-159     462
gi|350540019|ref|NM_001246921.1|    gnl|CDD|133445    54.39    228    103    1    595    1278    1    227    2e-122     361
gi|350540019|ref|NM_001246921.1|    gnl|CDD|201083    46.22    238    123    4    595    1296    1    237    6e-105     317
gi|350540019|ref|NM_001246921.1|    gnl|CDD|202408    51.15    131    64    0    160    552    1    131    2e-63     204
... etc

(ehrm... BioStar is not treating the tabs correctly. It looks right whilst previewing the post, but not once finally posted. Go figure.)

The first hit in this output is to the glutamate dehydrogenase domain.
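Since the question asks for the domain, its location on the transcript, and its E-value, here is a minimal Python sketch of reading that information back out of the tabular file. It assumes the twelve default columns listed in the "# Fields" comment line above and splits on any whitespace. The names DomainHit and parse_outfmt7 are made up for this illustration and are not part of blast+.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    """One hit line from the tabular output (hypothetical helper type)."""
    query_id: str      # transcript identifier
    subject_id: str    # domain identifier, e.g. gnl|CDD|178095
    q_start: int       # start of the hit on the transcript
    q_end: int         # end of the hit on the transcript
    evalue: float
    bit_score: float


def parse_outfmt7(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data line, skipping '#' comments and blanks.

    Assumes the default 12 columns: query id, subject id, % identity,
    alignment length, mismatches, gap opens, q. start, q. end, s. start,
    s. end, evalue, bit score.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # tolerant of tabs or runs of spaces
        if len(fields) != 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}: {line!r}")
        yield DomainHit(
            query_id=fields[0],
            subject_id=fields[1],
            q_start=int(fields[6]),
            q_end=int(fields[7]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )
```

To use it on the file written by the command above, something like `with open("myTranscriptsCDD.tab") as fh: hits = list(parse_outfmt7(fh))` should be enough. The hits can then be filtered or grouped by query_id as needed.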

Notes:

You can search other databases of domains (pfam, smart, etc.) as listed on the web form.
Refer to the blast+ manual for more fine-grained control over search and output format options.
On a Mac using Homebrew, install blast+ with: brew install blast --use-llvm
If you already have the old NCBI blast installed, either upgrade, or use the old-style blastcl3 for running a remote blast (warning: it is 'deprecated').
The numbers in the 2nd column can be pasted into the 'Direct fetch via UID' form here. Translating them in bulk to a description is another problem.
You can also use the more complicated URI API from your favorite language, for example, perl, as documented
Searching pfam using HMMs remotely at Sanger is also possible
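
For the note about pasting the second-column numbers into the 'Direct fetch via UID' form, a small Python sketch like the following could collect them from the tabular file. It assumes subject IDs of the form gnl|CDD|<number>, as in the output above. The function name cdd_uids is hypothetical, and turning the numbers into descriptions in bulk is still not covered.

```python
from typing import Iterable, List


def cdd_uids(lines: Iterable[str]) -> List[str]:
    """Return the unique numeric CDD IDs from column 2, in first-seen order.

    Comment lines ('#'), blank lines, and subject IDs that do not look
    like gnl|CDD|<digits> are ignored.
    """
    seen = set()
    uids = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        prefix, _, uid = fields[1].rpartition("|")
        if prefix == "gnl|CDD" and uid.isdigit() and uid not in seen:
            seen.add(uid)
            uids.append(uid)
    return uids
```

For example, `with open("myTranscriptsCDD.tab") as fh: print("\n".join(cdd_uids(fh)))` would print one ID per line, ready to copy.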