Finding The Sequence Of A Domain
13.4 years ago
Shweta ▴ 90

I want to know how to get the amino acid sequence of a protein domain. For example, ICE (interleukin-converting enzyme) has two domains: the CARD domain and the PeptidaseC14 domain. I have the FASTA sequence of the entire ICE protein, but I'd like to know which stretch of that sequence corresponds to the CARD domain and which to the PeptidaseC14 domain. (I have seen a page in the KEGG database that shows this demarcation, but I cannot find it again.)

protein domain • 11k views

See this answer to a similar question:

Extract Domain Sequences From Multiple Sequences

@Moon, can you provide your FASTA file?

UniProt sequence:

>sp|P29466|CASP1_HUMAN Caspase-1 OS=Homo sapiens GN=CASP1 PE=1 SV=1
MADKVLKEKRKLFIRSMGEGTINGLLDELLQTRVLNKEEMEKVKRENATVMDKTRALIDS
VIPKGAQACQICITYICEEDSYLAGTLGLSADQTSGNYLNMQDSQGVLSSFPAPQAVQDN
PAMPTSSGSEGNVKLCSLEEAQRIWKQKSAEIYPIMDKSSRTRLALIICNEEFDSIPRRT
GAEVDITGMTMLLQNLGYSVDVKKNLTASDMTTELEAFAHRPEHKTSDSTFLVFMSHGIR
EGICGKKHSEQVPDILQLNAIFNMLNTKNCPSLKDKPKVIIIQACRGDSPGVVWFKDSVG
VSGNLSLPTTEEFEDDAIKKAHIEKDFIAFCSSTPDNVSWRHPTMGSVFIGRLIEHMQEY
ACSCDVEEIFRKVRFSFEQPDGRAQMPTTERVTLTRCFYLFPGH

@Moon, yes, my method worked on your sequence. The results are here: http://biocluster.ucr.edu/~alevchuk/finding-the-sequence-of-a-domain/results/ I obfuscated the AA sequences with X's, just in case I'm right in suspecting that this is a homework assignment.

13.4 years ago

The domains you are interested in, CARD and Peptidase_C14, are both in the Pfam-A database (release 25.0), so the following method will work.

This method has the following advantages:

  • It works on arbitrary sequences, even ones that are not in any public database (for example, simulated data).
  • It works with arbitrary protein domain HMMs, for example models you build yourself from MSAs.
  • It is completely scripted, so there is no need to click through a potentially large number of links.

Step 1

To extract the domain sequences you first need the start and end position of each domain. The following scripts show how to get these positions with HMMER 3 and the Pfam-A 25.0 database.

I assume that your original sequence is in my.fasta.

  1. 001-download-data
  2. 002-prepare-hmm
  3. 003-scan
  4. 004-extract-coords

NOTE: For less well-known domains, you can repeat this search against Pfam-B 25.0.

Output of Step 1: my.fasta-found-domains.tab-extract.tab
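
The coordinate-extraction step can be sketched in Python. This is only a sketch under assumptions, not necessarily what 004-extract-coords does: it assumes the scan was run with HMMER 3's --domtblout option, whose whitespace-delimited rows hold the alignment start/end on the sequence in columns 18-19 and the envelope start/end in columns 20-21. The names DomainHit and parse_domtblout are hypothetical, and the sample row is made up for illustration.

```python
from typing import Iterator, NamedTuple


class DomainHit(NamedTuple):
    domain: str      # HMM (Pfam family) name
    sequence: str    # name of the protein sequence
    i_evalue: float  # independent E-value of this domain hit
    start: int       # envelope start on the sequence, 1-based
    end: int         # envelope end on the sequence, inclusive


def parse_domtblout(lines, from_hmmscan: bool = True) -> Iterator[DomainHit]:
    """Yield one DomainHit per data row of a HMMER 3 --domtblout table.

    With hmmscan the target (column 1) is the HMM and the query (column 4)
    is the sequence; with hmmsearch it is the other way round.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split()
        target, query = f[0], f[3]
        domain, sequence = (target, query) if from_hmmscan else (query, target)
        yield DomainHit(domain, sequence, float(f[12]), int(f[19]), int(f[20]))


# Made-up row in domtblout layout (NOT real HMMER output), for illustration only.
example = [
    "# target name  accession  tlen  query name ...",
    "DomA PF00000.1 85 seq1 - 404 1e-20 70.0 0.1 1 1 1e-23 2e-20 69.5 0.1 "
    "1 85 5 88 3 90 0.95 Hypothetical domain",
]
hits = list(parse_domtblout(example))
print(hits[0])
```

The sketch reports envelope coordinates, which tend to span the whole domain; switching to fields 17 and 18 (zero-based) would give the narrower alignment coordinates instead.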


Step 2

Now that you have the coordinates, you can extract the domain sequences with the R script 005-extract-seq:

Output of Step 2: results
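
The R script itself is not reproduced on this page. As a rough stand-in (not the author's 005-extract-seq, which relies on Biostrings), the slicing step might look like this in Python. The helper names read_fasta, extract_domain and to_fasta and the toy sequence are hypothetical; coordinates are assumed to be 1-based and inclusive, as HMMER reports them.

```python
import io


def read_fasta(handle):
    """Return {sequence id: sequence} from FASTA text; id = first word of header."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


def extract_domain(seq, start, end):
    """Slice a 1-based, inclusive [start, end] range out of seq."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"range {start}-{end} outside sequence of length {len(seq)}")
    return seq[start - 1:end]


def to_fasta(name, seq, width=60):
    """Format one record as FASTA text wrapped at `width` columns."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"


# Toy input (made up for illustration); in practice use open("my.fasta") instead.
fasta_text = ">toy1 hypothetical protein\nACDEFGHIKL\nMNPQRSTVWY\n"
seqs = read_fasta(io.StringIO(fasta_text))

# (sequence id, domain name, start, end): coordinates as produced in Step 1.
coords = [("toy1", "DomA", 3, 12)]
for seq_id, domain, start, end in coords:
    sub = extract_domain(seqs[seq_id], start, end)
    print(to_fasta(f"{seq_id}/{start}-{end} {domain}", sub), end="")
```

The only subtle point is the off-by-one conversion: a 1-based inclusive range start..end maps to the Python slice [start - 1:end].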


The whole package

To run the entire method with example data, do this:

mkdir finding-the-sequence-of-a-domain
cd finding-the-sequence-of-a-domain

wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/001-download-data
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/002-prepare-hmm
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/003-scan
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/004-extract-coords
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/005-extract-seq

chmod +x 00*

time ./001-download-data    # Takes ~45 seconds

# Requires HMMER 3
time ./002-prepare-hmm      # Takes ~1.5 minutes
time ./003-scan my.fasta    # Takes ~30 seconds

time ./004-extract-coords my.fasta-found-domains.tab

# Requires Biostrings R package
# (install instructions here http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html)
time ./005-extract-seq my.fasta

Resulting FASTA files will be in the results directory, just like these ones: https://github.com/alevchuk/finding-the-sequence-of-a-domain/tree/master/results

I'm planning to add to my answer a script that will do the final step for the procedure: extract the sequences from the FASTA file.

I'm thinking of someday re-writing 005-extract-seq in Bash and using fastacmd for brevity.

Hi, I am trying your method on my 100 protein sequences to extract the domain. Do I need to download the HMMER3 tool and run the Perl script?

Yes, you need to download and compile HMMER3. Installing the HMMER3 binaries in your PATH will also make things simpler. The Bash scripts should then run, although they have been tested only a few times.

13.4 years ago
Stajich ▴ 30

I think we also have a solution for this in BioPerl; the code is available here.

It assumes you have run HMMER3 with the --domtblout option and that you pass that output and the FASTA file of your proteins as arguments.

13.4 years ago
Lyco ★ 2.3k

Several protein domain databases provide this information. Three examples are Pfam, SMART and PROSITE; there is also InterPro, which integrates data from various domain databases. All of them provide a server where you can paste a protein sequence or UniProt accession number and get back the positions of the domains. If you are mainly interested in the sequence of the domain instance, try SMART and PROSITE, because those two give you the sequence directly. With the other databases, you have to extract the domains yourself based on the reported positions.

For well-known domains (such as the ones in your example) there is another possibility: those domains are often annotated directly in the UniProt entry (see e.g. here). If you scroll down to the feature table, the CARD domain is listed, and you can click on the line '1-91' to get the sequence of the CARD domain highlighted.
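
As a small worked example, the '1-91' range quoted above can be applied to the Caspase-1 sequence posted earlier in this thread. This is only an illustration: the helper name slice_feature is hypothetical, and only the first 120 residues of the posted sequence are pasted to keep the snippet short.

```python
# First 120 residues of the Caspase-1 sequence posted earlier in this thread
# (truncated here for brevity; the CARD range 1-91 lies within it).
casp1_prefix = (
    "MADKVLKEKRKLFIRSMGEGTINGLLDELLQTRVLNKEEMEKVKRENATVMDKTRALIDS"
    "VIPKGAQACQICITYICEEDSYLAGTLGLSADQTSGNYLNMQDSQGVLSSFPAPQAVQDN"
)


def slice_feature(seq: str, feature_range: str) -> str:
    """Cut a 'start-end' range (1-based, inclusive) out of seq."""
    start, end = (int(x) for x in feature_range.split("-"))
    return seq[start - 1:end]


card = slice_feature(casp1_prefix, "1-91")
print(len(card))  # 91
print(card)
```

The same helper would work for any other range taken from a feature table or from Pfam, as long as the range refers to positions in the full-length sequence you pass in.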

Thank you very much, Lyco, for your prompt reply. UniProt lists only the CARD domain, so I referred to the sequence length given in Pfam to find the sequence of the other domain.
