The domains you are interested in, CARD and Peptidase_C14, are both in the Pfam25.0 A database, so the following method will work.
This method has the following advantages:
- It works on arbitrary sequences, even ones that don't exist in any public database (for example, simulated data).
- It works on arbitrary protein domain HMMs, for example models you build yourself from MSAs.
- It's completely scripted, so there is no need to click through a potentially large number of links.
Step 1
To extract the domain sequences, you first need the start and end position of each domain. The following scripts show how to get these positions with HMMER 3 and the Pfam25.0 A database.
I assume that your original sequence is in my.fasta.
- 001-download-data
- 002-prepare-hmm
- 003-scan
- 004-extract-coords
NOTE: For less well-known domains, you can repeat this search for Pfam25.0 B.
Output of Step 1: my.fasta-found-domains.tab-extract.tab
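
For readers who want to see what the coordinate extraction involves, here is a minimal Python sketch. It is not the 004-extract-coords script itself. It assumes the scan was written in HMMER 3's per-domain table format (hmmscan --domtblout), which may differ from what 003-scan actually produces. The DomainHit type, the parse_domtblout function, and the example row are all made up for illustration.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    seq_id: str  # name of the query sequence
    domain: str  # name of the matching model, e.g. CARD
    start: int   # 1-based envelope start on the sequence
    end: int     # 1-based, inclusive envelope end on the sequence


def parse_domtblout(lines):
    """Yield one DomainHit per data row of an hmmscan --domtblout table."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split()
        # With hmmscan, column 0 is the model (target) and column 3 is the
        # sequence (query); columns 19 and 20 hold the envelope coordinates.
        yield DomainHit(cols[3], cols[0], int(cols[19]), int(cols[20]))


# Invented example row (values are illustrative, not real search results).
example = [
    "# target name  accession  tlen  query name ...",
    "CARD - 85 seq1 - 416 1.2e-20 72.1 0.0 1 1 3.0e-24 1.5e-20 71.8 0.0 "
    "2 84 3 88 2 89 0.95 example description",
]
hits = list(parse_domtblout(example))
for hit in hits:
    print(hit.seq_id, hit.domain, hit.start, hit.end, sep="\t")
```

The sketch uses the envelope coordinates; if you prefer the tighter alignment coordinates, columns 17 and 18 hold those instead.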
Step 2
Now that you have the coordinates, you can extract the domain sequences with the following R script:
- 005-extract-seq
Output of Step 2: results
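
The idea behind this step is just slicing each sequence by its domain coordinates. A rough Python sketch of that idea follows. It is not the R script above; it assumes 1-based, inclusive coordinates (as HMMER reports them), and the function names and the toy sequence are invented for illustration.

```python
def read_fasta(lines):
    """Return {sequence id: sequence} from FASTA-formatted lines."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # id is the first word
            chunks[name] = []
        elif line and name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def extract_domain(seq, start, end):
    """Return the subsequence for 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"coordinates {start}-{end} fall outside the sequence")
    return seq[start - 1:end]


# Toy input (not a real protein); in practice, read the lines from my.fasta.
fasta = [">seq1 toy example", "MKTAYIAKQR", "QISFVKSHFS"]
seqs = read_fasta(fasta)
domain_seq = extract_domain(seqs["seq1"], 3, 8)
print(f">seq1_3-8\n{domain_seq}")
```

Subtracting 1 from the start only is what converts 1-based inclusive coordinates into a Python slice.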
The whole package
To run the entire method with example data, do this:
mkdir finding-the-sequence-of-a-domain
cd finding-the-sequence-of-a-domain
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/001-download-data
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/002-prepare-hmm
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/003-scan
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/004-extract-coords
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/005-extract-seq
chmod +x 00*
time ./001-download-data
time ./002-prepare-hmm
time ./003-scan my.fasta
time ./004-extract-coords my.fasta-found-domains.tab
time ./005-extract-seq my.fasta
The resulting FASTA files will be in the results directory, like these:
https://github.com/alevchuk/finding-the-sequence-of-a-domain/tree/master/results
See also this answer to a similar question: Extract Domain Sequences From Multiple Sequences
@Moon, can you provide your FASTA file?
Uniprot sequence:
@Moon, yes, my method worked on your sequence. The results are here: http://biocluster.ucr.edu/~alevchuk/finding-the-sequence-of-a-domain/results/ - but I obfuscated the AA sequences with X's, in case I'm right in suspecting that this is a homework assignment.
@Moon, yes, my method worked on your sequence. The results are here: http://biocluster.ucr.edu/~alevchuk/finding-the-sequence-of-a-domain/results/ Welcome to Biostars.org!