Write a regular expression that describes the alignment in the box.
Find 5 protein sequences from different organisms or strains that contain the pattern described by the regular expression from Q1. List the ID, name, size, source, and function of each protein.
Find 2 proteins with known structures that contain the pattern described by the regular expression from Q1. List the IDs of found protein structures.
Build a multiple sequence alignment for all protein sequences from Q2 and Q3.
Identify the conserved regions in the alignment from Q4 and explore their biological significance.
Evaluate statistical parameters of the regular expression from Q1 based on similar expressions in the Prosite database.
This is clearly a homework assignment, and you should ask your instructor for details. It defeats the educational goals of your instructor if we show you exactly how to do this. That said, here are a couple of hints.
I am guessing that a regular expression assignment is about individual columns in your alignment rather than a full set of sequences. For example, this is a regular expression of the last 4 columns in your alignment:
[EG]-R-D-[IL]
This means that the last column is either I or L, the next-to-last is always D, the one before it is always R, and the one before that is either E or G. You should check with your instructor, but I think that your assignment is to find this pattern across all columns, and then search the database for proteins that match the pattern you found.
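The hyphen-separated notation above looks like PROSITE pattern syntax, which most regex engines will not accept as written. Below is a minimal sketch of translating it into a Python regex. The helper name `prosite_to_regex` is made up for illustration, and it only handles the basic elements (single residues, `[..]` choices, `{..}` exclusions, `x`, and repeat counts), not anchors.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    regex_parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat suffix such as "(2)" or "(2,4)".
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        residues, repeat = match.groups()
        if residues == "x":
            residues = "."  # x stands for any residue
        elif residues.startswith("{") and residues.endswith("}"):
            residues = "[^" + residues[1:-1] + "]"  # {AB} means anything except A or B
        regex_parts.append(residues + ("{" + repeat + "}" if repeat else ""))
    return "".join(regex_parts)

tail = prosite_to_regex("[EG]-R-D-[IL]")
print(tail)                                   # [EG]RD[IL]
print(bool(re.search(tail, "PFERDL")))        # True: E, R, D, L
print(bool(re.search(tail, "PFARDL")))        # False: A is not in [EG]
```

The two test strings are invented; they are not taken from the alignment in the question.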
For example, here is one protein that matches the whole pattern (the match is in red):
Would this be correct for the regular expression for the whole alignment?
[YF][KLYF][YFHX]R[YWSCRLI][LYWFVHS][RKX][HRKS][GSTE]K[LI][RNK][P][FY][EG]RD[LI]
Thank you for checking! My next question is how to find 5 protein sequences from different organisms or strains that contain the pattern described by the regular expression that I provided above. I have to list the ID, name, size, source, and function of each protein. How can I do that?
If you are doing this on the Linux command line, you can use grep with the regex. If you are using a programming language like Python or R, both have functions for searching strings with a regex (the re module in Python, grepl and regmatches in R). Refer to the documentation for those languages for more information.
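As a rough sketch of the Python route, the snippet below scans FASTA records with the regex proposed above and reports the ID, sequence length, 1-based match position, and matched text. The records in `toy_fasta` are invented for illustration, and `read_fasta` and `find_hits` are hypothetical helper names. Sequence lines are joined before searching because a match can span a line break in a wrapped FASTA file, which is also something to watch for with plain grep.

```python
import io
import re

# The regex proposed earlier in this thread (not verified against the alignment).
PATTERN = re.compile(
    r"[YF][KLYF][YFHX]R[YWSCRLI][LYWFVHS][RKX][HRKS][GSTE]K[LI][RNK][P][FY][EG]RD[LI]"
)

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def find_hits(handle, pattern=PATTERN):
    """Return (id, length, 1-based start, matched text) for each matching record."""
    hits = []
    for header, seq in read_fasta(handle):
        m = pattern.search(seq)
        if m:
            hits.append((header.split()[0], len(seq), m.start() + 1, m.group()))
    return hits

# Hypothetical records; replace with open("your_file.fasta") for real data.
toy_fasta = io.StringIO(
    ">toy1 hypothetical protein A\n"
    "MAAYKYRYLRHG\n"
    "KLRPFERDLTT\n"
    ">toy2 hypothetical protein B\n"
    "MKTAYIAKQR\n"
)
for hit in find_hits(toy_fasta):
    print(hit)  # ('toy1', 23, 4, 'YKYRYLRHGKLRPFERDL')
```

The name, source organism, and function of each hit would still have to come from the database record itself; the sketch only finds which sequences contain the pattern.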
Please change your title "Bioinformatics questions sequence". Of course it is a question about bioinformatics...
Looks like homework. What have you tried so far?
Not sure where to start.
The first question is asking to write a regular expression that captures those sequences. Depending on what language you are writing this in there will be regex tutorials that you should go through.
would this be correct? regex = ([A-Z])+
Yes, but it looks like it's an amino-acid alphabet (not A to Z) with a specific length...
what would you recommend then?
Well, yes, but it also covers the sequences A and AA and AAA and every other sequence of alphabetical uppercase characters that is conceivable (including all sequences that contain non-amino acid letters).
You need to find one that covers (exactly) the given alignment. So best to look at the individual columns of the alignment and see what amino acids they're composed of. This should then give you an idea of how to build the regex.
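To make the column-by-column idea concrete, here is a small sketch in Python. The three aligned rows are made up (they are not the alignment from the question), `regex_from_alignment` is a hypothetical helper, and gap characters are not handled.

```python
import re

def regex_from_alignment(rows):
    """Build a regex with one character class per alignment column."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    parts = []
    for column in zip(*rows):
        residues = sorted(set(column))
        # A fully conserved column becomes a literal, a variable one a class.
        parts.append(residues[0] if len(residues) == 1
                     else "[" + "".join(residues) + "]")
    return "".join(parts)

toy_alignment = ["PFERDL", "PYGRDI", "PFGRDL"]  # hypothetical rows
pattern = regex_from_alignment(toy_alignment)
print(pattern)                                   # P[FY][EG]RD[IL]

# ([A-Z])+ accepts almost anything; the column-wise pattern does not.
print(bool(re.fullmatch(r"([A-Z])+", "AAA")))    # True
print(bool(re.fullmatch(pattern, "AAA")))        # False
```

Note that a column-wise pattern also matches combinations of residues that never occur together in any single row, so it is somewhat broader than the literal set of aligned sequences.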