Hi,
I had some follow-up questions on this for the very n00b user. I'd like to make a phylogenetic tree of a specific type of sigma factor. So if, for example, I search rpoD in KEGG (the gene name for the primary housekeeping sigma factor in E. coli), it pops up thousands of genes. Most of these genes have a "motif" that describes a group 2 and a group 4 region of the sigma factor. What I would like to do is search all of these genes based on the group 2 and group 4 amino acid sequences and bin them according to their similarity. So these thousands of genes from many organisms would be put into 10 or so classes based on the similarity of BOTH the group 2 and group 4 sequences.
I have downloaded MEGA and Seaview, and judging from their descriptions both appear able to build this, but I'm having a very hard time getting started. A little direction would be most appreciated. How do I "collect" all of the sigma factors that will go into the tree I want to construct? And how do I search for just the group 2 and group 4 amino acid sequences and score them by similarity?
Thank you to all in advance!
You could start by searching at NCBI. See the protein/protein cluster hits in the right column. You can collect sequences from those two groups based on criteria you choose (what kind of organisms etc).
Briefly --> Select one of the clusters --> Click on the name of the cluster to open the page for that cluster --> Click on the "Protein" link under Related Information at the top right of the page --> On the "Proteins" page that opens, select all entries --> Click on "Summary" at the top left of the page --> Change to "Fasta" --> Use the "Send to" button and select "file" to send the sequences to a file that you can save. This file can then be used as input for your alignment/tree construction in MEGA.
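If clicking through the pages gets tedious, the same FASTA download can be requested through NCBI's E-utilities `efetch` endpoint. Below is a minimal sketch that only builds the request URL; the helper name `efetch_fasta_url` and the accessions are placeholders made up for illustration, not real records.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions, db="protein"):
    """Build an efetch URL that returns the given records as FASTA text.

    Hypothetical helper: it only assembles the query string and does not
    contact NCBI.
    """
    params = {
        "db": db,
        "id": ",".join(accessions),  # comma-separated list of IDs
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


# Placeholder accessions, for illustration only.
url = efetch_fasta_url(["ACCESSION_1", "ACCESSION_2"])
print(url)
```

Opening that URL (in a browser, or with `urllib.request.urlopen`) should return the records as plain FASTA text. For long ID lists it is probably wise to request them in batches and to check NCBI's usage guidelines first.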
Is this a homework/assignment question?
Hi genomemax2,
Thanks for the reply! I'm actually beginning a research project, and so while I don't need to grab every conceivable genome, I do need to get good coverage of the representative genomes that are available. There are some great papers where this has been done, but they aren't detailed enough on the "how" for the n00b user (understandably).
Your advice was very helpful, and it was great to actually see a tree. But NCBI doesn't seem to have enough coverage (only a few hundred sequences)? On KEGG, if I search for rpoD, select one of the hundreds of organisms (http://www.genome.jp/dbget-bin/www_bget?ko:K03086) and pick the Group 2 protein motif, it pops up 119,000 matches (http://www.genome.jp/dbget-bin/www_bget?pf:Sigma70_r2). Getting this into a FASTA file is one big challenge. I then need to do the same thing with the Group 4 protein motif, and query the database to rank the similarity between organisms' Group 2 and Group 4 sites (e.g., the same Group 2 site but a different Group 4, etc.). Any further guidance?
If you were to move one line up to "proteins" you can see that there are almost 78K sequences. So that should be plenty for you.
As you have discovered below, you probably don't need each and every one of these sequences, since many are likely identical. If you really want to make an alignment/tree from an enormous number of sequences, then you may need to move this analysis to a server with a good bit of RAM/processor power and use T-Coffee, MUSCLE, PHYLIP, etc.
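Since many of the sequences are likely identical, it may help to collapse exact duplicates before aligning. Here is a rough sketch in plain Python, assuming a standard FASTA file; `read_fasta` and `drop_identical` are names invented for this example, and the three records are toy data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_identical(records):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Toy records for illustration; seqB duplicates seqA exactly.
example = """>seqA
MEQNPQSQ
>seqB
MEQNPQSQ
>seqC
MEQNPKSQ
""".splitlines()

unique = drop_identical(read_fasta(example))
for header, seq in unique:
    print(">" + header)
    print(seq)
```

This only removes exact duplicates. Collapsing near-identical sequences (say, above some identity threshold) would need a dedicated clustering tool instead.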
So I guess the question boils down to the following:
How do you do a multiple sequence alignment on two regions (Group 2 and Group 4 of sigma factors) with broad coverage across genomes, output as FASTA files?
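One possible way to set this up, sketched below, is to cut the two regions out of each protein, join them, and write the result as FASTA for the aligner. This assumes you already have start/end coordinates for each region (for example from a domain search against the Sigma70_r2 and Sigma70_r4 models); the coordinates, names, and the toy sequence here are all made up for illustration.

```python
def extract_regions(seq, regions):
    """Concatenate 1-based, inclusive (start, end) regions of a sequence."""
    return "".join(seq[start - 1:end] for start, end in regions)


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapped at `width`."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy string (not a real protein) chosen so the slicing is easy to see.
toy = "MSTARTREGIONTWOLINKERREGIONFOUREND"

# Hypothetical coordinates: region 2 at 7-15, region 4 at 22-31.
coords = {"toy_sigma": [(7, 15), (22, 31)]}

records = [(name, extract_regions(toy, regions))
           for name, regions in coords.items()]
print(to_fasta(records), end="")
```

The resulting FASTA file holds one concatenated region 2 + region 4 sequence per protein, which can then be loaded into MEGA or Seaview for alignment and tree building. Aligning the two regions separately and concatenating the alignments afterwards is another option.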