What Is The Largest Collection Of Multiple Alignments?
13.6 years ago

I am doing research that requires a large collection of multiple sequence alignments (MSAs).

What is the largest downloadable collection?

Here is what I have looked at so far.

Pfam 25 (April 2011) has 12,273 families. It appears to provide several MSAs for each family, but extracting individual MSAs from the files available via FTP download seems to be hard.
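For what it is worth, the Pfam flat files are concatenated Stockholm-format alignments, each terminated by a line containing only `//`, so pulling out individual MSAs is mostly a matter of streaming through the file. Below is a minimal sketch of that idea, not an official Pfam tool: the helper names `split_stockholm` and `sequence_names` are made up for illustration, and the two records in `demo` are toy data with invented accessions.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def split_stockholm(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (accession, record_lines) for each '//'-terminated Stockholm record.

    A trailing record that lacks the '//' terminator is not yielded.
    """
    record: List[str] = []
    accession = ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "//":
            if record:
                record.append("//")
                yield accession or "unknown", record
            record, accession = [], ""
            continue
        if not line.strip() and not record:
            continue  # skip blank lines between records
        record.append(line)
        parts = line.split()
        # The accession is stored on a '#=GF AC <accession>' annotation line.
        if len(parts) >= 3 and parts[0] == "#=GF" and parts[1] == "AC":
            accession = parts[2]


def sequence_names(record: List[str]) -> List[str]:
    """Return the unique sequence names in a record, in order of first appearance."""
    names: List[str] = []
    for line in record:
        if not line.strip() or line.startswith("#") or line.strip() == "//":
            continue  # annotation, blank, or terminator line
        name = line.split()[0]
        if name not in names:
            names.append(name)
    return names


# Toy input: two tiny records with invented accessions (not real Pfam entries).
demo = """# STOCKHOLM 1.0
#=GF ID   FamA
#=GF AC   TOY00001.1
seq1/1-5  ACDEF
seq2/1-5  AC-EF
//
# STOCKHOLM 1.0
#=GF ID   FamB
#=GF AC   TOY00002.1
seq3/1-4  GHIK
//
"""

records = list(split_stockholm(io.StringIO(demo)))
for acc, rec in records:
    print(acc, len(sequence_names(rec)), "sequences")
```

To run this on a real download, one would pass an open file handle (for example, a decompressed seed or full alignment file) instead of the `io.StringIO` wrapper and write each yielded record to its own file. Because the generator holds only one record in memory at a time, it should cope with large files; opening the file with a tolerant encoding setting is a reasonable precaution in case the annotations contain non-ASCII characters.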

In BAliBASE3 I counted 386 MSAs.

InterPro 32.0 (April 2011) has 14,469 families but does not provide MSAs.

There must be something large available somewhere.

13.6 years ago

The current version of the Conserved Domain Database (CDD, v2.25) contains 37,632 alignment models. This number includes the Pfam- and SMART-based models incorporated in CDD, along with 6,056 domains curated by NCBI.

See the following URLs for more details:

CDD URL: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

CDD 2011 manuscript: http://www.ncbi.nlm.nih.gov/pubmed/21109532

13.6 years ago

The current version of Ensembl has the following collections of protein/CDS MSAs:

18653 protein-coding EnsemblCompara GeneTrees

755 ncRNA EnsemblCompara NcTrees

ftp://ftp.ensembl.org/pub/current_emf/ensembl-compara/homologies/

as well as 574924 Ensembl+UniProt Protein Families, downloadable via MySQL and the Perl API.

http://www.ensemblgenomes.org contains GeneTrees for different clades (available via FTP, MySQL, and the API):

ensembl_compara_bacteria_9_62=38332
ensembl_compara_fungi_9_62=13506
ensembl_compara_metazoa_9_62=36063
ensembl_compara_pan_homology_9_62=43561
ensembl_compara_plants_9_62=37001
ensembl_compara_protists_9_62=15513

Other similar resources are http://treefam.org and http://phylomedb.org

13.6 years ago
Lyco ★ 2.3k

Just a note of caution (probably unnecessary as I am sure you know what you are doing): You should not focus entirely on the sheer size of the multiple alignment collection.

Besides the usual concerns about alignment quality, there are big differences in the range of species covered, and the alignment databases also take different approaches to ortholog/paralog situations. Thus, the choice of the optimal database depends on what you plan to do with it.

I work a lot with alignment collections myself. For some purposes it is fine to have as many paralogs as possible (Pfam style), while for others you need alignments of pure ortholog groups (I use Evola and TreeFam, but I am not entirely happy with those). The databases derived from Ensembl data (there are several of those) tend to contain lots of mammalian and vertebrate gene models with very little sequence divergence but lots of gene prediction artefacts (missing exons, additional exons, intron remnants...).

13.5 years ago
Abhiman ▴ 130

I agree with Lyco. There are many collections of multiple sequence alignments, with varying degrees of species representation, sequence divergence, and numbers of alignments. The choice of resource will depend on what you want to do with them. I can think of some more resources like[?][?]

13.6 years ago

You can also download lots of MSAs from the eggNOG database.


That's a list of alignment software, not a list of alignments.
