Question

Genomic Positions Of Protein Domains

5

14.4 years ago

Pascal ▴ 130

Hi there,

I am looking for a database containing the genomic positions of known protein domains. In principle, I need the genomic start and stop position of each domain of each gene. I know that these positions would span introns, but this is not important for my purpose. Is there something like that? I took a look at BioMart and other sources, but mostly I just got the position on the protein sequence, not the absolute position on the genome.

Regards

protein protein • 12k views

ADD COMMENT • link updated 14.4 years ago by Khader Shameer 18k • written 14.4 years ago by Pascal ▴ 130

Did you find a solution that others can use?

ADD REPLY • link 13.2 years ago by Bioinfosm ▴ 620

score 8 · Answer 1 · 2010-11-12

14.4 years ago

Pierre Lindenbaum 166k

To create this resource I would:

1. Get the table knownGene from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz
2. Build the cDNA and the translated protein using 'knownGene' and the reference sequences: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes
3. Fetch the Swiss-Prot entry for each gene.
4. Align the Swiss-Prot entry and your reconstituted protein.
5. Map the annotation from the Swiss-Prot entry to your reconstituted protein.
6. Map the positions of the domains back to the reference genome.
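The last step can be sketched in Python as follows. This is only an illustration, not Pierre's implementation: it assumes the standard knownGene column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), UCSC's 0-based half-open coordinates, and a domain given as 1-based inclusive residue numbers on the reconstituted protein. The function names and the example row are hypothetical.

```python
def cds_blocks(line):
    """Parse one knownGene.txt row into (chrom, strand, coding blocks).

    Blocks are 0-based, half-open genomic intervals clipped to the CDS,
    listed in ascending genomic order.
    """
    f = line.rstrip("\n").split("\t")
    chrom, strand = f[1], f[2]
    cds_start, cds_end = int(f[5]), int(f[6])
    starts = [int(x) for x in f[8].rstrip(",").split(",")]
    ends = [int(x) for x in f[9].rstrip(",").split(",")]
    blocks = []
    for s, e in zip(starts, ends):
        s, e = max(s, cds_start), min(e, cds_end)
        if s < e:  # skip exons that are entirely UTR
            blocks.append((s, e))
    return chrom, strand, blocks


def cds_to_genome(blocks, strand, cds_pos):
    """Map a 0-based offset in the spliced CDS to a 0-based genomic position."""
    ordered = blocks if strand == "+" else blocks[::-1]
    for s, e in ordered:
        length = e - s
        if cds_pos < length:
            return s + cds_pos if strand == "+" else e - 1 - cds_pos
        cds_pos -= length
    raise ValueError("position lies beyond the coding sequence")


def domain_span(blocks, strand, aa_start, aa_end):
    """Genomic span (0-based, half-open) of residues aa_start..aa_end.

    The span runs from the first base of the first codon to the last base
    of the last codon, so it includes any introns in between.
    """
    first = cds_to_genome(blocks, strand, (aa_start - 1) * 3)
    last = cds_to_genome(blocks, strand, aa_end * 3 - 1)
    return min(first, last), max(first, last) + 1


# Hypothetical three-exon gene, not a real knownGene entry.
row = "uc_demo\tchr1\t+\t100\t400\t110\t380\t3\t100,200,350,\t140,260,400,\tP_demo\tuc_demo"
chrom, strand, blocks = cds_blocks(row)
print(chrom, domain_span(blocks, strand, 5, 20))  # chr1 (122, 230)
```

On the minus strand the coding blocks are walked from the highest coordinate downwards, which is why `cds_to_genome` reverses the block list before counting.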

ADD COMMENT • link 14.4 years ago by Pierre Lindenbaum 166k

I was hoping to get around that, but this is maybe the best possibility.

ADD REPLY • link 14.4 years ago by Pascal ▴ 130

Sorry, this is probably a silly question, but I don't get how you build the translated protein using knownGene. Moreover, when you say to align the Swiss-Prot entry and the reconstituted protein, you are talking about amino acids, aren't you? So my problem is that I can't follow the procedure, because you are aligning proteins and then you must go back to DNA.

ADD REPLY • link 13.1 years ago by Tonig ▴ 440

Yes, but the knownGene table contains the structure of the exons on the genomic reference, so you can map each amino acid back to its bases on the genome. See my program backlocate: http://code.google.com/p/variationtoolkit/wiki/BackLocate
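To make the idea concrete, here is a minimal Python sketch (it is not the backlocate code). It assumes 0-based, half-open exon coordinates as in the UCSC tables and simply lists every coding base in transcript order, so residue n owns entries 3(n-1) to 3(n-1)+2 of that list. The function names and the example coordinates are hypothetical.

```python
def coding_positions(exon_starts, exon_ends, cds_start, cds_end, strand):
    """List the 0-based genomic position of every coding base, in transcript order."""
    positions = []
    for s, e in zip(exon_starts, exon_ends):
        # Clip each exon to the CDS; an all-UTR exon yields an empty range.
        positions.extend(range(max(s, cds_start), min(e, cds_end)))
    if strand == "-":
        positions.reverse()  # transcription runs from high to low coordinates
    return positions


def codon_positions(positions, residue):
    """Return the three genomic positions encoding a 1-based residue number."""
    i = (residue - 1) * 3
    codon = positions[i:i + 3]
    if len(codon) != 3:
        raise ValueError("residue lies beyond the coding sequence")
    return codon


# Hypothetical gene with three exons; the CDS spans 111..381.
pos = coding_positions([100, 200, 350], [140, 260, 400], 111, 381, "+")
print(codon_positions(pos, 10))  # [138, 139, 200], a codon split by an intron
```

Building the full list is wasteful for long genes, but it makes split codons and the minus strand easy to reason about. On the minus strand the bases at these positions must also be reverse-complemented before translation.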

ADD REPLY • link 13.1 years ago by Pierre Lindenbaum 166k

score 7 · Answer 2 · 2010-11-12

A few weeks back, I was also looking for such a resource for my analysis and realised exactly what you figured out: you won't be able to get this information from BioMart. I contacted the Ensembl help desk, and they suggested that I integrate data using Ensembl resources (some of the data via BioMart and the rest via the Ensembl Core/Variation API). So you have two options now: you may explore the Ensembl API path or proceed as described by Pierre using UCSC resources. Also remember that it will get a bit more complex due to alternative transcripts and alternative exons. These can change the final protein product and the exact genomic location of the domains, so you may not be able to get a perfect one-to-one mapping.

score 5 · Answer 3 · 2010-11-12

What you're describing is a coordinate conversion problem.

It's possible, of course, if you know all of the required coordinates in both coordinate systems (i.e. exons and domains, in amino acid and nucleotide coordinates), but it is quite technically challenging.

One solution, if you're comfortable with Perl/BioPerl, might be the BioPerl module Bio::Coordinate::GeneMapper, which was written for just this purpose. There may be similar libraries available for other languages.

As Pierre mentioned, you may also be able to use the UCSC tables, many of which have positional information.

score 3 · Answer 4 · 2010-11-12

14.4 years ago

biobot 0.0.77.a.1099 6.2k

You say that you are using BioMart? Does that mean your genome of interest is in Ensembl? If so, the work may already be done for the annotated protein domains; these are stored as Bio::EnsEMBL::ProteinFeature objects, which have a location both on the protein (in protein coordinates) and on the genome (in chromosome coordinates).

To find these you would need to obtain genes, then transcripts and from those the translations. Given a translation, you can get the protein features and then filter these to include only those whose analysis type you require e.g. Pfam.
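The same walk can be sketched in Python against the Ensembl REST service instead of the Perl API. This is an assumption-laden illustration: the REST service postdates this thread, and the endpoints (`overlap/translation/:id` and `map/translation/:id/:region`) and the JSON field names (`start`, `end`, `id`, `mappings`, `seq_region_name`, `strand`) are written as I understand them from the REST documentation and should be checked there. The helper names are hypothetical.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"


def domain_url(translation_id, source="Pfam"):
    """URL listing protein features of one analysis type for a translation."""
    return (f"{SERVER}/overlap/translation/{translation_id}"
            f"?feature=protein_feature;type={source}")


def mapping_url(translation_id, aa_start, aa_end):
    """URL converting a protein-coordinate range to genomic coordinates."""
    return f"{SERVER}/map/translation/{translation_id}/{aa_start}..{aa_end}"


def fetch_json(url):
    """GET a URL and decode the JSON body (needs network access)."""
    req = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def genomic_span(mappings):
    """Collapse the per-exon mapping blocks into one (chrom, start, end, strand)."""
    return (mappings[0]["seq_region_name"],
            min(m["start"] for m in mappings),
            max(m["end"] for m in mappings),
            mappings[0]["strand"])


def domains_on_genome(translation_id, source="Pfam"):
    """Return (domain id, genomic span) pairs for one translation."""
    results = []
    for feat in fetch_json(domain_url(translation_id, source)):
        mapped = fetch_json(mapping_url(translation_id, feat["start"], feat["end"]))
        results.append((feat.get("id"), genomic_span(mapped["mappings"])))
    return results
```

Calling `domains_on_genome` with a translation stable ID would issue one request for the domain list and one per domain for the mapping. For a whole genome, the core database or the Perl API is likely the better route.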

While this is possible according to the API docs, I don't know whether these data are present for your organism. It's probably worth checking, though, because it will only take a short script to find out.

ADD COMMENT • link 14.4 years ago by biobot 0.0.77.a.1099 6.2k

That's what I checked first. Unfortunately, it seems that BioMart (the web site) doesn't offer any positional information about protein domains at all. I also checked out the Perl API, but I wasn't able to get genomic positions, only the location on the protein.

ADD REPLY • link 14.4 years ago by Pascal ▴ 130

No, BioMart doesn't offer this, but the fact that the data are in BioMart means that there is very probably a core Ensembl database for your organism, and you can use the API to get the information.

ADD REPLY • link 14.4 years ago by biobot 0.0.77.a.1099 6.2k

score 2 · Answer 5 · 2010-11-12

14.4 years ago

Darked89 4.7k

For COGs there is Genome ProtMap:

http://www.ncbi.nlm.nih.gov/sutils/protmap.cgi?cluster=COG4690E&result=map

The (very) hard way would be to map selected Pfam domains back to the genome of interest using GeneWise.

ADD COMMENT • link 14.4 years ago by Darked89 4.7k

score 1 · Answer 6 · 2010-11-12

14.4 years ago

Michael Kuhn 5.0k

PAL2NAL is a tool that can project a protein alignment onto nucleotide sequences. It's not exactly meant for what you want to do, but might be usable if you use the domain sequence and the nucleotide sequence of the gene.
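The projection itself is simple to illustrate. The Python sketch below is not PAL2NAL's code, only the core idea under the assumption that the nucleotide sequence is the exact, in-frame coding sequence of the ungapped protein: each aligned residue consumes one codon, and each gap becomes three gap characters. The function name is hypothetical.

```python
def project_onto_codons(aligned_protein, cds):
    """Thread a CDS through a gapped protein sequence, codon by codon."""
    ungapped = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(ungapped):
        raise ValueError("CDS is too short for this protein")
    out = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            out.append("---")  # an alignment gap spans one whole codon
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


print(project_onto_codons("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```

Aligning a domain sequence to the full protein and projecting it this way gives the domain's offsets within the coding sequence; the exon structure is still needed to turn those offsets into genomic coordinates.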

ADD COMMENT • link 14.4 years ago by Michael Kuhn 5.0k