Question

Mapping Protein Domains To Exons

6

Entering edit mode

13.4 years ago

Anupam Sinha ▴ 60

I have a set of protein sequences (from Biomart ENSEMBL Release 62) of Homo sapiens. On these sequences I have run PFAM-HMMER to find domains on these sequences. I also have coordinates (coding sequence as well as genomic)for the exons that make up these proteins. I want to map the protein domains to these exons. How do I accomplish this ? Solutions in any of these- PERL, R, Bioconductor,Python, Biomart- will be very helpful. Thanks in advance.

exon protein mapping r • 12k views

ADD COMMENT • link updated 13.4 years ago by Pierre Lindenbaum 166k • written 13.4 years ago by Anupam Sinha ▴ 60

1

Entering edit mode

In the future, please add a note when you cross-post to other forums like the bioconductor email list.

ADD REPLY • link 13.4 years ago by Sean Davis 27k

score 2 · Answer 1 · 2012-01-03

2

Entering edit mode

13.4 years ago

Sean Davis 27k

I have already answered this on the bioconductor email list, but to summarize:

Use the GenomicFeatures bioconductor package to get the locations of the exons.
Import your PFAM information as a GRanges object (from the GenomicRanges bioconductor package).
Use findOverlaps() method to find overlaps between the two interval sets.

ADD COMMENT • link 13.4 years ago by Sean Davis 27k

0

Entering edit mode

Thanks a lot Sean

ADD REPLY • link 13.3 years ago by Anupam Sinha ▴ 60

0

Entering edit mode

But the pfam information is a protein sequence, i.e aminoacids, so how doyou import this with Genomic ranges and then overlap with the exons? Because, what you are trying to get is the genomic coordinates of this domains

ADD REPLY • link 13.2 years ago by Tonig ▴ 440

score 1 · Answer 2 · 2012-01-03

I don't have a solution but wish to point out a pitfall. Small exons or exons that encode only the extreme amino- or carboxy-terminal portion of a Pfam domain may score too low to be retained for your downstream analysis. One solution to this is to map your Pfam hits against the model and note where there are missing residues. For example, an exon of 60 nt (20 aa) may encode Pfam model residues 1 to 20, but you don't see that hit (high E-value, low score, etc). At the same time, your first real, high-scoring hit begins with Pfam residue 21. Hmm, where are those first 20 aa? Are they in the upstream exon?

score 1 · Answer 3 · 2012-01-03

1

Entering edit mode

13.4 years ago

Pierre Lindenbaum 166k

I've modified one of my program to create protein2genome, a tool mapping a peptide/start/end to the human genome using the ucsc knowngene database.

The C++ source is available here: http://code.google.com/p/variationtoolkit/source/browse/trunk/src/protein2genome.cpp and has justbeen integrated in my experimental package: varkit

Example:

$ echo -e "#Pep\tpepStart\tpepEnd\tDomain\nZC3H7B\t82\t115\tTPR\nNOTC2_HUMAN\t26\t63\tEGF_DOMAIN" |\
  protein2genome |\
  verticalize

>>> 2
$1  #Pep                ZC3H7B
$2  pepStart            82
$3  pepEnd              115
$4  Domain              TPR
$5  knownGene.name      uc003azw.2
$6  knownGene.chrom     chr22
$7  knownGene.strand    +
$8  knownGene.exon      Exon 4
$9  domain.start        41721879
$10 domain.end          41721922
<<< 2

>>> 3
$1  #Pep                ZC3H7B
$2  pepStart            82
$3  pepEnd              115
$4  Domain              TPR
$5  knownGene.name      uc003azw.2
$6  knownGene.chrom     chr22
$7  knownGene.strand    +
$8  knownGene.exon      Exon 5
$9  domain.start        41723209
$10 domain.end          41723268
<<< 3

>>> 4
$1  #Pep                NOTC2_HUMAN
$2  pepStart            26
$3  pepEnd              63
$4  Domain              EGF_DOMAIN
$5  knownGene.name      uc001eik.2
$6  knownGene.chrom     chr1
$7  knownGene.strand    -
$8  knownGene.exon      Exon 2
$9  domain.start        120572528
$10 domain.end          120572609
<<< 4

>>> 5
$1  #Pep                NOTC2_HUMAN
$2  pepStart            26
$3  pepEnd              63
$4  Domain              EGF_DOMAIN
$5  knownGene.name      uc001eik.2
$6  knownGene.chrom     chr1
$7  knownGene.strand    -
$8  knownGene.exon      Exon 3
$9  domain.start        120548178
$10 domain.end          120548211
<<< 5

ADD COMMENT • link 13.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks a lot Pierre

ADD REPLY • link 13.3 years ago by Anupam Sinha ▴ 60

0

Entering edit mode

Dear Pierre, how would I use your tool if I had EnsEMBL sequences instead of UCSC known genomes

ADD REPLY • link 13.0 years ago by Anupam ▴ 40

1

Entering edit mode

there is a table mapping the UCSC known Genes to ENSEMBL: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownToEnsembl.txt.gz

ADD REPLY • link 13.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Pierre,

Would this also work for cases where one protein domain is encoded by i.e. split across > 1 exon?

I want to only report the number of exons that encode a certain protein domain? How would I modify use of your protein2genome for my purposes?

ADD REPLY • link 9.7 years ago by Anand Rao ▴ 640

0

Entering edit mode

Dear Pierre,

I am also stuck with the same problem.

i have proteins domain start and end position as well as the .gff3 files for several plants can you help meto get position of domain start and end on cds? i have seen your page on "backlocate" but it works only on UCSC but my data is not available at UCSC or ncbi. thanks in advance !! i hope u will help me out

ADD REPLY • link 8.4 years ago by amish • 0