Question

Map genome positions onto protein coordinates?

2

19 months ago

cmdcolin ★ 4.2k

I am looking for a way to do the following

1) Reliably find a protein structure (e.g. a PDB file or precomputed AlphaFold results) that is associated with a particular gene/transcript isoform. I found a way to do this, somewhat, for human genes using BioMart, but I'd like to be able to do it for any species (reason: I make tools, and I want people to be able to use my tool on any species of interest).

2) Find a way to map genome coordinates onto that protein structure (3D position is relevant, but I guess just knowing the index into the 1D amino acid chain gets you most of the way there?). I feel like this is something variant annotation tools do, but is there a small, purpose-built tool that does this without full-fledged variant annotation? My current approach just reads the GFF, walks the CDS features three bases at a time, and increments an amino acid counter, but I have a feeling this is not the most reliable way of doing things.
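A minimal sketch of the "count CDS bases in threes" approach described in point 2, written in JavaScript to match the snippet later in this thread. The function name `genomeToProteinIndex` and the `{start, end}` feature shape are hypothetical, chosen only for illustration. It assumes 1-based inclusive GFF coordinates, CDS features from a single transcript, a first CDS segment that starts in phase 0, and no frameshifts or other translational exceptions, which are exactly the cases where this approach can break.

```javascript
// Map a 1-based genomic position to a 1-based amino acid index.
// cdsFeatures: [{ start, end }] in 1-based inclusive coordinates (as in GFF),
//              all belonging to ONE transcript (assumption).
// strand: '+' or '-'.
// Returns { aa, codonPos } or undefined if pos is not inside any CDS segment.
function genomeToProteinIndex(cdsFeatures, strand, pos) {
  // Walk the segments in transcript order: ascending for '+', descending for '-'.
  const sorted = [...cdsFeatures].sort((a, b) =>
    strand === '-' ? b.start - a.start : a.start - b.start
  );
  let offset = 0; // coding bases seen before the current segment
  for (const { start, end } of sorted) {
    if (pos >= start && pos <= end) {
      // Distance from the 5' end of this segment, in transcript direction.
      const within = strand === '-' ? end - pos : pos - start;
      const cdsIndex = offset + within; // 0-based index into the spliced CDS
      return { aa: Math.floor(cdsIndex / 3) + 1, codonPos: cdsIndex % 3 };
    }
    offset += end - start + 1;
  }
  return undefined; // intronic, UTR, or intergenic position
}

// Example: two CDS segments of 6 bases each on the plus strand.
const cds = [{ start: 100, end: 105 }, { start: 200, end: 205 }];
console.log(genomeToProteinIndex(cds, '+', 105)); // { aa: 2, codonPos: 2 }
console.log(genomeToProteinIndex(cds, '-', 205)); // { aa: 1, codonPos: 0 }
```

The returned index refers to the translated transcript, not to residue numbering in a PDB file, which may start at a different offset or cover only part of the protein.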

footnote: my gene to pdb structure biomart query i found...useful for now, but would be interested in finding a similar thing for other species http://useast.ensembl.org/biomart/martview/643c564ac8b632a4791ea866fb79f8e5?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id_version|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id_version|hsapiens_gene_ensembl.default.feature_page.pdb&FILTERS=&VISIBLEPANEL=attributepanel

pdb protein variant • 2.4k views

ADD COMMENT • link 18 months ago by cmdcolin ★ 4.2k

0

Regarding

1) Reliably find a protein structure (e.g. a PDB file or precomputed AlphaFold results) that is associated with a particular gene/transcript isoform. I found a way to do this, somewhat, for human genes using BioMart, but I'd like to be able to do it for any species (reason: I make tools, and I want people to be able to use my tool on any species of interest).

For any species of interest maybe https://www.uniprot.org/uniprotkb?query=(database:AlphaFoldDB) is what you are looking for

ADD REPLY • link 19 months ago by andres.firrincieli 3.9k

0

I would love to take advantage of the results on "alphafolddb", however, is there a way to connect the data shown there to genomic coordinates?

ADD REPLY • link 19 months ago by cmdcolin ★ 4.2k

0

When possible, entries in UniProt are cross-referenced with the NCBI GenBank database (see this: link), so there should be a way to recover the genomic coordinates.

ADD REPLY • link 19 months ago by andres.firrincieli 3.9k

0

It is interesting to see that cross-reference to GenBank. I can indeed see from this that there is a "genomic DNA" cross-reference that goes to https://www.ncbi.nlm.nih.gov/protein/ONL99085.1, which in turn references CM007647.1, the Zea mays chromosome coordinates. I will have to check how many UniProt IDs have this, but I like that the coordinates and "joins" in the GenBank file format explicitly map between genomic and protein sequences, instead of relying on potentially sketchy GFF math based on assumptions. It will be a hop, skip, and a jump to get the full pipeline together from all this info, but it is a good lead. Thanks!
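As a rough illustration of reading those "joins", here is a small JavaScript sketch that pulls intervals out of a GenBank-style location string. The function name `parseGenBankLocation` is hypothetical, and the assumption that the string comes from something like a CDS feature's location or a `coded_by` qualifier on the protein record should be checked against the actual records. It handles only `start..end` segments and treats any `complement(` as applying to the whole location, so single-base segments and mixed-strand locations are not covered.

```javascript
// Parse a GenBank-style location such as
//   "complement(join(CM007647.1:100..200,CM007647.1:300..400))"
// into a strand and a list of 1-based inclusive segments.
// Simplifications (assumptions): only "start..end" segments are read,
// partial markers "<" and ">" are ignored, and a "complement(" anywhere
// is taken to mean the whole feature is on the minus strand.
function parseGenBankLocation(loc) {
  const strand = loc.includes('complement(') ? '-' : '+';
  const segments = [];
  const re = /<?(\d+)\.\.>?(\d+)/g;
  let m;
  while ((m = re.exec(loc)) !== null) {
    segments.push({ start: Number(m[1]), end: Number(m[2]) });
  }
  return { strand, segments };
}

console.log(parseGenBankLocation('complement(join(100..200,300..400))'));
// { strand: '-', segments: [ { start: 100, end: 200 }, { start: 300, end: 400 } ] }
```

The resulting segments are in the order written in the record; for a minus-strand feature they still need to be walked from the highest coordinate down to follow the direction of translation.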

ADD REPLY • link 19 months ago by cmdcolin ★ 4.2k

score 1 · Answer 1 · 2023-10-25

1

19 months ago

benformatics 4.1k

In R:

https://bioconductor.org/packages/devel/bioc/vignettes/ensembldb/inst/doc/coordinate-mapping.html

ADD COMMENT • link 19 months ago by benformatics 4.1k

1

Thanks, I believe this helps with my "problem 2"! Somehow that doc link was not working (I think all their "devel" links got removed suddenly), but this one does: https://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/coordinate-mapping.html

This appears to require an "EnsDb", so it is limited to species in Ensembl, but from that link I also found TxDb, which has a function for creating a TxDb from an arbitrary GFF, which may be the type of thing I am looking for: https://bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.html

ADD REPLY • link 19 months ago by cmdcolin ★ 4.2k

score 1 · Answer 2 · 2023-10-26

1

18 months ago

Jiyao Wang ▴ 380

You can use NCBI gene table, e.g., https://www.ncbi.nlm.nih.gov/gene/346689?report=gene_table. In iCn3D, the isoforms, exons and genomic positions are shown for AlphaFold or PDB structures, e.g., https://structure.ncbi.nlm.nih.gov/icn3d/share.html?pA3pPu7LxdiuZDVX7

ADD COMMENT • link 18 months ago by Jiyao Wang ▴ 380

0

this is great to hear. I actually signed up for the iCn3D workshop (today!) so look forward to learning more

ADD REPLY • link 18 months ago by cmdcolin ★ 4.2k

0

Hi Jiyao, this was a great workshop. Do you have any resources about how this can be done programmatically? I see that the iCn3D tool is able to do this to some extent internally, but how does it do so? By querying the NCBI database? Can third-party tools do this? I am trying to come to terms with the fact that the sequence in the PDB can contain multiple proteins, with post-translational modifications, and I want to do these mappings properly rather than 'hackily' assuming that each three letters of the genome map incrementally onto an amino acid chain from the PDB (an assumption that I feel can be broken in many cases).

ADD REPLY • link 18 months ago by cmdcolin ★ 4.2k

1

You need to get the geneID first. For PDB structures, you may be able to retrieve it from the RCSB PDB APIs. For AlphaFold UniProt IDs, you can do it this way:

    url = "https://rest.uniprot.org/uniprotkb/search?format=json&fields=xref_geneid,gene_names&query=" + structure;
    let geneData = await me.getAjaxPromise(url, 'json');
    let geneId = (geneData.results[0] && geneData.results[0].uniProtKBCrossReferences && geneData.results[0].uniProtKBCrossReferences[0]) ? geneData.results[0].uniProtKBCrossReferences[0].id : undefined;
    let geneSymbol = (geneData.results[0] && geneData.results[0].genes && geneData.results[0].genes[0] && geneData.results[0].genes[0].geneName) ? geneData.results[0].genes[0].geneName.value : 'ID ' + geneId;
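For third-party tools that do not have the `me.getAjaxPromise` helper, the same lookup could be sketched with the standard `fetch` API. This is a restatement under assumptions, not the iCn3D implementation: the function names `extractGeneInfo` and `fetchGeneInfo` are hypothetical, the response shape is taken from the snippet above, and a runtime with a global `fetch` (a browser or a recent Node.js) is assumed.

```javascript
// Pull the GeneID cross-reference and gene symbol out of a UniProt search
// response, using the same fields the snippet above reads.
function extractGeneInfo(geneData) {
  const first = geneData?.results?.[0];
  const geneId = first?.uniProtKBCrossReferences?.[0]?.id;
  // Fall back to "ID <geneId>" when no gene name is present.
  const geneSymbol = first?.genes?.[0]?.geneName?.value ?? `ID ${geneId}`;
  return { geneId, geneSymbol };
}

// Query UniProt for one structure/accession string (the same value the
// snippet above calls `structure`) and return its gene info.
async function fetchGeneInfo(structure) {
  const url =
    'https://rest.uniprot.org/uniprotkb/search?format=json' +
    '&fields=xref_geneid,gene_names&query=' +
    encodeURIComponent(structure);
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`UniProt request failed with status ${res.status}`);
  }
  return extractGeneInfo(await res.json());
}
```

Keeping the parsing separate from the network call makes it possible to test `extractGeneInfo` against a saved response, and `geneId` stays `undefined` when the entry has no GeneID cross-reference, so callers should check for that before querying NCBI.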

Then you can use NCBI API to get the isoform and exon information: https://www.ncbi.nlm.nih.gov/Structure/vastdyn/vastdyn.cgi?geneid2isoforms=[geneid], e.g., https://www.ncbi.nlm.nih.gov/Structure/vastdyn/vastdyn.cgi?geneid2isoforms=7157

Or you can parse the information yourself: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&rettype=gene_table&retmode=text&id=7157

ADD REPLY • link 18 months ago by Jiyao Wang ▴ 380

0

big thank you :) I will look into all of this, will probably take a little while!

ADD REPLY • link 18 months ago by cmdcolin ★ 4.2k