Question

Visualize and annotate protein domains within a genomic context, given complement

6 weeks ago

Madde ▴ 20

I have a gene of interest at genomic location 1374721-1384266 in my genome. I know this gene is on the reverse strand, as indicated by "-" in the GFF file.

AP012332.1      Prodigal:2.6    CDS     1374718 1384266 .       -       0       ID=DPDCJFFM_01065;product=hypothetical protein

I am trying to annotate the protein domains within this gene, giving both the correct genomic nucleotide positions and the true amino acid positions. Position 1 of the amino acid sequence is the start of the protein, which, because this gene is on the reverse strand, corresponds to the end (the highest coordinate) of the gene's genomic range. I am unsure what the nucleotide positions of the domain are.

Description        Start (aa)   End (aa)   Start (nt)    End (nt)
Acyl transferase   13           335        1374760(?)    1375726(?)

I took the gene from NCBI here: https://ncbi.nlm.nih.gov/protein/757812890 and downloaded the nucleotide and amino acid sequences for input into NCBI's "conserved domains" tool.

Is the sequence from NCBI in the reverse complement orientation? If so, would amino acid #13 - 335 correspond to a different start and end nucleotide position?

What other tools can I use to figure out this problem?

complement protein genomics • 377 views

updated 5 weeks ago by cmdcolin ★ 4.0k • written 6 weeks ago by Madde ▴ 20

score 2 · Accepted Answer · 2024-11-12

When it says, as in your example above, that an Acyl transferase domain starts at amino acid position 13, the genomic position is calculated by counting 13*3 bases (3 bases per codon) back from the end position of your feature. This is because the protein is translated from the reverse strand, going "right to left", so "end to start". You get something like this:

the Acyl transferase domain starts, on the genome, at ~1384266-13*3
the Acyl transferase ends, on the genome, at ~1384266-335*3

I might have an off-by-one error in that calculation, but that's the general idea.
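To pin down the off-by-one, here is a minimal Python sketch of the arithmetic. It assumes a single unspliced CDS with phase 0 (as in the Prodigal GFF line in the question), 1-based inclusive coordinates, and the feature bounds 1374718-1384266 taken from that GFF line rather than from the NCBI record. The function name `aa_range_to_genome` is made up for illustration and is not part of any tool mentioned here.

```python
def aa_range_to_genome(feature_start, feature_end, strand, aa_start, aa_end):
    """Map a 1-based inclusive amino acid range to 1-based inclusive
    genomic coordinates (returned as low, high).

    Assumes one contiguous CDS (no introns) whose first codon begins at
    the feature's 5' end, i.e. phase 0.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("expected 1 <= aa_start <= aa_end")

    if strand == "+":
        # Codon p covers feature_start + 3*(p-1) .. feature_start + 3*p - 1
        nt_low = feature_start + 3 * (aa_start - 1)
        nt_high = feature_start + 3 * aa_end - 1
    elif strand == "-":
        # Codon p covers feature_end - 3*p + 1 .. feature_end - 3*(p-1),
        # so the protein start maps to the HIGH genomic coordinate.
        nt_high = feature_end - 3 * (aa_start - 1)
        nt_low = feature_end - 3 * aa_end + 1
    else:
        raise ValueError("strand must be '+' or '-'")

    if nt_low < feature_start or nt_high > feature_end:
        raise ValueError("amino acid range falls outside the feature")
    return nt_low, nt_high


# Acyl transferase domain, aa 13-335, on the reverse-strand CDS from the GFF line
print(aa_range_to_genome(1374718, 1384266, "-", 13, 335))
# → (1383262, 1384230)
```

Under these assumptions the domain covers 1383262-1384230 on the genome, which is 969 bases, or 323 codons, matching 335 - 13 + 1. Read in protein order, the domain begins at 1384230 and ends at 1383262.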

I created a tool that can help you map between genome and protein coordinate systems here: https://github.com/cmdcolin/g2p_mapper_cli

Another, probably more common, way to do it is with something like a TxDb object from the GenomicFeatures package: https://bioconductor.org/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.html