I have been using biomaRt, the R interface to BioMart, to obtain exon sequences associated with certain genes. The problem is that I cannot find an appropriate BioMart attribute for the position/offset of the first codon in an exon (e.g. if the exon is 5' ACGTCAT...3', whether the first codon is ACG, CGT, or GTC).
The closest attribute that I could find is "phase," which describes where the boundary with the preceding intron falls within a codon. As I understand it, a phase of 0 means that nucleotide 1 of the exon is the first base of a codon (ACG above), a phase of 1 means that the first complete codon starts at nucleotide 3 (GTC), and a phase of 2 means that it starts at nucleotide 2 (CGT). Is this correct? More importantly, is there some BioMart/Ensembl attribute that I could query that would give me the starting codon position for every exon?
Basically, I am trying to find the starting nucleotide/codon of the open reading frame for every exon associated with a gene. As far as I can tell, there is no attribute in BioMart that will let me do this directly.
The concern that I have is whether I can use "phase" to determine where the first codon begins in the exon. Am I correct in my inference of that position from the start phase (i.e. 0 corresponding to start = 1, 1 to start = 3, 2 to start = 2)?
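To make the mapping in the question concrete, here is a minimal Python sketch of it. It assumes the usual Ensembl convention that phase is the codon position (0, 1 or 2) of the exon's first base, and that a phase of -1 marks an exon that starts in non-coding sequence; the function name `first_codon_start` is only illustrative and is not part of biomaRt.

```python
from typing import Optional


def first_codon_start(phase: int) -> Optional[int]:
    """Return the 1-based exon position where the first complete codon begins.

    Assumes Ensembl-style start phase: 0, 1 or 2 for coding exon starts,
    and -1 when the exon starts in non-coding sequence (no answer from phase alone).
    """
    if phase == -1:
        return None
    if phase not in (0, 1, 2):
        raise ValueError(f"unexpected phase: {phase}")
    # Bases left over from the previous codon: 0 for phase 0, 2 for phase 1, 1 for phase 2.
    leftover = (3 - phase) % 3
    return leftover + 1


exon = "ACGTCAT"
for phase in (0, 1, 2):
    start = first_codon_start(phase)
    codon = exon[start - 1:start + 2]
    print(f"phase {phase}: first complete codon starts at nucleotide {start} ({codon})")
```

With the example exon from the question this gives ACG for phase 0, GTC for phase 1 and CGT for phase 2.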
Hi Max. There are a few options, none of which I think are exactly what you want. For an exon you can get:
Genomic coding start: the position of the first coding nucleotide in that exon, relative to the genome
cDNA coding start: the position of the first coding nucleotide in that exon, relative to the cDNA
CDS start: the position of the first coding nucleotide in that exon, relative to the coding sequence
You may be able to use some combination of these to get what you need.
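As one possible combination, the CDS start on its own is enough to work out the reading frame of an exon. The sketch below is in Python and assumes the CDS start is 1-based and counted along the spliced coding sequence, so that position 1 is the first base of the start codon; the function name is hypothetical.

```python
def bases_to_first_full_codon(cds_start: int) -> int:
    """Number of bases to skip from the exon's first coding nucleotide
    before a complete codon begins.

    Assumes cds_start is a 1-based coordinate in the spliced coding sequence.
    """
    if cds_start < 1:
        raise ValueError("cds_start must be a 1-based position")
    # Codon position (0, 1 or 2) of the exon's first coding base.
    codon_position = (cds_start - 1) % 3
    return (3 - codon_position) % 3


# First coding base of the exon is CDS position 122, i.e. the 2nd base of a codon,
# so two bases belong to the previous codon.
print(bases_to_first_full_codon(122))
```

For an exon that begins in the 5' UTR, the first coding nucleotide is CDS position 1, so nothing is skipped and the first codon is the start codon itself.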
Hello, I have the same question and did not want to open a new one. I was wondering: since I can find the "frame" information in the GTF file, what is the importance of the phase (and end_phase) attribute? Shouldn't the phase and frame information be the same?
Thank you
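For comparing the two fields, here is a small Python sketch under the conventions as I understand them: the GTF/GFF frame is the number of bases to skip at the start of a CDS feature to reach the next complete codon, while the Ensembl phase is the codon position of the exon's first base. Under those assumptions the values match only at 0, and 1 and 2 are swapped. The function name is illustrative.

```python
from typing import Optional


def ensembl_phase_to_gtf_frame(phase: int) -> Optional[int]:
    """Convert an Ensembl exon start phase to the equivalent GTF frame.

    Assumes: phase = codon position of the first base (0, 1, 2; -1 = non-coding start),
    frame = number of bases to skip to reach the next complete codon.
    """
    if phase == -1:
        # The exon starts in UTR; the frame belongs to the CDS feature, not the exon start.
        return None
    if phase not in (0, 1, 2):
        raise ValueError(f"unexpected phase: {phase}")
    return (3 - phase) % 3


for phase in (0, 1, 2):
    print(f"phase {phase} -> frame {ensembl_phase_to_gtf_frame(phase)}")
```

Note also that in a GTF file the frame is attached to CDS features, whereas phase and end_phase are attributes of whole exons, which is why a phase of -1 has no direct frame equivalent here.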