11.4 years ago
Max
I have been using the UCSC Genome Browser to extract exon sequences and their coordinates. One problem I've encountered is that in the annotation tables (which include exonStarts, exonEnds, exonFrames, etc.), the exonStarts/exonEnds values are close to, but do not exactly match, the coordinates reported with the sequences.
For instance, notice that the start (exonStarts) and end (exonEnds) positions for the exons below are close to, but don't quite match, those in the sequence data:
name: NM_032291
chrom: chr1
strand: +
exonStarts: 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755,
exonEnds: 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768,
exonFrames: 0,1,2,0,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,
.
>hg19_refGene_NM_032291_0 range=chr1:67000042-67000051 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGATGGAAG
>hg19_refGene_NM_032291_1 range=chr1:67091530-67091593 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GATTGAAAAAACGTACAAGGAAGGCCTTTGGAATACGGAAGAAAGAAAAG
GACACTGATTCTAC
>hg19_refGene_NM_032291_2 range=chr1:67098753-67098777 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AGGTTCACCAGATAGAGATGGAATT
>hg19_refGene_NM_032291_3 range=chr1:67101627-67101698 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CAGCCCAGCCCACACGAACCACCCTACAATAGCAAAGCAGAGTGTGCGCG
TGAAGGAGGAAAAAAAGTTTCG
Unfortunately, the sequences and the data tables have to be retrieved separately, so I'm left trying to reconcile the two conflicting sets of coordinates.
What should we see in your example? Where is the problem?
To give a specific example, the first exon from the list is:
Now, the coordinates that are given for this exon are: 66999824 (START), 67000051 (STOP), while the exonFrame is 0.
The first issue is the mismatch between the start positions (by a single nucleotide, though for other exons the mismatch can be 2 or more). The second issue is how to interpret the exonFrames variable. If the exonFrames value is 0, is this with respect to the entire exon, or just with respect to the CDS region?
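A possible explanation for the one-base offset, offered as a sketch and not a confirmed diagnosis: UCSC database tables store coordinates 0-based with an exclusive end, whereas the range= field in the sequence headers is 1-based and inclusive. The snippet below (the function name to_one_based is made up for illustration) adds 1 to each exonStarts value for the first four exons quoted above and compares the result with the ranges in the FASTA headers. The first exon still differs at its start, presumably because the sequence output was restricted to coding bases and that exon begins with 5' UTR.

```python
def to_one_based(exon_starts, exon_ends):
    """Convert table coordinates (0-based start, exclusive end)
    to 1-based inclusive (start, end) pairs.

    Both arguments are comma-separated strings with a trailing comma,
    as they appear in the exonStarts/exonEnds columns.
    """
    starts = [int(s) for s in exon_starts.rstrip(",").split(",")]
    ends = [int(e) for e in exon_ends.rstrip(",").split(",")]
    # Only the start shifts by one; the exclusive end equals the inclusive end.
    return [(s + 1, e) for s, e in zip(starts, ends)]


# First four exons of NM_032291, copied from the table above.
table_ranges = to_one_based(
    "66999824,67091529,67098752,67101626,",
    "67000051,67091593,67098777,67101698,",
)

# range= values copied from the four FASTA headers above.
fasta_ranges = [
    (67000042, 67000051),
    (67091530, 67091593),
    (67098753, 67098777),
    (67101627, 67101698),
]

for table, fasta in zip(table_ranges, fasta_ranges):
    status = "match" if table == fasta else "differ"
    print(table, fasta, status)
```

With these inputs, exons 1 to 3 agree exactly after the conversion, and exon 0 agrees at its end but not at its start.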
It is a bit hard to follow exactly what you are doing here... Can you provide the precise queries you are performing to obtain these data? Are you doing this manually in the browser or using an SQL query? What is your ultimate goal?
I've been working manually with the browser.
Basically, I need the following information: the coding exon sequence (excluding 5' and 3' UTR); the coordinates of the coding exon sequence; and the reading frame (0, 1, 2) of the exon sequence.
Is there some way of obtaining this information with a single query?
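One way to get the coordinates and frames from a single table row, instead of combining two downloads, is to clip each exon to the CDS and count coding bases. The sketch below assumes a plus-strand transcript and takes cdsStart/cdsEnd (columns of the refGene table that are not shown in the excerpt above) as inputs. The cds_start value 67000041 is inferred from the first FASTA header, not read from the table, and cds_end is a placeholder set to the end of the fourth exon purely so the example runs on the four exons quoted. The function name coding_exons is hypothetical.

```python
def coding_exons(exon_starts, exon_ends, cds_start, cds_end):
    """Return (start_1based, end_inclusive, frame) for the coding part
    of each exon. Plus strand only; inputs use table coordinates
    (0-based start, exclusive end).
    """
    starts = [int(s) for s in exon_starts.rstrip(",").split(",")]
    ends = [int(e) for e in exon_ends.rstrip(",").split(",")]
    result = []
    coding_bases = 0  # coding bases seen in earlier exons
    for s, e in zip(starts, ends):
        cs, ce = max(s, cds_start), min(e, cds_end)  # clip exon to CDS
        if cs >= ce:
            continue  # exon is entirely UTR
        result.append((cs + 1, ce, coding_bases % 3))
        coding_bases += ce - cs
    return result


exons = coding_exons(
    "66999824,67091529,67098752,67101626,",
    "67000051,67091593,67098777,67101698,",
    cds_start=67000041,  # assumption: inferred from range=chr1:67000042-...
    cds_end=67101698,    # assumption: placeholder, not the real cdsEnd
)
for start, end, frame in exons:
    print(start, end, frame)
```

On these four exons the computed frames are 0, 1, 2, 0, which agrees with the first four exonFrames values in the table. That suggests the frame is counted over the CDS and not over the whole exon, although this has only been checked against the example here. The sequence itself would still need to be fetched for each returned range.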
Is this question still relevant? If you have solved it, you should have posted your own answer. Besides, it isn't clear what is being asked here. Voting to close.