Understanding the annotation tag
5.0 years ago

I am having trouble understanding the annotation tag. For example, NM_173900_utr3_0_0_chr5_104123604_r is one of the annotation tags. The annotation file was downloaded as a BED file from the UCSC Table Browser.

I understand that NM_173900 is the NCBI accession ID of the gene, and that _utr3_ means it is in the 3' UTR.
I also understand that the later part, chr5_104123604_r, means it is on chr5 at the given position, and that _r means it is on the reverse (-) strand.

What has been bothering me are the two zeros in the middle. I am not able to figure out what they mean.

I have put here multiple examples:

NM_001075941_utr3_3_0_chr27_1251478_f

NM_001193172_up_2000_chr7_57415651_f

NM_001192104_cds_13_0_chr2_44344566_f

Please help me understand the numbers in the middle.

Thank you, Suraj

annotation NCBI UCSC • 1.3k views

What is the "UCSC annotation tag"? How did you get those identifiers?

I got the annotation files from UCSC table browser. The annotation file looks like this.

chr1    1000624 1002224 NM_001034679_utr3_3_0_chr1_1000625_f    0   +
chr1    1046829 1047018 NM_001077977_utr3_2_0_chr1_1046830_f    0   +
chr1    1099124 1099325 NM_001077124_utr3_0_0_chr1_1099125_r    0   -
chr1    1102878 1103061 NM_001114516_utr3_0_0_chr1_1102879_r    0   -
chr1    1105542 1105555 NM_001114516_utr3_1_0_chr1_1105543_r    0   -

Annotation of what? What did you query?

I don't understand what information you want. Could you please explain?

The annotation file was downloaded as a bed file from UCSC table browser.

Table browser is not just a tool where you click once and it magically gives you data. You have to choose what you want to download. Without knowing what you downloaded, we have no idea what it means. So what did you download?

Thank you for clarifying. I downloaded the annotation for 3' UTR exons from the UCSC Table Browser with the following selections: clade - Mammal; genome - Cow; assembly - Apr. 2018; group - Genes and Gene Predictions; track - NCBI RefSeq; region - genome; output format - BED; output file - 3' UTR. Then, on the get output page, I selected 3' UTR Exons and pressed get BED. This downloaded a file that I saved on my computer. A few lines of this file look like this:

chr1 1000624 1002224 NM_001034679_utr3_3_0_chr1_1000625_f 0 +
chr1 1046829 1047018 NM_001077977_utr3_2_0_chr1_1046830_f 0 +
chr1 1099124 1099325 NM_001077124_utr3_0_0_chr1_1099125_r 0 -
chr1 1102878 1103061 NM_001114516_utr3_0_0_chr1_1102879_r 0 -

I want to understand what the numbers in the middle of the annotation tag mean.

I think the first part indicates the region: utr3_3 indicates the 4th exon in the 3' UTR, up indicates upstream, and cds_13 indicates the 14th exon, which is coding. The second part indicates the relative position of the indicated region. NM_001193172_up_2000_chr7_57415651_f means that chr7:57415651:+ is the location 2000 bp upstream of NM_001193172. NM_001192104_cds_13_0_chr2_44344566_f means that chr2:44344566:+ is the first position of the 14th exon in NM_001192104, and that this exon is a coding exon.

5.0 years ago
Luis Nassar ▴ 670

Hello,

We have an FAQ entry (albeit a bit hidden) on this question: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download34

The fourth column of the BED output contains a lot of information separated by underscores. For example:

uc009vjk.2_cds_1_0_chr1_324343_f

This information is represented as follows:

ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand

Keep in mind that ucscId in this case is just the primary identifier for the track, so here it is the NCBI accession. The third field, "sequenceTypeNumber", is defined as follows:

Sequence Type Number: for every transcript, there will be a row for each sequence type (cds or intron) and this identifies which is represented in this row; the first is denoted with 0. So, if you requested exons, and a particular transcript has 10 exons, you will see a row for each one and in this position they will be numbered 0-9.

And the fourth field, "basesAdded":

Bases Added: number of bases added to the regions requested.

So for the results you are observing, the first number is the exon/intron number (0, 1, 2, etc.) and the second number is the requested padding (0 in this case).
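
For anyone who wants to split these names programmatically, here is a minimal Python sketch based on the field layout described above. It is not a UCSC tool: the function name parse_name, the regular expression, and the output keys are all made up for illustration. It assumes that chromosome names start with "chr" and that the sequence type is a lowercase token such as utr3, cds, or up. Upstream items in this thread (for example up_2000) carry only one number, so the numbers are returned as a tuple rather than forced into two named fields.

```python
import re

# The accession may itself contain underscores (NM_173900), so it is matched
# lazily up to the first lowercase sequence-type token (utr3, cds, up, ...).
# Assumption: the chromosome name always begins with "chr".
NAME_RE = re.compile(
    r"^(?P<item_id>.+?)"
    r"_(?P<seq_type>[a-z]+[0-9]?)"
    r"_(?P<first_number>\d+)"
    r"(?:_(?P<second_number>\d+))?"
    r"_(?P<chrom>chr.+)"
    r"_(?P<position>\d+)"
    r"_(?P<strand>[fr])$"
)


def parse_name(name):
    """Split a Table Browser BED name field into its parts (hypothetical helper)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised name field: {name!r}")
    fields = match.groupdict()
    # One number for items like up_2000, two for items like utr3_3_0
    # (sequenceTypeNumber, basesAdded).
    numbers = tuple(
        int(fields[key])
        for key in ("first_number", "second_number")
        if fields[key] is not None
    )
    return {
        "item_id": fields["item_id"],
        "seq_type": fields["seq_type"],
        "numbers": numbers,
        "chrom": fields["chrom"],
        "position": int(fields["position"]),
        # f/r in the name correspond to +/- in the BED strand column.
        "strand": "+" if fields["strand"] == "f" else "-",
    }


if __name__ == "__main__":
    for example in (
        "NM_001034679_utr3_3_0_chr1_1000625_f",
        "NM_001193172_up_2000_chr7_57415651_f",
        "uc009vjk.2_cds_1_0_chr1_324343_f",
    ):
        print(parse_name(example))
```

In the sample rows posted above, the position embedded in the name is the BED start plus one (for example, start 1000624 and name position 1000625), which is what you would expect if the name uses a 1-based coordinate while BED starts are 0-based.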

If you have any additional questions, you can reach out to our help desk at genome@soe.ucsc.edu. We check Biostars every now and again, but your question may go unanswered for some time. If you post a question here, using the "ucsc" tag helps to put it on our radar.

Lou
UCSC Genome Browser
