Question

Repeat names with "dup" at the end in repeatMasker GTF (UCSC) file?

7.1 years ago

mbk0asis ▴ 700

Hi, all!

I'm currently working with the repeatMasker GTF file from UCSC, and I found some of the transcript ids were tagged with "dup" like below.

gene_id "AluY"; transcript_id "AluY_dup1";

I first thought they were unique IDs for each element, but many elements with the same ID exist at various loci with varying lengths.

Name    Size    Location
ERV24B_Prim-int    287     chrX:42414944-42415231
ERV24B_Prim-int    194     chr12:132958913-132959107
ERV24B_Prim-int    399     chr11:7878526-7878925
ERV24B_Prim-int    341     chr21:44875114-44875455
ERV24B_Prim-int    3636    chr7:154925107-154928743
ERV24B_Prim-int_dup1    143     chr1:13401681-13401824
ERV24B_Prim-int_dup1    189     chr2:1330786-1330975
ERV24B_Prim-int_dup1    217     chr20:37081775-37081992
ERV24B_Prim-int_dup1    1823    chr3:26276217-26278040
ERV24B_Prim-int_dup1    130     chr4:27866205-27866335
ERV24B_Prim-int_dup1    133     chr5:44478664-44478797
ERV24B_Prim-int_dup1    373     chr6:40084087-40084460

I also compared the sequences of the elements with the same ID, but couldn't find any sequence similarity.

Can someone explain what the "dup"-tagged IDs mean and why there are so many of them?

Thank you!

repeatMasker UCSC • 2.1k views

updated 7.1 years ago by genecats.ucsc ▴ 580 • written 7.1 years ago by mbk0asis ▴ 700

score 0 · Answer 1 · 2018-03-26

Here are the first few lines from the rmsk table for hg38, where I have filtered for items like "ERV24B":

#filter: (repName like 'ERV24B%')
#bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id
675 1260 217 16 38 chr1 11902369 11902632 -237053790 + ERV24B_Prim-int LTR ERV1 1 257 -7356 1
687 444 285 7 0 chr1 13401680 13401824 -235554598 + ERV24B_Prim-int LTR ERV1 4350 4494 -3119 2

Here are the first two lines as BED output from the Table Browser:

chr1 11902369 11902632 ERV24B_Prim-int 1260 +
chr1 13401680 13401824 ERV24B_Prim-int 444 +

And here are the first two lines as GTF output from the Table Browser:

chr1 hg38_rmsk exon 11902370 11902632 1260.000000 + . gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int";
chr1 hg38_rmsk exon 13401681 13401824 444.000000 + . gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int_dup1";

From these three outputs, we can see that the "dup" items are just the same type of repeat in a different genomic location, with a different score (1260 vs. 444 in this example).
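To check this on your own file, here is a minimal Python sketch that strips the suffix from transcript_id and compares the result with gene_id. It assumes the suffix always looks like "_dup" followed by digits, which is what the lines above show but is not something I have verified for every repeat name. The helper names parse_attributes and base_repeat_name are made up for this example.

```python
import re

# Assumption: the suffix always has the form "_dup" followed by digits.
DUP_SUFFIX = re.compile(r"_dup\d+$")
# GTF attributes in the ninth column look like: key "value";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)";')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE.findall(field))


def base_repeat_name(transcript_id):
    """Drop a trailing _dupN suffix, if present, to recover the repeat name."""
    return DUP_SUFFIX.sub("", transcript_id)


# The two GTF lines from the Table Browser output above (tab-separated).
gtf_lines = [
    'chr1\thg38_rmsk\texon\t11902370\t11902632\t1260.000000\t+\t.\t'
    'gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int";',
    'chr1\thg38_rmsk\texon\t13401681\t13401824\t444.000000\t+\t.\t'
    'gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int_dup1";',
]

for line in gtf_lines:
    columns = line.split("\t")
    attrs = parse_attributes(columns[8])
    name = base_repeat_name(attrs["transcript_id"])
    # The last value reports whether the stripped ID matches gene_id.
    print(columns[0], columns[3], columns[4], name, name == attrs["gene_id"])
```

If the stripped transcript_id matches gene_id on every line of your file, the suffix carries no information beyond "another copy of this repeat".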

From looking at more output (all the chr1 items, for instance), it appears that the "ERV24B_Prim-int" items on a chromosome are given increasing "dup" IDs, which is probably a result of known issues with GTF output from the Table Browser.
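If you need an ID that is unique per element, one option is to ignore the "dup" numbering and build the ID from the coordinates in the rmsk table. The sketch below is only an illustration: the function names parse_rmsk_row and locus_id and the "name|chrom:start-end(strand)" layout are my own choices, not a UCSC convention. The +1 on the start matches the difference between the table/BED start and the GTF start in the outputs above.

```python
# Column names copied from the rmsk table header shown above.
RMSK_COLUMNS = (
    "bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd "
    "genoLeft strand repName repClass repFamily repStart repEnd repLeft id"
).split()


def parse_rmsk_row(line):
    """Split one whitespace-separated rmsk row into a column -> value dict."""
    return dict(zip(RMSK_COLUMNS, line.split()))


def locus_id(row):
    """Build a per-locus ID (hypothetical format) from name and coordinates."""
    # The table start is one less than the GTF start in the outputs above,
    # so add 1 to get coordinates that match the GTF file.
    start = int(row["genoStart"]) + 1
    return (
        f'{row["repName"]}|{row["genoName"]}:'
        f'{start}-{row["genoEnd"]}({row["strand"]})'
    )


# The two rmsk rows from the hg38 table output above.
rmsk_rows = [
    "675 1260 217 16 38 chr1 11902369 11902632 -237053790 + "
    "ERV24B_Prim-int LTR ERV1 1 257 -7356 1",
    "687 444 285 7 0 chr1 13401680 13401824 -235554598 + "
    "ERV24B_Prim-int LTR ERV1 4350 4494 -3119 2",
]

for line in rmsk_rows:
    print(locus_id(parse_rmsk_row(line)))
```

An ID built this way stays distinct even when the same "dup" number shows up at more than one locus.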

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

General questions: genome@soe.ucsc.edu
Questions involving private data: genome-www@soe.ucsc.edu
Questions involving mirror sites: genome-mirror@soe.ucsc.edu

ChrisL from the UCSC Genome Browser