Repeat names with "dup" at the end in repeatMasker GTF (UCSC) file?
6.8 years ago
mbk0asis ▴ 700

Hi, all!

I'm currently working with the repeatMasker GTF file from UCSC, and I found some of the transcript ids were tagged with "dup" like below.

gene_id "AluY"; transcript_id "AluY_dup1";

I first thought each one was a unique ID for a single element, but many elements with the same ID exist at different loci and with varying lengths.

Name    Size    Location
ERV24B_Prim-int    287     chrX:42414944-42415231
ERV24B_Prim-int    194     chr12:132958913-132959107
ERV24B_Prim-int    399     chr11:7878526-7878925
ERV24B_Prim-int    341     chr21:44875114-44875455
ERV24B_Prim-int    3636    chr7:154925107-154928743
ERV24B_Prim-int_dup1    143     chr1:13401681-13401824
ERV24B_Prim-int_dup1    189     chr2:1330786-1330975
ERV24B_Prim-int_dup1    217     chr20:37081775-37081992
ERV24B_Prim-int_dup1    1823    chr3:26276217-26278040
ERV24B_Prim-int_dup1    130     chr4:27866205-27866335
ERV24B_Prim-int_dup1    133     chr5:44478664-44478797
ERV24B_Prim-int_dup1    373     chr6:40084087-40084460

I also compared the sequences of the elements with the same ID, but couldn't find any sequence similarity.

Can someone explain what the "dup"-tagged IDs mean, and why there are so many of them?

Thank you!

repeatMasker UCSC • 2.0k views
6.7 years ago
genecats.ucsc ▴ 580

Here are the first few lines from the rmsk table for hg38, where I have filtered for items like "ERV24B":

#filter: (repName like 'ERV24B%')
#bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id
675 1260 217 16 38 chr1 11902369 11902632 -237053790 + ERV24B_Prim-int LTR ERV1 1 257 -7356 1
687 444 285 7 0 chr1 13401680 13401824 -235554598 + ERV24B_Prim-int LTR ERV1 4350 4494 -3119 2

Here are the first two lines as BED output from the Table Browser:

chr1 11902369 11902632 ERV24B_Prim-int 1260 +
chr1 13401680 13401824 ERV24B_Prim-int 444 +

And here are the first two lines as GTF output from the Table Browser:

chr1 hg38_rmsk exon 11902370 11902632 1260.000000 + . gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int";
chr1 hg38_rmsk exon 13401681 13401824 444.000000 + . gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int_dup1";

From these three outputs, we can see that the "dup" items are just the same type of repeat at a different genomic location, with a different score (1260 vs. 444 in this example).

Looking at more output (all the chr1 items, for instance), it appears that the "ERV24B_Prim-int" items on a chromosome get increasing "dup" IDs: the first one has no suffix, the next is "_dup1", and so on. This is probably a result of known issues with GTF output from the Table Browser.
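The numbering pattern described above can be mimicked in a few lines of Python. This is only a sketch of the behaviour seen in this thread, not the Table Browser's actual code; the function name `label_like_observed` and the assumption that the counter restarts on every chromosome are mine.

```python
from collections import defaultdict


def label_like_observed(records):
    """Assign transcript IDs to (chrom, rep_name) pairs given in file order.

    Assumption: the first occurrence of a repeat name on a chromosome keeps
    the plain name, and later occurrences get _dup1, _dup2, ...
    """
    seen = defaultdict(int)  # (chrom, rep_name) -> occurrences so far
    ids = []
    for chrom, name in records:
        count = seen[(chrom, name)]
        ids.append(name if count == 0 else "{}_dup{}".format(name, count))
        seen[(chrom, name)] += 1
    return ids


records = [
    ("chr1", "ERV24B_Prim-int"),
    ("chr1", "ERV24B_Prim-int"),
    ("chr2", "ERV24B_Prim-int"),
    ("chr1", "ERV24B_Prim-int"),
]
for (chrom, _), transcript_id in zip(records, label_like_observed(records)):
    print(chrom, transcript_id)
```

Under that assumption, the same "_dup1" ID turns up once per chromosome, which would explain why elements sharing an ID sit at unrelated loci and have different lengths.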

If you have further questions about UCSC data or tools, feel free to send them to one of the mailing lists below:

  • General questions: genome@soe.ucsc.edu
  • Questions involving private data: genome-www@soe.ucsc.edu
  • Questions involving mirror sites: genome-mirror@soe.ucsc.edu

ChrisL from the UCSC Genome Browser

Thank you for the explanation! So, whether or not an item has the "dup" tag, all of them are the same type of repeat, and the redundant "dup" suffixes are an artifact of the Table Browser? If so, can I just ignore the "dup" tags? Am I understanding this right?

Sorry for the late reply, but yes, that is correct. The problem is the Table Browser's GTF output function, which has had issues for a long time.
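If you decide to ignore the suffix, a minimal Python sketch for stripping it could look like this. It assumes the suffix always has the form "_dup" followed by digits at the very end of the transcript_id value, which matches what this thread shows but is not documented behaviour; `strip_dup_suffix` is a name made up for illustration.

```python
import re

# Assumption: the suffix is "_dup<digits>" at the end of the transcript_id value.
# A repeat whose real name ends that way would be altered too.
_DUP_SUFFIX = re.compile(r"_dup\d+$")
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def strip_dup_suffix(gtf_line):
    """Return the GTF line with any trailing _dupN removed from transcript_id."""
    def _clean(match):
        return 'transcript_id "{}"'.format(_DUP_SUFFIX.sub("", match.group(1)))
    return _TRANSCRIPT_ID.sub(_clean, gtf_line)


line = ('chr1 hg38_rmsk exon 13401681 13401824 444.000000 + . '
        'gene_id "ERV24B_Prim-int"; transcript_id "ERV24B_Prim-int_dup1";')
print(strip_dup_suffix(line))
```

After stripping, transcript_id simply equals gene_id, so it no longer distinguishes individual copies; use the coordinates for that.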

If you want you can just download the rmsk table and work on it directly: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz

And here is a link to the schema of that file: http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=rep&hgta_track=rmsk&hgta_table=rmsk&hgta_doSchema=describe+table+schema
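For working on the rmsk table directly, here is a rough Python sketch that turns its rows into GTF lines with a transcript_id built from the locus. It assumes rmsk.txt.gz is tab-separated with no header line and has the columns shown in the table header earlier in this thread; the function names, the "rmsk" source label, and the `name_chrom_start_end` ID scheme are hypothetical choices, not anything UCSC prescribes.

```python
import csv
import gzip

# Column names copied from the rmsk header line shown earlier in this thread.
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns", "genoName",
    "genoStart", "genoEnd", "genoLeft", "strand", "repName", "repClass",
    "repFamily", "repStart", "repEnd", "repLeft", "id",
]


def rmsk_rows_to_gtf(lines):
    """Yield one GTF 'exon' line per rmsk row (lines: iterable of text rows)."""
    reader = csv.DictReader(lines, fieldnames=RMSK_COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        start = int(row["genoStart"]) + 1  # table start is 0-based, GTF is 1-based
        end = int(row["genoEnd"])
        # Hypothetical ID scheme: repeat name plus locus.
        transcript_id = "{}_{}_{}_{}".format(
            row["repName"], row["genoName"], start, end)
        attrs = 'gene_id "{}"; transcript_id "{}";'.format(
            row["repName"], transcript_id)
        yield "\t".join([row["genoName"], "rmsk", "exon", str(start), str(end),
                         row["swScore"], row["strand"], ".", attrs])


def convert(path="rmsk.txt.gz"):
    """Print GTF lines for a local copy of the downloaded table."""
    with gzip.open(path, "rt", newline="") as handle:
        for gtf_line in rmsk_rows_to_gtf(handle):
            print(gtf_line)
```

Calling `convert()` on the downloaded file prints one line per repeat. The IDs stay distinct as long as no two rows share both repeat name and coordinates.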
