Hi, all!
I'm currently working with the RepeatMasker GTF file from UCSC, and I found that some of the transcript IDs are tagged with "dup", like below.
gene_id "AluY"; transcript_id "AluY_dup1";
At first I thought these were unique IDs for each element, but many elements with the same ID exist at various loci with varying lengths.
Name Size Location
ERV24B_Prim-int 287 chrX:42414944-42415231
ERV24B_Prim-int 194 chr12:132958913-132959107
ERV24B_Prim-int 399 chr11:7878526-7878925
ERV24B_Prim-int 341 chr21:44875114-44875455
ERV24B_Prim-int 3636 chr7:154925107-154928743
ERV24B_Prim-int_dup1 143 chr1:13401681-13401824
ERV24B_Prim-int_dup1 189 chr2:1330786-1330975
ERV24B_Prim-int_dup1 217 chr20:37081775-37081992
ERV24B_Prim-int_dup1 1823 chr3:26276217-26278040
ERV24B_Prim-int_dup1 130 chr4:27866205-27866335
ERV24B_Prim-int_dup1 133 chr5:44478664-44478797
ERV24B_Prim-int_dup1 373 chr6:40084087-40084460
I also compared the sequences of elements with the same ID, but couldn't find any sequence similarity.
Can someone explain how the "dup"-tagged IDs differ from the untagged ones, and why there are so many of them?
Thank you!
Thank you for the explanation! So, whether or not an item has the "dup" tag, all of them are the same type of repeat, and the redundant "dup" tags are an error in the Table Browser? If so, can I just ignore the "dup" tags? Am I understanding this right?
Sorry for the late reply, but yes, that is correct. The problem here is the GTF output function of the Table Browser; there have been issues with that specific function for a long time.
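If you would rather keep the GTF and just drop the tags, something like the following might work. It is only a rough sketch: it assumes the tag always has the form `_dup` followed by digits at the very end of the transcript_id (as in `AluY_dup1` above), and the function names `strip_dup` and `normalize_gtf_line` are made up for illustration.

```python
import re

# Assumption: the Table Browser tag is always "_dup<digits>" at the end of the ID.
_DUP_SUFFIX = re.compile(r"_dup\d+$")
# Same assumption, applied to the transcript_id attribute inside a full GTF line.
_DUP_IN_ATTR = re.compile(r'(transcript_id "[^"]*?)_dup\d+(")')


def strip_dup(transcript_id: str) -> str:
    """Return the ID without a trailing _dupN tag (unchanged if there is none)."""
    return _DUP_SUFFIX.sub("", transcript_id)


def normalize_gtf_line(line: str) -> str:
    """Remove the _dupN tag from the transcript_id attribute of one GTF line."""
    return _DUP_IN_ATTR.sub(r"\1\2", line)


if __name__ == "__main__":
    print(strip_dup("ERV24B_Prim-int_dup1"))
    print(normalize_gtf_line('gene_id "AluY"; transcript_id "AluY_dup1";'))
```

Keep in mind that after stripping, the transcript IDs are no longer distinct from each other, so any downstream tool that expects unique transcript_id values may need a different key, for example the genomic coordinates.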
If you want, you can just download the rmsk table and work on it directly: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
And here is a link to the schema of that file: http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=rep&hgta_track=rmsk&hgta_table=rmsk&hgta_doSchema=describe+table+schema
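For reading that file, here is a minimal Python sketch. It assumes the tab-separated column order given in the schema linked above (bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd, genoLeft, strand, repName, repClass, repFamily, ...), so please check it against the schema before relying on it. The names `Repeat`, `parse_rmsk` and `read_rmsk` are hypothetical helpers, not part of any UCSC tool.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class Repeat(NamedTuple):
    chrom: str       # genoName
    start: int       # genoStart
    end: int         # genoEnd
    strand: str
    name: str        # repName, e.g. "AluY"
    rep_class: str   # repClass
    rep_family: str  # repFamily

    @property
    def size(self) -> int:
        # Matches the "Size" column in the question (end minus start).
        return self.end - self.start


def parse_rmsk(lines: Iterable[str]) -> Iterator[Repeat]:
    """Yield one Repeat per tab-separated rmsk line (assumed column order)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        yield Repeat(fields[5], int(fields[6]), int(fields[7]),
                     fields[9], fields[10], fields[11], fields[12])


def read_rmsk(path: str) -> Iterator[Repeat]:
    """Stream records from a local copy of rmsk.txt.gz."""
    with gzip.open(path, "rt") as handle:
        yield from parse_rmsk(handle)
```

With a downloaded copy, a loop such as `for rep in read_rmsk("rmsk.txt.gz"): ...` gives you the repeat name together with its coordinates, with no "dup" tags involved.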