Question

Repetitive sequences analysis

5 weeks ago

frarodmar17 • 0

I am trying to analyse RNA-seq data and I want to use a RepeatMasker GTF annotation file to quantify the reads. However, I have run into the same problem with every RepeatMasker annotation file from UCSC: each file contains duplicated identifiers (transcript_id, for example), and I do not know how to deal with that. I read in another post that this file does not have the problem: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz. I have tried several ways of converting it into a format that featureCounts accepts, but none of them worked. I would appreciate any new ideas or alternatives.

Repetitive sequences • 431 views

updated 5 weeks ago by rfran010 ★ 1.4k • written 5 weeks ago by frarodmar17 • 0

score 0 · Answer 1 · 2025-02-20

5 weeks ago

JC 13k

You can convert the RepeatMasker output (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz) to a valid GTF with a short script, but I guess featureCounts will still have trouble assigning reads, because reads from repeats tend to map to multiple locations.
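To make "some script" concrete, here is a minimal Python sketch of such a conversion. It assumes the standard whitespace-separated RepeatMasker `.out` layout (score, three percentage columns, sequence name, begin, end, left, strand, repeat name, class/family, ...). The function name `rm_out_to_gtf`, the `rmsk` source label, the `_dupN` suffix and the `family_id`/`class_id` attributes are illustrative choices, not something prescribed by UCSC or featureCounts.

```python
from collections import defaultdict


def rm_out_to_gtf(lines, source="rmsk"):
    """Yield GTF lines from RepeatMasker .out lines (hypothetical helper).

    Each locus gets a unique transcript_id (<repeat name>_dup<N>), while
    gene_id stays the repeat name, so identifiers are never duplicated.
    """
    seen = defaultdict(int)  # how many loci we have seen per repeat name
    for line in lines:
        f = line.split()
        # Skip the header lines and blank lines: data rows start with an integer score.
        if len(f) < 11 or not f[0].isdigit():
            continue
        chrom, start, end = f[4], f[5], f[6]   # .out coordinates are 1-based, like GTF
        strand = "-" if f[8] == "C" else "+"   # RepeatMasker writes "C" for complement
        name = f[9]
        rep_class, _, rep_family = f[10].partition("/")
        rep_family = rep_family or rep_class   # e.g. "Simple_repeat" has no family part
        seen[name] += 1
        attrs = (
            f'gene_id "{name}"; transcript_id "{name}_dup{seen[name]}"; '
            f'family_id "{rep_family}"; class_id "{rep_class}";'
        )
        yield "\t".join([chrom, source, "exon", start, end, f[0], strand, ".", attrs])
```

One way to use it is to open the file with `gzip.open("hg38.fa.out.gz", "rt")` and write each yielded line to a `.gtf` file. This only fixes the identifiers; it does not change how multi-mapping reads are handled.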

score 0 · Answer 2 · 2025-02-20

I recommend the Hammell lab GTF files:

https://www.dropbox.com/scl/fo/jdpgn6fl8ngd3th3zebap/ALDQ94uFrf3r1QM1zoT9jHU/TEtranscripts?dl=0&rlkey=41oz6ppggy82uha5i3yo1rnlx&subfolder_nav_tracking=1

They are basically the rmsk table reformatted into a more logical GTF: the gene_id is the element name and each individual locus gets its own transcript_id. The files also keep track of "family" and "class".

Otherwise, I recommend reformatting the rmsk into SAF format.
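A rough Python sketch of the SAF route, in case it helps. It assumes the column order of the UCSC rmsk table (genoName, genoStart, genoEnd, strand and repName in columns 6, 7, 8, 10 and 11, with 0-based start coordinates); please check it against the table schema for your assembly. The function names and the `_dup` suffix are made up for this example.

```python
import gzip
from collections import defaultdict

# 0-based column indices in rmsk.txt (assumed from the UCSC rmsk table schema)
CHROM, START, END, STRAND, REP_NAME = 5, 6, 7, 9, 10


def rmsk_to_saf(lines, per_locus=True):
    """Yield SAF lines (GeneID, Chr, Start, End, Strand) from rmsk rows.

    per_locus=True  -> one unique GeneID per locus (<repName>_dup<N>)
    per_locus=False -> GeneID is the repeat name, so all loci of an
                       element share one identifier
    """
    seen = defaultdict(int)  # running locus count per repeat name
    yield "GeneID\tChr\tStart\tEnd\tStrand"
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:
            continue  # skip blank or truncated rows
        name = f[REP_NAME]
        if per_locus:
            seen[name] += 1
            gene_id = f"{name}_dup{seen[name]}"
        else:
            gene_id = name
        start = int(f[START]) + 1  # rmsk starts are 0-based; SAF is 1-based
        yield "\t".join([gene_id, f[CHROM], str(start), f[END], f[STRAND]])


def write_saf(rmsk_path, saf_path, per_locus=True):
    """Convert a gzipped rmsk.txt.gz file into a SAF file on disk."""
    with gzip.open(rmsk_path, "rt") as src, open(saf_path, "w") as out:
        for row in rmsk_to_saf(src, per_locus):
            out.write(row + "\n")
```

After something like `write_saf("rmsk.txt.gz", "rmsk.saf")`, the file can be passed to featureCounts with `-F SAF -a rmsk.saf`. Whether to count per locus or per element name is a design choice: per-locus IDs avoid duplicated identifiers, while sharing the repeat name groups all copies of an element together.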