Repetitive sequences analysis
5 weeks ago

I am trying to analyse RNA-seq data and want to quantify reads using a RepeatMasker GTF annotation file, but I have found the same problem in every RepeatMasker annotation file from UCSC: each file contains duplicated identifiers (transcript_id, for example). I do not know how to deal with this. I read in another post that this file (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz) does not have the problem. However, I have tried many ways of converting it into a format that the featureCounts function accepts, and I have not managed to do it. I would appreciate any new ideas or alternatives.

Repetitive sequences • 431 views
5 weeks ago
JC 13k

You can convert the RepeatMasker output (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz) to a valid GTF with a short script, but I suspect featureCounts will have trouble assigning reads, since reads from repeats tend to map to multiple locations.
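
A rough sketch of what such a script might look like, in Python. It assumes the usual RepeatMasker .out layout (header lines, then whitespace-separated columns with the query sequence in column 5, begin/end in columns 6 and 7, strand as + or C in column 9, repeat name in column 10, class/family in column 11). The function name rm_out_to_gtf, the "_dup" suffix and the class_family attribute are my own choices for illustration, not an established convention. Appending a per-name counter is one way to get a unique transcript_id for each locus.

```python
from collections import defaultdict


def rm_out_to_gtf(lines):
    """Yield GTF lines from RepeatMasker .out lines (hypothetical helper)."""
    seen = defaultdict(int)  # number of copies seen so far for each repeat name
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines (their first field is not a number).
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        score, chrom, start, end = fields[0], fields[4], fields[5], fields[6]
        # RepeatMasker writes "C" (complement) for the minus strand.
        strand = "-" if fields[8] == "C" else "+"
        name, class_family = fields[9], fields[10]
        seen[name] += 1
        locus_id = f"{name}_dup{seen[name]}"  # unique per locus (assumed naming)
        attrs = (
            f'gene_id "{name}"; transcript_id "{locus_id}"; '
            f'class_family "{class_family}";'
        )
        # Feature type "exon" is used because featureCounts counts exons by default.
        yield "\t".join(
            [chrom, "RepeatMasker", "exon", start, end, score, strand, ".", attrs]
        )


def convert(out_path, gtf_path):
    """Read a gzipped .out file and write the GTF (paths are up to you)."""
    import gzip

    with gzip.open(out_path, "rt") as src, open(gtf_path, "w") as dst:
        for gtf_line in rm_out_to_gtf(src):
            dst.write(gtf_line + "\n")
```

With gene_id set to the repeat name, counts would be summarised per element name rather than per locus; swap the two attributes if you want per-locus counts. This does nothing about the multi-mapping issue itself.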

5 weeks ago
rfran010 ★ 1.4k

I recommend the Hammell lab GTF files:

https://www.dropbox.com/scl/fo/jdpgn6fl8ngd3th3zebap/ALDQ94uFrf3r1QM1zoT9jHU/TEtranscripts?dl=0&rlkey=41oz6ppggy82uha5i3yo1rnlx&subfolder_nav_tracking=1

They are basically rmsk reformatted into a more logical GTF: the gene id is the element name and the individual loci are the transcript ids. They also keep track of "family" and "class".

Otherwise, I recommend reformatting the rmsk into SAF format.
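
For the SAF route, something along these lines might work as a starting point. It assumes the UCSC rmsk table layout (tab-separated, with genoName, genoStart, genoEnd, strand and repName in columns 6, 7, 8, 10 and 11, and a 0-based genoStart), and writes the five SAF columns GeneID, Chr, Start, End, Strand with 1-based starts. The function names and the per_locus option are hypothetical, just for illustration.

```python
import gzip

SAF_HEADER = ["GeneID", "Chr", "Start", "End", "Strand"]


def rmsk_rows_to_saf(rows, per_locus=False):
    """Yield SAF records (lists of strings) from rmsk rows (lists of fields)."""
    yield SAF_HEADER
    for row in rows:
        chrom, start0, end = row[5], int(row[6]), int(row[7])
        strand, name = row[9], row[10]
        start1 = start0 + 1  # UCSC starts are 0-based; SAF starts are 1-based
        # Rows sharing a GeneID are counted together, so use the repeat name for
        # per-element counts or a coordinate-tagged ID (assumed format) per locus.
        gene_id = f"{name}|{chrom}:{start1}-{end}" if per_locus else name
        yield [gene_id, chrom, str(start1), str(end), strand]


def convert(rmsk_path, saf_path, per_locus=False):
    """Read a gzipped rmsk.txt.gz and write a tab-delimited SAF file."""
    with gzip.open(rmsk_path, "rt") as src, open(saf_path, "w") as dst:
        rows = (line.rstrip("\n").split("\t") for line in src)
        for record in rmsk_rows_to_saf(rows, per_locus=per_locus):
            dst.write("\t".join(record) + "\n")
```

The resulting file would then be passed to featureCounts with -F SAF.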
