How long does it take to run RepeatMasker on a full genome?
8.4 years ago
CAnna ▴ 20

Hi,

I am currently running a program that requires a transposable element annotation in GTF format. Normally the RepeatMasker tables from UCSC are used for this. I am working with a new assembly that has no RepeatMasker table available yet, so I am running RepeatMasker on the entire genome (Rhesus macaque).

Could anyone who has done this before tell me roughly how long it can take? I am running it with the "slow" option.

Thank you, CAnna

Assembly • 5.7k views

The last time I ran that I think it took a week or two to finish...and that was after splitting it by chromosome.


Oh, I did not expect it to take so long! Splitting by chromosome is a good strategy, though. So you split the sequence by chromosome, run rmsk on each piece, and then join the masked sequences and rmsk tables afterwards, right?

Sorry, I'm fairly new to this. What kind of tool can I use to split the FASTA sequence by chromosome? samtools, maybe?


Yup, exactly. I think there are faster alternatives to RepeatMasker these days, though I've been lucky enough not to need this in years (not since just after mm10 came out, when it hadn't been repeat-masked yet).
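
For the splitting step, here is a minimal Python sketch, assuming a plain multi-FASTA file in which each chromosome is one record. The function name `split_fasta` and the `.fa` output naming are made up for illustration, and records with identical names would overwrite each other. `samtools faidx` can also pull out a single sequence by name if you prefer that route.

```python
import os
import re


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: close the previous output file, if any.
                if out is not None:
                    out.close()
                # Use the first word of the header as the sequence name.
                fields = line[1:].split()
                name = fields[0] if fields else "unnamed"
                # Replace characters that are unsafe in file names.
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", name)
                path = os.path.join(out_dir, safe + ".fa")
                paths.append(path)
                out = open(path, "w")
            # Lines before the first header (if any) are ignored.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Calling `split_fasta("MacaM_Rhesus_Genome_v7.fasta", "chroms")` would leave one file per sequence in `chroms/`, and RepeatMasker can then be run on each file separately.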


OK, good. Thank you! CAnna


@Devon Ryan, which alternatives were you referring to? I would be very interested to find out, as my job has already been running for too long! (Genome size: 870 Mb) :)


Roughly how big is the genome in Mbp? Which species setting are you using? Is it "all"?


Hi, thank you for your reply. The genome is about 2818 Mbp long. I set the species to "macaca mulatta". Here is the command:

RepeatMasker -species "macaca mulatta" -s -par 10 MacaM_Rhesus_Genome_v7.fasta

I was wondering whether that is too specific; maybe I should use "primate" instead.


Perhaps downloading the already masked version from UCSC would be a preferable option?


For those of us who map to the Ensembl version of the genome, it has to be done manually, unfortunately.


No, you don't need to. Ensembl provides two kinds of repeat-masked versions. Here is the human genome: https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

If you are looking for the Rhesus version: https://ftp.ensembl.org/pub/current_fasta/macaca_mulatta/dna/

* 'dna_rm' - masked genomic DNA. Interspersed repeats and low
  complexity regions are detected with the RepeatMasker tool and masked
  by replacing repeats with 'N's.
* 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions
  have been replaced with lowercased versions of their nucleic base.
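
To make the soft-masking convention concrete, here is a small Python sketch that lists the lowercase runs in a sequence string as BED-style intervals (0-based start, end-exclusive). The function name `softmasked_intervals` is hypothetical. Note that this only recovers the positions of masked regions, not repeat names or classes, so it is not a substitute for a full RepeatMasker table.

```python
import re


def softmasked_intervals(seq_name, sequence):
    """Return (name, start, end) tuples for each run of lowercase bases."""
    return [
        (seq_name, match.start(), match.end())
        for match in re.finditer(r"[a-z]+", sequence)
    ]
```

For example, `softmasked_intervals("chr1", "ACGTacgtNNacGT")` reports two masked runs, one covering positions 4 to 8 and one covering positions 10 to 12.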

Thank you GenoMax, you are correct. But if one specifically needs GTF files (as required by the Velocyto single-cell velocity inference pipeline), Ensembl doesn't provide them, unfortunately...
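
If you do end up running RepeatMasker yourself, the `.out` table can be turned into GTF with a few lines of Python. This is only a sketch under stated assumptions: it expects the standard whitespace-separated `.out` layout (score, three percentages, query name, begin, end, left, strand, repeat name, class/family, ...), and the `exon` feature type, the `RepeatMasker` source label, and the `class_id` attribute are my own choices for illustration, not something a particular downstream tool is known to require. The function name `rmsk_out_to_gtf` is hypothetical.

```python
def rmsk_out_to_gtf(lines):
    """Convert lines of a RepeatMasker .out table into GTF lines."""
    gtf = []
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        score, chrom, start, end = fields[0], fields[4], fields[5], fields[6]
        # RepeatMasker marks the complement strand with "C".
        strand = "-" if fields[8] == "C" else "+"
        repeat_name, repeat_class = fields[9], fields[10]
        # Attribute names beyond gene_id/transcript_id are an assumption.
        attributes = (
            f'gene_id "{repeat_name}"; transcript_id "{repeat_name}"; '
            f'class_id "{repeat_class}";'
        )
        gtf.append(
            "\t".join(
                [chrom, "RepeatMasker", "exon", start, end, score, strand, ".", attributes]
            )
        )
    return gtf
```

Both the `.out` table and GTF use 1-based inclusive coordinates, so the start and end columns are copied unchanged. After a per-chromosome run, the per-file results can be converted one by one and concatenated.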
