8.4 years ago
CAnna
Hi,
I am currently running a program that requires a transposable element annotation in GTF format. Normally, the RepeatMasker tables from UCSC are used for this. I am using a new assembly that has no RepeatMasker table available yet, so I am running RepeatMasker on the entire genome (rhesus macaque).
Could anyone who has done this before tell me roughly how long it can take? I am running it with the "slow" option (-s).
Thank you, CAnna
The last time I ran that, I think it took a week or two to finish... and that was after splitting it by chromosome.
Oh, I did not expect it to take so long! Splitting by chromosome is a good strategy, though. So you split your sequence by chromosome, run RepeatMasker on each one, and then join the masked sequences and the rmsk tables afterwards, right?
Sorry, I'm kind of new to this. What kind of tool can I use to split the FASTA sequence by chromosome? samtools, maybe?
Yup, exactly. I think there are faster alternatives to RepeatMasker these days, though I've been lucky enough not to need to do this in years (the last time was just after mm10 came out, when it hadn't been repeat-masked yet).
Ok good, Thank you! CAnna
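For the splitting question above, here is a minimal sketch in plain Python that writes each FASTA record to its own file. The function name `split_fasta_by_record` and the `.fa` naming scheme are only illustrative assumptions, not something from this thread. `samtools faidx genome.fa chr1 > chr1.fa` can also pull out one named sequence at a time.

```python
import os


def split_fasta_by_record(fasta_path, out_dir):
    """Write every FASTA record to its own file and return the paths written.

    Hypothetical helper: files are named after the first word of each header.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: close the previous output file, if any.
                if out is not None:
                    out.close()
                parts = line[1:].split()
                name = parts[0] if parts else "record%d" % (len(written) + 1)
                # Replace characters that are awkward in file names.
                safe = name.replace("/", "_").replace("|", "_")
                path = os.path.join(out_dir, safe + ".fa")
                out = open(path, "w")
                written.append(path)
            # Lines before the first header (if any) are ignored.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling `split_fasta_by_record("MacaM_Rhesus_Genome_v7.fasta", "chroms")` would then give one file per chromosome or scaffold. Assemblies with thousands of small scaffolds may be better grouped into a few batches rather than one file each.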
@Devon Ryan which alternatives were you talking about? Would be very interested to find out, as my job has currently been running for too long! (Genome size 870Mb) :)
Roughly how big is the genome in Mbp? What species setting are you using, is it "all"?
Hi, thank you for your reply. The genome is about 2818 Mbp long. I set the species to "macaca mulatta". Here is the command:
RepeatMasker -species "macaca mulatta" -s -par 10 MacaM_Rhesus_Genome_v7.fasta
I was wondering whether that is even too specific; maybe I should use "primate" instead.
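If the genome has been split into per-chromosome FASTA files, as suggested earlier in the thread, the same command can be run once per file. The sketch below only builds the command lines and runs them one after another. The helper names, the `rmsk_out` directory and the file names are assumptions for illustration. It uses `-pa` (the documented spelling of the parallel option) and `-dir` to collect the output in one place.

```python
import os
import subprocess


def build_repeatmasker_commands(fasta_paths, species="macaca mulatta",
                                threads=10, out_dir="rmsk_out"):
    """Return one RepeatMasker argument list per input FASTA file."""
    commands = []
    for path in fasta_paths:
        commands.append([
            "RepeatMasker",
            "-species", species,   # same species setting as in the post above
            "-s",                  # slow (more sensitive) search
            "-pa", str(threads),   # number of parallel jobs
            "-dir", out_dir,       # write all output files to this directory
            path,
        ])
    return commands


def run_all(commands, out_dir="rmsk_out"):
    """Run the commands sequentially; stops at the first failure."""
    os.makedirs(out_dir, exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

On a cluster, each command would more likely be submitted as its own job so that the chromosomes run at the same time; the list returned by `build_repeatmasker_commands` can be written out for that purpose.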
Perhaps downloading the already masked version from UCSC would be a preferable option?
For those of us who map to the Ensembl version of the genome, it has to be done manually, unfortunately.
No, you don't need to. Ensembl has two kinds of repeat-masked versions available (soft-masked, dna_sm, and hard-masked, dna_rm). Here is the human genome: https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
If you are looking for the Rhesus version: https://ftp.ensembl.org/pub/current_fasta/macaca_mulatta/dna/
Thank you GenoMax, you are correct, but if one specifically needs GTF files (as the Velocyto single-cell velocity inference pipeline requires), Ensembl unfortunately doesn't provide them...
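When a GTF is needed and only RepeatMasker's own `.out` table is available, the conversion can be scripted. The following is a rough sketch, assuming the standard whitespace-separated `.out` columns (score, divergence, deletions, insertions, query sequence, begin, end, left, strand, repeat name, class/family, three repeat-position columns, ID). The attribute names `gene_id`, `transcript_id`, `family_id` and `class_id` and the `exon` feature type are assumptions; check what the downstream tool actually expects. RepeatMasker also has a `-gff` option that writes a GFF file directly, which may be enough for some tools.

```python
def rmsk_out_to_gtf_lines(out_lines, source="RepeatMasker"):
    """Convert lines of a RepeatMasker .out file into GTF-formatted strings."""
    gtf = []
    for line in out_lines:
        fields = line.split()
        # Skip blank lines and the header lines (first field is not a score).
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        score, seqname, start, end = fields[0], fields[4], fields[5], fields[6]
        # RepeatMasker marks the reverse strand with "C" (complement).
        strand = "-" if fields[8] == "C" else "+"
        repeat_name = fields[9]
        # "LINE/L1" -> class LINE, family L1; "Simple_repeat" has no family part.
        rep_class, _, rep_family = fields[10].partition("/")
        if not rep_family:
            rep_family = rep_class
        hit_id = fields[14]
        attributes = (
            'gene_id "%s"; transcript_id "%s_%s"; family_id "%s"; class_id "%s";'
            % (repeat_name, repeat_name, hit_id, rep_family, rep_class)
        )
        # .out coordinates are 1-based and inclusive, as GTF expects.
        gtf.append("\t".join(
            [seqname, source, "exon", start, end, score, strand, ".", attributes]
        ))
    return gtf
```

A typical use would be `rmsk_out_to_gtf_lines(open("genome.fasta.out"))`, writing the returned lines to a `.gtf` file. Note that one repeat ID can span several fragments, so `transcript_id` values built this way are not guaranteed to be unique per line.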