Hello,
I'm trying to run RepeatMasker on an Amazon EC2 machine. The files run fine with the standard settings (a single sequence at a time), but that will take 20 hours. I'm analyzing huge files, which I know isn't optimal with RepeatMasker, but I'm not aware of any other options for retroelement identification and quantification. I see there has been a similar post before, but it wasn't resolved.
I tried the following command:
RepeatMasker -pa 32 S5.fa
It starts off great. Then, just after the refining SINE/ALU step, I get the "can't fork" error and it dies. I'm running this on a c3.8xlarge with Ubuntu. I also tried -pa 16, 10, 8, and 6. At 6 I'm able to proceed, and I've switched from the c3.8xlarge to an r3.2xlarge. I'm 50 batches into 5000, with no fork error yet.
Any thoughts on why I'm limited to 6 processors? I'm beyond excited to bump up my analysis speed 6x, but I'm also worried that a few hours in I'll get a fork error and have to start from scratch using the standard settings.
According to top, I'm using 50-70% CPU and ~13.2% memory.
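Whatever the cause of the fork error turns out to be, one way to avoid restarting from scratch is to split the input into smaller FASTA files and run RepeatMasker on each one separately, so a crash only costs the chunk in progress. The following is a minimal Python sketch, assuming a plain FASTA file in which every record starts with a ">" header line. The function name split_fasta, the chunk size, and the chunk file naming are hypothetical choices for illustration and are not part of RepeatMasker.

```python
import os


def split_fasta(path, out_dir, records_per_chunk=100000):
    """Split a FASTA file into files of at most records_per_chunk records.

    Returns the list of chunk file paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None   # handle of the chunk currently being written
    count = 0    # number of records already in the current chunk
    with open(path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Open a new chunk for the first record, or when the current chunk is full.
                if out is None or count >= records_per_chunk:
                    if out is not None:
                        out.close()
                    chunk_path = os.path.join(
                        out_dir, "chunk_%05d.fa" % len(chunk_paths)
                    )
                    chunk_paths.append(chunk_path)
                    out = open(chunk_path, "w")
                    count = 0
                count += 1
            # Any lines before the first header are skipped (no chunk is open yet).
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk could then be passed to RepeatMasker in turn (for example with the -pa 6 setting that is currently working), and chunks that have already finished can be skipped if the run has to be restarted.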
What do you mean by "huge files"? Is it a draft assembly with thousands of scaffolds, or millions of unassembled WGS reads?
Sorry I wasn't more explicit; I'm very new to NGS. I've taken a MiSeq run, trimmed adaptors, converted from FASTQ to FASTA, and am running the FASTA through RepeatMasker. So about 2 million sequences.
The reads were SE 150, so they run around 130 bp with adaptors trimmed.
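For reference, the FASTQ-to-FASTA conversion step mentioned above can be sketched in a few lines of Python. This assumes the common four-lines-per-record FASTQ layout (sequence and quality lines are not wrapped). The function name fastq_to_fasta is hypothetical, and this is not necessarily the tool that was used for the data described here.

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Convert a four-lines-per-record FASTQ file to FASTA.

    Returns the number of records written.
    """
    n = 0
    with open(fastq_path) as fq, open(fasta_path, "w") as fa:
        while True:
            header = fq.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records or at the end
            if not header.startswith("@"):
                raise ValueError("expected a FASTQ header line, got: %r" % header)
            seq = fq.readline()
            fq.readline()  # '+' separator line, not needed for FASTA
            fq.readline()  # quality line, not needed for FASTA
            # FASTA headers start with '>' instead of '@'.
            fa.write(">" + header[1:].rstrip("\n") + "\n")
            fa.write(seq.rstrip("\n") + "\n")
            n += 1
    return n
```

Quality scores are discarded in the conversion, so the original FASTQ file is worth keeping if the reads will be needed for anything else.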
Thanks for the information. I wanted to make sure before I answered.