Entering edit mode
6.5 years ago
Chvatil
▴
130
Hi everyone, I'm actually using MaSuRCA-3.2.6 to assemble my genome and a ran the fallowing script:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=8:bigmem,mem=100gb
#PBS -e /pandata/ACG-0006_0027/LOGS/ACG-006_assembly.error
#PBS -o /pandata/ACG-0006_0027/LOGS/ACG-006_assembly.out
#PBS -N ACG-006
#PBS -q q1week
DATA
PE= pe 150 22 /pandata/LEPIWASP/ACG-0006_0027/frag_1.fastq /pandata/LEPIWASP/ACG-0006_0027/frag_2.fastq
END
PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#coverage by the longest Long reads to use
LHE_COVERAGE=30
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 16
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END
Then, I got the asemble.sh file and I ran it as well and got the following .out:
[Sat Jun 16 22:32:45 CEST 2018] Processing pe library reads
[Sat Jun 16 22:49:04 CEST 2018] Average PE read length 150
[Sat Jun 16 22:49:05 CEST 2018] Using kmer size of 49 for the graph
[Sat Jun 16 22:49:06 CEST 2018] MIN_Q_CHAR: 33
WARNING: JF_SIZE set too low, increasing JF_SIZE to at least 1115876884, this automatic increase may be not enough!
[Sat Jun 16 22:49:06 CEST 2018] Creating mer database for Quorum
[Sat Jun 16 23:09:23 CEST 2018] Error correct PE.
[Sat Jun 16 23:11:49 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.
and .error:
/panhome/TOOLS/MaSuRCA-3.2.6/assemble.sh: line 102: 46750 Aborted quorum_error_correct_reads -q $((MIN_Q_CHAR + 40)
) --contaminant=/panhome/TOOLS/MaSuRCA-3.2.6/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 16 -w 10 -e 3 -M quorum_mer_db.jf pe.re
named.fastq --no-discard -o pe.cor.tmp --verbose > quorum.err 2>&1
Does someone have an idea of what is going on here? Thanks for your help.
The 2 fasta files are comming from an illumina Hiseq 3000 150bp and the genome size of my specie is around 1.5 GB.
Also posted on SE: https://stackoverflow.com/questions/50891966/issue-using-masurca-3-2-6-assembler
Although it was done automatically, you should nevertheless increase the
JF_SIZE
value according to the comment by the authors:It was automatically changed to 1115876884 - set it to something higher than that. You're requesting a lot of memory, so, make the most of it. If the genome size is 1.5 gigabase, then multiple that by your target dept of coverage. 1115876884, the value to which the
JF_SIZE
was changed, is only 1.1 billionThis may have caused the subsequent error because your read error correction failed.
Ok I'm trying with JF_SIZE = 25500000000 thank you :)
I tried with this JF_SIZE and got the same thing:
.error:
.out
Can you check the pe.cor.log file that it mentions?
There is no pe.cor.log produced. A Google research on that issue was also not very helpful.
Sure? Is it not in any hidden directory, perhaps?
Actually, I believe you, this issue has been reported elsewhere with no help from anyone:
I was just about to tell you to contact the developers when I found this:
Can you try that?
Yep; I saw it to and it gave me the correct thing:
It is really weird
So, it works now?
No, I mean I did nothing, the frag_.fastaq files were already in
text/plain; charset=us-ascii
Sorry, I am not sure what could be happening (and the question has gone unanswered on other sites, as you have seen).
I actually used MaSuRCA 3.2.2 relatively recently (2017) and it worked fine, but on a much smaller bacterial genome. Here is my config file:
Is there any way of looking at the PBS logs to see if the memory limit was reached?
I posted the PBS log .error and log.out files juste above (#4 post). I'll try with your settings thank you.
Okay, apologies that I could not assist in this case. I would contact the developers, but chances are that they have metaphorically already 'flown the coup'.
Yes I think so too, I tried your settings and got the same thing. Maybe you have another program to advice me for my kind of data and if possible easy to use? I tried ALLPATHS-LG but I have not the data for.
What about Trinity?
But Trinity is for RNA sequence and I have a DNA one
Apologies for the oversight - I am involved in many threads on this website, some with slightly overlapping themes. My other recommendation for genome assembly would be ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss