Issue using MaSuRCA-3.2.6
0
0
Entering edit mode
6.5 years ago
Chvatil ▴ 130

Hi everyone, I'm actually using MaSuRCA-3.2.6 to assemble my genome and a ran the fallowing script:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=8:bigmem,mem=100gb
#PBS -e /pandata/ACG-0006_0027/LOGS/ACG-006_assembly.error
#PBS -o /pandata/ACG-0006_0027/LOGS/ACG-006_assembly.out
#PBS -N ACG-006
#PBS -q q1week


DATA
PE= pe 150 22 /pandata/LEPIWASP/ACG-0006_0027/frag_1.fastq /pandata/LEPIWASP/ACG-0006_0027/frag_2.fastq

END

PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#coverage by the longest Long reads to use
LHE_COVERAGE=30
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms 
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically. 
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used.  one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 16
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END

Then, I got the asemble.sh file and I ran it as well and got the following .out:

 [Sat Jun 16 22:32:45 CEST 2018] Processing pe library reads
    [Sat Jun 16 22:49:04 CEST 2018] Average PE read length 150
    [Sat Jun 16 22:49:05 CEST 2018] Using kmer size of 49 for the graph
    [Sat Jun 16 22:49:06 CEST 2018] MIN_Q_CHAR: 33
    WARNING: JF_SIZE set too low, increasing JF_SIZE to at least 1115876884, this automatic increase may be not enough!
    [Sat Jun 16 22:49:06 CEST 2018] Creating mer database for Quorum
    [Sat Jun 16 23:09:23 CEST 2018] Error correct PE.
    [Sat Jun 16 23:11:49 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.

and .error:

/panhome/TOOLS/MaSuRCA-3.2.6/assemble.sh: line 102: 46750 Aborted                 quorum_error_correct_reads -q $((MIN_Q_CHAR + 40)
) --contaminant=/panhome/TOOLS/MaSuRCA-3.2.6/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 16 -w 10 -e 3 -M quorum_mer_db.jf pe.re
named.fastq --no-discard -o pe.cor.tmp --verbose > quorum.err 2>&1

Does someone have an idea of what is going on here? Thanks for your help.

The 2 fasta files are comming from an illumina Hiseq 3000 150bp and the genome size of my specie is around 1.5 GB.

Assembly assembler • 5.0k views
ADD COMMENT
0
Entering edit mode

Also posted on SE: https://stackoverflow.com/questions/50891966/issue-using-masurca-3-2-6-assembler

Although it was done automatically, you should nevertheless increase the JF_SIZE value according to the comment by the authors:

...a safe value is estimated_genome_size*estimated_coverage

It was automatically changed to 1115876884 - set it to something higher than that. You're requesting a lot of memory, so, make the most of it. If the genome size is 1.5 gigabase, then multiple that by your target dept of coverage. 1115876884, the value to which the JF_SIZE was changed, is only 1.1 billion

This may have caused the subsequent error because your read error correction failed.

ADD REPLY
0
Entering edit mode

Ok I'm trying with JF_SIZE = 25500000000 thank you :)

ADD REPLY
0
Entering edit mode

I tried with this JF_SIZE and got the same thing:

.error:

line 102: 25712 Aborted                 quorum_error_correct_reads -q $((MIN_Q_CHAR + 40)
) --contaminant=/panhome/bguinet/TOOLS/MaSuRCA-3.2.6/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 16 -w 10 -e 3 -M quorum_mer_db.jf pe.re
named.fastq --no-discard -o pe.cor.tmp --verbose > quorum.err 2>&1

.out

[Sun Jun 17 11:40:30 CEST 2018] Processing pe library reads
[Sun Jun 17 11:50:47 CEST 2018] Average PE read length 150
[Sun Jun 17 11:50:47 CEST 2018] Using kmer size of 49 for the graph
[Sun Jun 17 11:50:48 CEST 2018] MIN_Q_CHAR: 33
[Sun Jun 17 11:50:48 CEST 2018] Creating mer database for Quorum
[Sun Jun 17 12:19:01 CEST 2018] Error correct PE.
[Sun Jun 17 12:35:01 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.
ADD REPLY
0
Entering edit mode

Can you check the pe.cor.log file that it mentions?

ADD REPLY
0
Entering edit mode

There is no pe.cor.log produced. A Google research on that issue was also not very helpful.

ADD REPLY
0
Entering edit mode

Sure? Is it not in any hidden directory, perhaps?

Actually, I believe you, this issue has been reported elsewhere with no help from anyone:

I was just about to tell you to contact the developers when I found this:

Can you try that?

ADD REPLY
0
Entering edit mode

Yep; I saw it to and it gave me the correct thing:

/pandata/LEPIWASP/ACG-0006_0027$ file -b -i frag_1.fastq
text/plain; charset=us-ascii
/pandata/LEPIWASP/ACG-0006_0027$ file -b -i frag_2.fastq
text/plain; charset=us-ascii

It is really weird

ADD REPLY
0
Entering edit mode

So, it works now?

ADD REPLY
0
Entering edit mode

No, I mean I did nothing, the frag_.fastaq files were already in text/plain; charset=us-ascii

ADD REPLY
0
Entering edit mode

Sorry, I am not sure what could be happening (and the question has gone unanswered on other sites, as you have seen).

I actually used MaSuRCA 3.2.2 relatively recently (2017) and it worked fine, but on a much smaller bacterial genome. Here is my config file:

#############################
#CONFIGURATION FILE CONTENTS#
#############################
DATA
    PE = MA 130 50 R1.fq.gz R2.fq.gz

    PE = MB 130 50 R2.fq.gz R2.fz.gz

    PE = PA 130 50 R1.fq.gz R2.fq.gz

    PE = PB 130 50 R1.fq.gz R2.fq.gz
END

####

PARAMETERS
    #set it to the number of cores in the computer to be used for assembly
    NUM_THREADS=16

    #jellyfish hash size, set this to about 10x the genome size.
    JF_SIZE=36400000

    #most of the paired end reads end up in the same super read and thus are not passed to the assembler. Those that do not end up in the same super read are called "linking mates". The best assembly results are achieved by setting this parameter to 1 for Illumina-only assemblies. If you have more than 2x coverage by long (454, Sanger, etc) reads, set this to 0.
    USE_LINKING_MATES=1

    #This is the kmer size to be used for super reads. “auto” is the safest choice. Only advanced users should modify this parameter.
    GRAPH_KMER_SIZE=auto

    #in some cases (especially for bacterial assemblies) the jumping library has too much coverage which confuses the assembler. By setting this parameter you can have assembler down-sample the jumping library to 60x (from above) coverage. For bigger eukaryotic genomes you can set this parameter to 300.
    LIMIT_JUMP_COVERAGE=60

    #these are the additional parameters to Celera Assembler, and they should only be supplied/modified by advanced users. "ovlMerSize=30 cgwErrorRate=0.25 ovlMemory=4GB" should be used for bacterial assemblies; "ovlMerSize=30 cgwErrorRate=0.15 ovlMemory=4GB" should be used for all other genomes
    #CA_PARAMETERS = cgwErrorRate=0.25

    #scaffolding done by SOAPdenovo2 instead of CABOG. This will decrease assembly 
    #Set this to 1 if you would like to perform contigging and
    SOAP_ASSEMBLY=0
END
#####
#END#
#####

Is there any way of looking at the PBS logs to see if the memory limit was reached?

ADD REPLY
0
Entering edit mode

I posted the PBS log .error and log.out files juste above (#4 post). I'll try with your settings thank you.

ADD REPLY
0
Entering edit mode

Okay, apologies that I could not assist in this case. I would contact the developers, but chances are that they have metaphorically already 'flown the coup'.

ADD REPLY
0
Entering edit mode

Yes I think so too, I tried your settings and got the same thing. Maybe you have another program to advice me for my kind of data and if possible easy to use? I tried ALLPATHS-LG but I have not the data for.

ADD REPLY
0
Entering edit mode

What about Trinity?

ADD REPLY
0
Entering edit mode

But Trinity is for RNA sequence and I have a DNA one

ADD REPLY
0
Entering edit mode

Apologies for the oversight - I am involved in many threads on this website, some with slightly overlapping themes. My other recommendation for genome assembly would be ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss

ADD REPLY

Login before adding your answer.

Traffic: 1820 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6