Hello! I'm having a difficult time figuring out why my MaSuRCA run keeps crashing. I've run it twice now and each run has lasted at least 5 days. The first time I qdel-ed the job myself due to a mismatch between the 'THREAD' and 'ppn'. The second time it crashed on its own, and I got no output file telling me what went wrong.
The files that were generated include: combined_0, cutoff.txt, environment.sh, error_correct.log, meanAndStdevByPrefix.pe.txt, pa.renamed.fastq, pe.cor.fa, and pe_data.tmp.
I cross-referenced these with this source (http://www.genome.umd.edu/docs/MaSuRCA_QuickStartGuide.pdf) to make sure that I wasn't missing anything, but I didn't find anything useful. Can I learn anything about my run from these files? If not, what should my next step be?
Here are the contents of my config.txt file:
DATA
PE = pa 500 75 /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz /myPath/GSF1092-P1-ampc_S14_R2_001.fastq.gz
END
PARAMETERS
GRAPH_KMER_SIZE=auto
USE_LINKING_MATES=1
NUM_THREADS=32
JF_SIZE=22500000000
DO_HOMOPOLYMER_TRIM=0
END
And my qsub:
#!/bin/bash --login
#PBS -N masurca_qsub
#PBS -j oe
#PBS -m abe
#PBS -M email
#PBS -q default
#PBS -l nodes=1:ppn=32
workdir=myPath2
cd $workdir
./assemble.sh
Thanks in advance!
The error message or log is needed to determine the cause. Most crashes in de novo assemblies are due to insufficient RAM; can you manually set the k-mer size to a lower value than the one currently used and give it a try? Is your genome size ~2.25 Gb? Jellyfish itself might crash at the start due to insufficient RAM. Without the error log, nothing can be certain.
It looks like a RAM problem: you have not even generated the Jellyfish output yet. You should check "error_correct.log".
Thank you both, that is very helpful! I've checked the error_correct.log and noticed that most of the content is that it had "skipped pa(some number): no high quality mer". Occasionally it will say "skipped pa(some number): contaminated read". Are these things that you would expect for a RAM issue? Again I tried to research this problem myself, but there isn't much information out there that says what these mean.
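If the log is large, a rough tally of the skip reasons can show how much of the data is being discarded and why. The following is only a sketch: it assumes the lines look like "skipped <read name>: <reason>", as paraphrased above, and the function name tally_skip_reasons is made up for illustration. The real log format may differ.

```python
from collections import Counter


def tally_skip_reasons(lines):
    """Count skip reasons in lines shaped like 'skipped <read>: <reason>'.

    The line format is an assumption based on the messages quoted in this
    thread; lines that do not start with 'skipped' are ignored.
    """
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("skipped"):
            continue
        # Everything after the first ':' is treated as the reason text.
        _, sep, reason = line.partition(":")
        if sep:
            reasons[reason.strip()] += 1
    return reasons


# Example usage (the file name is taken from the list of generated files):
# with open("error_correct.log") as handle:
#     for reason, count in tally_skip_reasons(handle).most_common():
#         print(count, reason)
```

Comparing the totals against the number of input reads gives a sense of whether only a small fraction or most of the library is being thrown away.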
It is not a RAM issue; your data quality does not seem to be good. Did you check the data quality beforehand, and was any pre-processing involved? Checking data quality comes first, then pre-processing, followed by assembly.
I did pre-processing of my mate pairs via Trimmomatic, but someone recommended to me that I not trim my data as masurca has a built-in error correction. Should I rerun masurca with the trimmed data?
I have to ask: is that mate-pair or paired-end data? Are you trying to run the assembly directly on mate-pair data? You need to check the duplication rates of your reads first, and it is also worth checking other quality metrics such as overrepresented sequences. As the MaSuRCA log already suggests, there seems to be contamination too.
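For a first impression of the duplication rate, something like the sketch below may help. It simply measures the fraction of reads whose exact sequence has already been seen in the first max_reads records of one FASTQ file. This is a crude stand-in for what a dedicated QC tool such as FastQC reports, not a replacement; the function name and the max_reads default are assumptions chosen for illustration.

```python
import gzip


def duplication_rate(path, max_reads=1_000_000):
    """Return the fraction of exact-duplicate sequences among sampled reads.

    Assumes a standard four-line-per-record FASTQ file; .gz files are
    opened with gzip. Only the first max_reads records are examined.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    seen = set()
    total = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 1:  # the sequence is the second line of each record
                continue
            if total >= max_reads:
                break
            total += 1
            seen.add(line.strip())
    if total == 0:
        return 0.0
    return 1 - len(seen) / total


# Example usage (the path is a placeholder):
# print(duplication_rate("/myPath/reads_R1.fastq.gz"))
```

Because it only looks at one file and only at exact matches, the number is best read as a lower-bound hint rather than a proper library duplication estimate.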
I'm almost certain that my data is mate-pair, but I'm not entirely certain due to the lack of information provided when I got this research project. How can I distinguish between mate-pair and paired-end?
A quick way to find out whether your data are mate-pair reads is to search for the junction adapters:
Circularized Duplicate Junction Adapter
CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
Circularized Single Junction Adapter
CTGTCTCTTATACACATCT
Circularized Single Junction Adapter Reverse Complement
AGATGTGTATAAGAGACAG
Use 'grep "one of the above adapter sequences" reads_file' to see whether your reads contain a mate-pair library adapter (for gzipped files, decompress first or use zgrep). If your data are mate-pair, you will find the adapter sequence.
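The same check can be scripted so that it reads the gzipped FASTQ directly and reports how many reads contain each adapter. This is a minimal sketch under stated assumptions: the adapter sequences are the three listed above, while the function name count_adapter_reads, the dictionary keys, and the max_reads limit are made up for illustration.

```python
import gzip

# Junction adapter sequences copied from the list above.
JUNCTION_ADAPTERS = {
    "duplicate_junction": "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG",
    "single_junction": "CTGTCTCTTATACACATCT",
    "single_junction_rc": "AGATGTGTATAAGAGACAG",
}


def count_adapter_reads(path, max_reads=1_000_000):
    """Count reads containing each junction adapter in a FASTQ file.

    Assumes four lines per record; .gz files are opened with gzip.
    Returns (reads_examined, {adapter_name: reads_containing_it}).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = {name: 0 for name in JUNCTION_ADAPTERS}
    total = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 1:  # the sequence is the second line of each record
                continue
            if total >= max_reads:
                break
            total += 1
            seq = line.strip().upper()
            for name, adapter in JUNCTION_ADAPTERS.items():
                if adapter in seq:
                    counts[name] += 1
    return total, counts


# Example usage (the path is a placeholder):
# total, counts = count_adapter_reads("/myPath/reads_R1.fastq.gz")
# print(total, counts)
```

Note that a read containing the duplicate junction adapter is also counted under both single junction entries, since the long sequence is the two short ones joined together. If essentially no reads contain any of these sequences, that would be consistent with a paired-end library, though trimmed reads or a different mate-pair protocol could give the same result.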
"DATA PE = pa 500 ..." declares your data as paired-end. That is wrong if the data are mate-pair, as you say. Insert sizes for mate-pairs are usually much larger, typically in the kilobase range, while paired-end inserts go up to about 700 bp. Try looking into the insert-size distribution and duplication rates; both are high for mate-pairs.
Library construction differs between mate-pair and paired-end. Based on the insert size of "500", it looks like paired-end. You can either ask the people who sequenced the data or map your reads to a closely related species to estimate the insert size. PS: you can run MaSuRCA with pre-processed data even though the manual advises against it.