MaSuRCA mate pair libraries crashing
8.1 years ago
eischzj12 • 0

Hello! I'm having a difficult time figuring out why my MaSuRCA run keeps crashing. I've run it twice now and each run has lasted at least 5 days. The first time I qdel-ed the job myself due to a mismatch between the 'THREAD' and 'ppn'. The second time it crashed itself and I got no output file telling me what I did wrong.

The files that were generated include: combined_0, cutoff.txt, environment.sh, error_correct.log, meanAndStdevByPrefix.pe.txt, pa.renamed.fastq, pe.cor.fa, and pe_data.tmp.

I cross-referenced these with the MaSuRCA Quick Start Guide (http://www.genome.umd.edu/docs/MaSuRCA_QuickStartGuide.pdf) to make sure that I wasn't missing anything, but I didn't find anything useful. Can I learn anything about my run from these files? If not, what should my next step be?

Here are the contents of my config.txt file:

DATA
PE = pa 500 75 /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz /myPath/GSF1092-P1-ampc_S14_R2_001.fastq.gz
END

PARAMETERS
GRAPH_KMER_SIZE=auto
USE_LINKING_MATES=1
NUM_THREADS=32
JF_SIZE=22500000000
DO_HOMOPOLYMER_TRIM=0
END

And my qsub:

#!/bin/bash --login
#PBS -N masurca_qsub
#PBS -j oe
#PBS -m abe
#PBS -M email
#PBS -q default
#PBS -l nodes=1:ppn=32

workdir=myPath2
cd $workdir

./assemble.sh

Thanks in advance!

masurca mate pair Assembly • 2.7k views

The error message or log is needed to know the reason. Most crashes in de novo assemblies are caused by insufficient RAM. Can you manually set the k-mer size to a value lower than the one currently used and give it a try? Is your genome size ~2.25 Gb? If so, Jellyfish itself might crash at the beginning due to insufficient RAM. Without the error log, nothing can be certain.
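Since the job left no usable error output, one option is to capture everything the assembly script prints, plus its exit status, in a file inside the working directory. The sketch below is only an illustration: `run_logged` and `assemble.log` are made-up names, not part of MaSuRCA or PBS.

```shell
#!/usr/bin/env bash
# run_logged LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND, sends stdout and stderr to LOGFILE, appends the exit
# status to the log, and returns that status to the caller.
run_logged() {
    local log=$1; shift
    local status=0
    "$@" > "$log" 2>&1 || status=$?
    echo "exit status: $status" >> "$log"
    return "$status"
}

# In the PBS script this would replace the bare ./assemble.sh call, e.g.:
#   run_logged assemble.log ./assemble.sh
```

An exit status of 137 (128 + 9) means the process received SIGKILL, which is often what happens when a job runs out of memory or exceeds a scheduler limit, although it is not proof of either.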


It looks like a RAM problem; the run has not even generated the Jellyfish output yet. You should check "error_correct.log".


Thank you both, that is very helpful! I've checked the error_correct.log and noticed that most of the content is that it had "skipped pa(some number): no high quality mer". Occasionally it will say "skipped pa(some number): contaminated read". Are these things that you would expect for a RAM issue? Again I tried to research this problem myself, but there isn't much information out there that says what these mean.
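To see how the skip messages are distributed, the log can be tallied by reason. This is a rough sketch that assumes each relevant line starts with "skipped", followed by the read name, a colon, and the reason, as in the messages quoted above; `summarize_skips` is a made-up helper name.

```shell
#!/usr/bin/env bash
# summarize_skips LOGFILE   (hypothetical helper)
# Prints "<count><TAB><reason>" for each distinct skip reason,
# most frequent first. Assumes lines like:
#   skipped pa12345: no high quality mer
summarize_skips() {
    awk '/^skipped / {
             reason = $0
             sub(/^[^:]*: */, "", reason)   # drop "skipped <read name>: "
             count[reason]++
         }
         END { for (r in count) printf "%d\t%s\n", count[r], r }' "$1" \
        | sort -k1,1nr -k2
}

# Example call:
#   summarize_skips error_correct.log
```

Comparing the totals with the number of reads in pa.renamed.fastq (line count divided by 4) would show what fraction of the library is being discarded.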


It is not a RAM issue; your data quality does not seem to be good. Did you check the data quality beforehand, and was any pre-processing involved? Checking data quality is the first step, then pre-processing, followed by assembly.


I did pre-processing of my mate pairs via Trimmomatic, but someone recommended to me that I not trim my data as masurca has a built-in error correction. Should I rerun masurca with the trimmed data?


I have to ask: is that mate-pair or paired-end data? Are you trying to run the assembly directly on mate-pair data? You need to check the duplication rates of your reads first, and it is also worth checking other quality metrics such as overrepresented sequences. As the MaSuRCA log already suggests, there seems to be contamination too.
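A dedicated QC tool such as FastQC is the usual way to get these metrics. As a quick first look, the fraction of exact duplicate sequences in one read file can be estimated with standard tools. The sketch below is a crude proxy (it ignores pairing and near-duplicates), and `duplicate_fraction` is a made-up helper name.

```shell
#!/usr/bin/env bash
# duplicate_fraction FILE [MAX_READS]   (hypothetical helper)
# Prints the fraction of reads whose sequence was already seen earlier
# in the file. Works on plain or gzipped FASTQ; only the first MAX_READS
# reads (default 1,000,000) are inspected to keep memory use bounded.
duplicate_fraction() {
    local file=$1 max_reads=${2:-1000000}
    gzip -cdf -- "$file" \
        | head -n "$((max_reads * 4))" \
        | awk 'NR % 4 == 2 { n++; if (seen[$0]++) dup++ }
               END { if (n) printf "%.3f\n", dup / n; else print "NA" }'
}

# Example call:
#   duplicate_fraction /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz
```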


I'm almost certain that my data is mate-pair, but I'm not entirely certain due to the lack of information provided when I got this research project. How can I distinguish between mate-pair and paired-end?


Here is a quick way to find out whether your reads are mate-pair: search them for the junction adapter sequences below.

Circularized Duplicate Junction Adapter

CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Circularized Single Junction Adapter

CTGTCTCTTATACACATCT

Circularized Single Junction Adapter Reverse Complement

AGATGTGTATAAGAGACAG

Use 'grep "one of the above adapter sequences" reads_file' to see whether your reads contain a mate-pair library adapter. If your data are mate-pair, you will find the adapter sequence.
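Since the reads in this thread are gzipped, a plain grep on the file will not work directly. A minimal sketch of the same check that decompresses on the fly and reports a count is shown here; `count_junction_reads` is a made-up helper name, and the adapter strings are the ones listed above.

```shell
#!/usr/bin/env bash
# Adapter sequences copied from the list above.
JUNCTION='CTGTCTCTTATACACATCT'      # single junction adapter
JUNCTION_RC='AGATGTGTATAAGAGACAG'   # its reverse complement

# count_junction_reads FILE [MAX_READS]   (hypothetical helper)
# Prints "<reads_with_adapter> <reads_checked>" for a plain or gzipped
# FASTQ file, looking only at the first MAX_READS reads (default 1,000,000).
# The duplicate junction adapter contains both single forms, so it is
# covered by the same two searches.
count_junction_reads() {
    local file=$1 max_reads=${2:-1000000}
    gzip -cdf -- "$file" \
        | head -n "$((max_reads * 4))" \
        | awk -v a="$JUNCTION" -v b="$JUNCTION_RC" '
            NR % 4 == 2 { n++; if (index($0, a) || index($0, b)) hits++ }
            END { printf "%d %d\n", hits, n }'
}

# Example call:
#   count_junction_reads /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz
```

A count near zero would argue against a junction-adapter mate-pair library, although reads that were already adapter-trimmed would also give a low count.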


The line "PE = pa 500 ..." in your DATA section declares the data as paired-end, which is wrong if the data are mate-pair as you say. Insert sizes for mate-pair libraries are usually much larger, while paired-end inserts only go up to about 700 bp. Try looking at the insert-size distribution and duplication rates; both tend to be high for mate-pair libraries.
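For comparison, MaSuRCA's sample configuration declares mate-pair (jumping) libraries with a JUMP line rather than a PE line. The fragment below is only a sketch: the prefix "mp", the mean 3000, the standard deviation 300, and both paths are hypothetical placeholders, not values for this data set.

```text
DATA
# paired-end:  PE   = <two-letter prefix> <mean> <stdev> <forward reads> <reverse reads>
# mate-pair:   JUMP = <two-letter prefix> <mean> <stdev> <forward reads> <reverse reads>
JUMP = mp 3000 300 /hypothetical/path/mp_R1.fastq /hypothetical/path/mp_R2.fastq
END
```

As far as I recall, MaSuRCA also expects at least one paired-end library in the DATA section, so a run with only a JUMP entry may not be supported; that is worth checking against the manual for the installed version.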


The library construction for mate-pair and paired-end data is different, and based on the insert size of "500", this looks like paired-end. You should either ask the people who sequenced the data or map your reads to a closely related species to estimate the insert size. P.S. You can run MaSuRCA with pre-processed data even though the manual advises against it.
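The mapping approach could look roughly like this, assuming bwa and samtools are installed and a reference FASTA from a related species is available. `map_and_stats` and `summarize_pairs` are made-up helper names, and the 8-thread default is arbitrary.

```shell
#!/usr/bin/env bash
# map_and_stats REF R1 R2 [THREADS]   (hypothetical helper)
# Indexes the reference, maps the read pairs with bwa mem, and prints
# the samtools stats report. Requires bwa and samtools on PATH.
map_and_stats() {
    local ref=$1 r1=$2 r2=$3 threads=${4:-8}
    bwa index "$ref"
    bwa mem -t "$threads" "$ref" "$r1" "$r2" | samtools stats -
}

# summarize_pairs   (hypothetical helper)
# Reads a samtools stats report on stdin and keeps only the summary
# lines about insert size and pair orientation.
summarize_pairs() {
    awk -F'\t' '$1 == "SN" && $2 ~ /^(insert size average|insert size standard deviation|inward oriented pairs|outward oriented pairs):$/ {
        print $2 " " $3
    }'
}

# Example call (the reference path is a placeholder):
#   map_and_stats related_species.fa reads_R1.fastq.gz reads_R2.fastq.gz | summarize_pairs
```

Mostly inward oriented pairs with an average near 500 would point to paired-end; mostly outward oriented pairs with a much larger average would point to mate-pair. Mapping a subset of the reads is usually enough for this. Note that samtools stats ignores pairs above a maximum insert size (8000 by default, adjustable with -i, as far as I know), so very long inserts can be missed.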
