Reading single-end reads takes forever
7.6 years ago
RoseString ▴ 10

Dear Abyss developers,

Background: I recently used ABySS 2.0.2 to assemble my SE (25x), PE (25x) and MP (50x) reads into an assembly with a scaffold N50 of 2 Mb, which is relatively good. However, the unitig N50 is only 4 kb, so the sequences contain many Ns. It seems that ABySS aligns the PE and MP reads to unitigs assembled from the SE reads only, so a large proportion of the PE and MP reads are wasted (they fall into the 'Different' category). For example, fewer than 40% of the reads in the 10 kb MP libraries can be aligned. With such low SE coverage, I don't think the SE reads alone are ideal starting material.

Approach: I am trying to concatenate all SE, PE and MP reads into "super-SE" reads to increase the unitig N50 and to improve the efficiency of the subsequent PE and MP alignment. I have applied very strict quality control to my MP reads to remove Nextera adaptors and transposase sequences, so I don't think there are chimeras (defined as reads combining two fragments that are far apart). After constructing the super-SE reads by concatenating all fastq.gz files, I reran the assembly with the following command:

abyss-pe np=16 name=SWS k=66 pe='pe1' mp='mp1 mp2 mp3 mp4' \
se='SWS_super_SE.trimmomatic.fq.gz' \
pe1='SWS_PE_1.trimmomatic.fq.gz SWS_PE_2.trimmomatic.fq.gz' \
mp1='SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz' \
mp2='SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz' \
mp3='SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz' \
mp4='SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz'
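Since the super-SE file is built by concatenating fastq.gz files, one cheap sanity check before a multi-day run is to confirm that the combined file is an intact gzip stream and holds as many reads as its inputs together. The sketch below is only an illustration: the function names count_reads and check_concat are made up here, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```shell
#!/bin/sh
# count_reads FILE...: print the number of FASTQ records in the given
# gzip-compressed files, assuming exactly 4 lines per record.
count_reads() {
    lines=$(gzip -dc "$@" | wc -l)
    echo $((lines / 4))
}

# check_concat COMBINED INPUT...: succeed only if COMBINED passes the gzip
# integrity test and contains as many records as all INPUT files together.
check_concat() {
    combined=$1
    shift
    gzip -t "$combined" || return 1
    [ "$(count_reads "$combined")" = "$(count_reads "$@")" ]
}
```

For example, check_concat SWS_super_SE.trimmomatic.fq.gz followed by the list of input files should succeed if the concatenation went through cleanly. On files of this size the check itself takes a while, because every file is decompressed once.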

Problem: ABySS has now spent several days reading the "super-SE" fastq.gz file, and the following log is all I have so far.

mpirun --mca btl_sm_use_knem 0 -np 16 ABYSS-P -k66 -q3
--coverage-hist=coverage.hist -s SWS-bubbles.fa -o SWS-1.fa SWS_super_SE.trimmomatic.fq.gz

ABySS 2.0.2

ABYSS-P -k66 -q3 --coverage-hist=coverage.hist -s SWS-bubbles.fa -o SWS-1.fa SWS_super_SE.trimmomatic.fq.gz

Running on 16 processors

1: Running on host iw-k32-34

...

...

0: Running on host iw-k32-34

0: Reading `SWS_super_SE.trimmomatic.fq.gz'...

Troubleshooting: Based on my past experience with ABySS, it seems strange for it to take several days to read 80G of fastq.gz files. I can think of a couple of possible reasons:

  1. The PE and MP /1 and /2 reads have the same read names (differing only in the 1 or 2 suffix), so ABySS may run into a hashing problem with the super-SE file. I therefore concatenated only the /1 reads from the PE and MP libraries, but the same issue persists.

  2. A problem with Open MPI, which I know little about.
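The first hypothesis can be checked directly on the input files instead of by re-running the assembly. The following is a rough sketch, not an ABySS tool: dup_read_names is a name invented here, and it assumes 4-line FASTQ records whose names end in /1 or /2 or carry the mate number after a space.

```shell
#!/bin/sh
# dup_read_names FILE...: print each read name that occurs more than once
# across the given gzip-compressed FASTQ files. A trailing /1 or /2 is
# stripped, and anything after the first whitespace in the header is ignored.
dup_read_names() {
    gzip -dc "$@" |
        awk 'NR % 4 == 1 { name = $1; sub(/\/[12]$/, "", name); print name }' |
        sort | uniq -d
}
```

Running it on a subsample first (for example the first few million lines of each file) keeps the sort manageable; an empty result means no shared names were found in the data examined.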

Any ideas what could have gone wrong? Thank you very much in advance!

7.6 years ago
benv ▴ 730

Hi @RoseString,

Thanks for providing so many details about your assembly job and in such a nicely organized format. It really saves time and helps me to understand what is going on.

The se=..., pe=..., mp=..., in=..., lib=... parameters are, unfortunately, a frequent point of confusion for ABySS users. Your understanding of the se/pe/mp parameters is fundamentally correct: se is used by the unitig stage only, pe is used for alignment only (contig stage), and mp is used for alignment only (scaffold stage). However, you can use the lib parameter to specify paired-end libraries that should be used both for unitig assembly and for alignment during the contig stage.

For me, the usual procedure for assembly is to use both the single end and paired-end reads for the unitig stage (as you have suggested), the paired-end reads for alignment in the contig stage, and the mate pair reads for alignment in the scaffolding stage. To do that you could change your command to:

abyss-pe np=16 name=SWS k=66 lib='pe1' mp='mp1 mp2 mp3 mp4' \
se='SWS_SE.trimmomatic.fq.gz' \
pe1='SWS_PE_1.trimmomatic.fq.gz SWS_PE_2.trimmomatic.fq.gz' \
mp1='SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz' \
mp2='SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz' \
mp3='SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz' \
mp4='SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz'

If you want to try using the mate-pair reads in the unitig stage as well, just add the mate-pair files to the se variable:

abyss-pe np=16 name=SWS k=66 lib='pe1' mp='mp1 mp2 mp3 mp4' \
se='SWS_SE.trimmomatic.fq.gz SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz' \
pe1='SWS_PE_1.trimmomatic.fq.gz SWS_PE_2.trimmomatic.fq.gz' \
mp1='SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz' \
mp2='SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz' \
mp3='SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz' \
mp4='SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz'

There is an abyss-pe man page that describes the usage of the se/pe/mp/in/lib parameters. If you don't have the man page installed, you can get it from the ABySS source tarball. It is in the doc subdirectory (type $ man doc/abyss-pe.1).

I'm not sure why your assembly job is slow. If it is a large genome, it may be normal (what is your expected genome size?). You can get much more helpful/reassuring log output if you add v=-v to your abyss-pe command to enable verbose logging. If it goes a long time (e.g. several hours) without outputting any new progress message in verbose mode, something is probably wrong.
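One way to apply the "several hours without a progress message" rule of thumb is to capture the verbose output in a file and check how long ago that file last changed. This is a minimal sketch under a few assumptions: the log file name abyss.log and the function log_is_stalled are made up for illustration, and find -mmin is a GNU/BSD extension rather than POSIX.

```shell
#!/bin/sh
# Assumed setup: the verbose run was started with its output saved, e.g.
#   abyss-pe <your usual parameters> v=-v 2>&1 | tee abyss.log
# (abyss.log is a hypothetical file name.)

# log_is_stalled LOG MINUTES: succeed if LOG exists and was last modified
# more than MINUTES minutes ago.
log_is_stalled() {
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}
```

For example, log_is_stalled abyss.log 180 && echo "no new output for 3 hours" could be run from cron or a watch loop while the job is going.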


Thank you so much for your detailed instruction, @benv. That's really helpful! I'm gonna try your suggested commands.


Hi @benv,

I added v=-v to my command and found that the reading step became really slow once it reached ~30%. A post describing a similar problem (http://seqanswers.com/forums/showthread.php?t=61602) suggested switching from Open MPI to MPICH 3. After I did that, the reading step became significantly faster.

Now I am running into a strange problem in the last few steps (see the error message below). This is for k=50. Do you know of any way to get around this issue? Thank you again!

n        n:1000  L50  min   N80     N50      N20      E-size   max      sum      name
3071100  9056    100  1000  803658  2320277  5991757  3703736  15.02e6  962.6e6  s=1000

Best scaffold N50 is 2320277 at s=1000.

PathConsensus -v --dot -k50 -p0.9 -s SWS-7.fa -g SWS-7.dot -o SWS-7.path SWS-6.fa SWS-6.dot SWS-6.path

Reading `SWS-6.dot'...

Reading 'SWS-6.fa'...

Reading 'SWS-6.path'...

Read 2014 paths

/usr/local/bin/abyss-pe:660: recipe for target 'SWS-7.dot' failed

make: *** [SWS-7.dot] Segmentation fault (core dumped)

make: *** [SWS-7.dot] Deleting file 'SWS-7.fa'


UPDATE: The run finished successfully after I switched to Open MPI 2.1.0. Thanks!


Good. Glad to hear that!


How do I assemble a single-end FASTQ file? I checked the man page, but it only seems to describe paired-end assembly. Can ABySS also be used to assemble single-end FASTQ files?


Sorry for the slow response. To run a single-end-only assembly, specify the input single-end reads with se="<single-end FASTQ files>" or in="<single-end FASTQ files>" and add the word "unitigs" to the end of the abyss-pe command.
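Put together, a single-end run along those lines might look like the sketch below. The wrapper function run_se_assembly and its argument order are invented for illustration; only the name=, k=, se= parameters and the unitigs target come from abyss-pe itself, and the k value has to be chosen for your own data.

```shell
#!/bin/sh
# run_se_assembly READS NAME K: run a unitig-only ABySS assembly of one
# single-end FASTQ file. Refuses to start if abyss-pe or the reads are missing.
run_se_assembly() {
    reads=$1
    name=$2
    k=$3
    command -v abyss-pe >/dev/null 2>&1 || {
        echo "abyss-pe not found on PATH" >&2
        return 1
    }
    [ -f "$reads" ] || {
        echo "no such file: $reads" >&2
        return 1
    }
    # "unitigs" is the target: the pipeline stops after the unitig stage.
    abyss-pe name="$name" k="$k" se="$reads" unitigs
}
```

Called as run_se_assembly reads_SE.fq.gz myasm 50 (hypothetical file and assembly names), this should leave the assembled sequences in myasm-unitigs.fa.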
