Question

Reading single-end reads takes forever

8.3 years ago

RoseString ▴ 10

Dear Abyss developers,

Background: I recently used ABySS 2.0.2 to assemble my SE (25x), PE (25x) and MP (50x) reads into an assembly with a scaffold N50 of 2 Mb, which is relatively good. However, the unitig N50 is only 4 kb, so the sequences contain many Ns. It seems that ABySS aligns the PE and MP reads only to the unitigs assembled from the SE reads, so a large proportion of the PE and MP reads are wasted (they end up in the 'Different' category). For example, fewer than 40% of the reads from the 10 kb MP libraries can be aligned. With such low SE coverage, I suspect the SE reads alone are not ideal starting material.

Approach: I am trying to concatenate all SE, PE and MP reads into "super-SE" reads to increase the unitig N50 and to improve the efficiency of the subsequent PE and MP alignment. I have done very strict quality control of my MP reads to remove Nextera adaptors and transposase sequences, so I don't think there are chimeras (defined as reads combining two fragments that are far apart). After constructing the super-SE reads by concatenating all fastq.gz files, I reran the assembly with the following command:

abyss-pe np=16 name=SWS k=66 pe='pe1' mp='mp1 mp2 mp3 mp4' \
se='SWS_super_SE.trimmomatic.fq.gz' \
pe1='SWS_PE_1.trimmomatic.fq.gz SWS_PE_2.trimmomatic.fq.gz' \
mp1='SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz' \
mp2='SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz' \
mp3='SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz' \
mp4='SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz'

Problem: ABySS has now spent several days reading the "super-SE" fastq.gz file. The following log is all I have so far.

mpirun --mca btl_sm_use_knem 0 -np 16 ABYSS-P -k66 -q3
--coverage-hist=coverage.hist -s SWS-bubbles.fa -o SWS-1.fa SWS_super_SE.trimmomatic.fq.gz

ABySS 2.0.2

ABYSS-P -k66 -q3 --coverage-hist=coverage.hist -s SWS-bubbles.fa -o SWS-1.fa SWS_super_SE.trimmomatic.fq.gz

Running on 16 processors

1: Running on host iw-k32-34

...

...

0: Running on host iw-k32-34

0: Reading `SWS_super_SE.trimmomatic.fq.gz'...

Troubleshooting: Based on my past experience with ABySS, it seems strange for it to take several days to read an 80 GB fastq.gz file. I can think of two possible reasons:

1. The /1 and /2 reads of each PE and MP pair have the same read name (only the 1 or 2 suffix differs), so ABySS may run into a hashing problem with the super-SE file. To test this, I concatenated only the /1 reads from PE and MP, but the same issue persists.
2. Some problem with OpenMPI, which I know little about.
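
A third possibility, a truncated or corrupt concatenated file, can be ruled out independently of ABySS. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_fastq_reads` is a hypothetical helper name, not an ABySS tool.

```shell
# Hypothetical helper: verify a gzip file and report how many FASTQ
# records it holds. Assumes standard 4-line FASTQ records.
count_fastq_reads() {
    local fq="$1"
    # gzip -t reads the whole file and fails on truncated or corrupt members.
    gzip -t "$fq" || { echo "corrupt or truncated: $fq" >&2; return 1; }
    # Decompress to stdout and divide the line count by 4.
    local lines
    lines=$(gzip -dc "$fq" | wc -l)
    echo $(( lines / 4 ))
}
```

Running `count_fastq_reads SWS_super_SE.trimmomatic.fq.gz` should print the sum of the read counts of the files that were concatenated. It also gives a rough idea of how long a plain sequential read of the file takes on the same filesystem.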

Any ideas what could have gone wrong? Thank you very much in advance!

abyss • 3.4k views

updated 8.3 years ago by benv ▴ 730 • written 8.3 years ago by RoseString ▴ 10

score 5 · Accepted Answer · 2017-04-26

Hi @RoseString,

Thanks for providing so many details about your assembly job and in such a nicely organized format. It really saves time and helps me to understand what is going on.

The se=..., pe=..., mp=..., in=... and lib=... parameters are, unfortunately, a frequent point of confusion for ABySS users. Your understanding of the se/pe/mp parameters is fundamentally correct: se is used by the unitig stage only, pe is used for alignment only (contig stage), and mp is used for alignment only (scaffold stage). However, you may use the lib parameter to indicate paired-end libraries that should be used both for unitig assembly and for alignment during the contig stage.

For me, the usual procedure for assembly is to use both the single-end and paired-end reads for the unitig stage (as you have suggested), the paired-end reads for alignment in the contig stage, and the mate-pair reads for alignment in the scaffolding stage. To do that, you could change your command to:

abyss-pe np=16 name=SWS k=66 lib='pe1' mp='mp1 mp2 mp3 mp4' \
se='SWS_SE.trimmomatic.fq.gz' \
pe1='SWS_PE_1.trimmomatic.fq.gz SWS_PE_2.trimmomatic.fq.gz' \
mp1='SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz' \
mp2='SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz' \
mp3='SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz' \
mp4='SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz'

If you want to try using the mate-pair reads in the unitig stage as well, just add the mate-pair files to the se variable:

abyss-pe np=16 name=SWS k=66 lib='pe1' mp='mp1 mp2 mp3 mp4' \
se='SWS_SE.trimmomatic.fq.gz SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz' \
pe1='SWS_PE_1.trimmomatic.fq.gz SWS_PE_2.trimmomatic.fq.gz' \
mp1='SWS_MP_1-4Kb_1.trimmomatic.fq.gz SWS_MP_1-4Kb_2.trimmomatic.fq.gz' \
mp2='SWS_MP_4-7Kb_1.trimmomatic.fq.gz SWS_MP_4-7Kb_2.trimmomatic.fq.gz' \
mp3='SWS_MP_7-10Kb_1.trimmomatic.fq.gz SWS_MP_7-10Kb_2.trimmomatic.fq.gz' \
mp4='SWS_MP_10-15Kb_1.trimmomatic.fq.gz SWS_MP_10-15Kb_2.trimmomatic.fq.gz'
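
Because the se list above is long and easy to mistype, it can also be built with a small loop. This is only a sketch that assumes the file naming pattern shown in this thread; `mp_files` and `se_files` are hypothetical variable names.

```shell
# Collect both mates of every mate-pair library into one space-separated list.
mp_files=""
for size in 1-4 4-7 7-10 10-15; do
    for mate in 1 2; do
        mp_files="$mp_files SWS_MP_${size}Kb_${mate}.trimmomatic.fq.gz"
    done
done

# Prepend the original single-end file; mp_files already starts with a space.
se_files="SWS_SE.trimmomatic.fq.gz$mp_files"

# Print the list so it can be checked before starting the assembly.
echo "$se_files"
```

The result can then be passed as `se="$se_files"` in place of the long quoted list, with the other parameters unchanged.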

There is an abyss-pe man page that describes the usage of the se/pe/mp/in/lib parameters. If you don't have the man page installed, you can get it from the ABySS source tarball. It is in the doc subdirectory (type $ man doc/abyss-pe.1).

I'm not sure why your assembly job is slow. If it is a large genome, it may be normal (what is your expected genome size?). You can get much more helpful/reassuring log output if you add v=-v to your abyss-pe command to enable verbose logging. If it goes a long time (e.g. several hours) without outputting any new progress message in verbose mode, something is probably wrong.