I am playing around with "preparing" my data for shotgun assembly with ABySS.
I have the following data, bird genome:
2x PE libraries (500 and 1000bp)
2x LMP libraries (3kb and 10kb)
All data were produced as PE100 HiSeq runs.
LMP libs have been trimmed and reverse-complemented with Illumina's NxTrim.
In a first attempt I wanted to merge all my libraries (separately) with konnector2.
I chose a few kmers ranging from 31 to 91 to get an idea of the best one. The size of the bloom filter was pretty small, ~2G which led to a relatively high FPR.
Now I restarted a few konnector2 jobs with a large bloom filter of 128G .. this seems to take ages on a pretty potent server ..
Now I have a few questions regarding konnector2 and its corresponding workflow:
Even after reading the paper I am not really sure how to choose the "best" bloom filter size. Any rules of thumb?
When I merge all my libraries, how do I choose my assembly setup with abyss? I use the merged fasta files as se libs input for abyss. I use the "leftover" fastq files as intended (pe or mp libs). That's it? What is your strategy for assembling such a dataset?
Regarding your first question about long run times with Konnector2:
It is important that the Bloom filter FPR stays below 25%, or the graph search algorithm may run forever. The FPR falls roughly in inverse proportion to the Bloom filter size and rises roughly linearly with the number of distinct k-mers in your data set. Sorry, there is no easy way to choose the right Bloom filter size other than learning about Bloom filters and understanding the math (the Konnector2 Bloom filters use a single hash function). But as a quick guideline, I recall needing about 40GB for an 80X human dataset.
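To make the math a bit more concrete, here is a minimal sketch of the textbook estimate for a Bloom filter with a single hash function: with n distinct k-mers and m bits, the expected FPR is about 1 - exp(-n/m), which is close to n/m when the filter is sparsely filled. It assumes the whole filter is one flat bit array; the function names and the k-mer count are hypothetical, and the real count (including k-mers created by sequencing errors) has to be estimated from your own data.

```python
import math


def bloom_fpr(distinct_kmers: int, size_bits: int) -> float:
    """Expected FPR of a one-hash Bloom filter holding `distinct_kmers` items."""
    return 1.0 - math.exp(-distinct_kmers / size_bits)


def min_bloom_bytes(distinct_kmers: int, target_fpr: float) -> int:
    """Smallest filter size in bytes whose expected FPR is <= target_fpr."""
    bits = -distinct_kmers / math.log(1.0 - target_fpr)
    return math.ceil(bits / 8)


if __name__ == "__main__":
    # Hypothetical number of distinct k-mers; replace with your own estimate.
    n = 3_000_000_000
    for gb in (2, 40, 128):
        size_bits = gb * 8 * 10**9  # 1 GB taken as 10^9 bytes
        print(f"{gb:>4} GB -> expected FPR {bloom_fpr(n, size_bits):.3f}")
    print("bytes needed for 5% FPR:", min_bloom_bytes(n, 0.05))
```

If the filter memory is split internally (for example across several levels), the effective number of bits per level is smaller and the estimate has to be adjusted accordingly.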
Konnector2 has a known issue where it will sometimes stall at low k values (typically values in the range of k=20..60). Unfortunately there is no fix for this at the moment, but it is something to be aware of. Run konnector with the verbose flag (i.e. -v) and you will be able to see regular progress messages to verify that your jobs are proceeding successfully.
Regarding your second question, I recommend the following configuration:
se = merged/extended konnector2 reads + unmerged/unextended konnector2 reads
pe = raw PET
mp = raw MPET
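As a rough illustration of the configuration above, the following sketch assembles an abyss-pe command line using its `name=`, `k=`, `lib=`, `mp=` and `se=` parameters. The helper function, the library names, all file names and the k value are hypothetical placeholders; substitute whatever your konnector runs actually wrote.

```python
import shlex


def abyss_pe_command(name, k, pe_libs, mp_libs, se_files):
    """Build an abyss-pe command line from {library name: [files]} mappings."""
    args = ["abyss-pe", f"name={name}", f"k={k}"]
    args.append("lib=" + " ".join(pe_libs))  # names of the paired-end libraries
    args.append("mp=" + " ".join(mp_libs))   # names of the mate-pair libraries
    for lib, files in {**pe_libs, **mp_libs}.items():
        args.append(f"{lib}=" + " ".join(files))
    args.append("se=" + " ".join(se_files))  # konnector merged + unmerged reads
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # All file names below are hypothetical.
    pe = {
        "pe500": ["pe500_1.fq", "pe500_2.fq"],
        "pe1000": ["pe1000_1.fq", "pe1000_2.fq"],
    }
    mp = {
        "mp3k": ["mp3k_1.fq", "mp3k_2.fq"],
        "mp10k": ["mp10k_1.fq", "mp10k_2.fq"],
    }
    se = ["pe500_merged.fa", "pe500_unmerged.fq"]
    print(abyss_pe_command("bird", 64, pe, mp, se))
```

The raw paired reads go in under `lib=` and `mp=`, while everything konnector produced is passed as single-end input.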
written 9.1 years ago by benv (updated 5.0 years ago by Ram)
Hi Ben,
thanks for your helpful suggestions. I will try just as proposed .. :-)
best,
Sven
written 9.1 years ago by sklages (updated 5.0 years ago by Ram)
Hi Sven, you're welcome.
A couple of things I forgot to mention regarding konnector performance:
Decreasing the -B (max-branches) parameter can really speed things up. -B specifies the maximum search breadth before "giving up" and I often decrease from the default value of 350 to 100 to get more reasonable run times.
Make sure -F (max fragment length) is set correctly for your PET data. It limits the depth of the graph search and so having it set higher than necessary (default is 1000) can slow things down.
There is some additional computational overhead to using Konnector2's -E (outward extension) and -D (duplication sequence filtering) options. You also may wish to try without those options (i.e. Konnector1 mode).
Related to the point above, we have done a lot more assembly testing with Konnector1 mode than with Konnector2 (-E/-D options). So I would say that Konnector1 mode is a 'safer' strategy. But if you want to live on the bleeding edge, feel free :-)
Sorry about the late reply.
written 9.1 years ago by benv (updated 5.0 years ago by Ram)
I was wondering if there is a way to have sealer report bloom filter FPR since it is not in the log file?
I would like to run sealer with larger bloom filter sizes. However if I use threads on a linux cluster the mem limit is set per thread... it seems like if I set the sealer bloom filter size to 4G and ask the cluster for 1G x 4 threads my job dies for exceeding mem requests. Does this sound correct to you and if so does that mean I have to request a full bloom filter size worth of memory for each thread? If so requesting 40G for a human genome would become cumbersome if I use threads.
Sorry, I will fix Sealer to report the FPR in the next ABySS release. In the meantime, the only way to find out the FPR is to build the Bloom filter externally first with abyss-bloom. (See here for an abyss-bloom usage example)
I think your original approach is correct: set the Bloom filter size to 4G and then allocate 1G x 4 across the threads. It's just that you need to give each of the threads a little bit of extra memory for "breathing room". The Bloom filter requires exactly 4G but Sealer needs a bit of extra memory for its other data structures (it should be quite small in comparison to the Bloom filter itself). Try 1.5G x 4 threads, for example.
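The arithmetic behind that suggestion can be sketched as follows, assuming the scheduler enforces the limit per thread and the Bloom filter is shared by all threads. The function name is made up and the overhead figure is a guess to tune for your own jobs, not a measured Sealer value.

```python
def per_thread_request_gb(bloom_gb: float, threads: int, overhead_gb: float) -> float:
    """Memory to request per thread so the job total covers filter + overhead."""
    total_gb = bloom_gb + overhead_gb  # the filter is shared, so it counts once
    return total_gb / threads


if __name__ == "__main__":
    # 4G filter, 4 threads, 2G assumed overhead: the "1.5G x 4" case above.
    print(per_thread_request_gb(4, 4, 2))
    # A 40G filter spread over 8 threads, with the same assumed overhead.
    print(per_thread_request_gb(40, 8, 2))
```

So a large filter does not need to be requested once per thread; it only has to fit in the sum of the per-thread requests.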
written 8.9 years ago by benv (updated 5.0 years ago by Ram)