Question

Tool: elPrep 4.0.0, a high-performance drop-in replacement tool for GATK4/Picard/SAMtools for processing SAM/BAM files

8

6.1 years ago

Charlotte.Herzeel ▴ 150

Dear colleagues,

We are happy to announce the release of elPrep 4.0.0, an open-source, drop-in replacement for GATK4/Picard/SAMtools for preparing SAM/BAM files for variant calling. It produces identical results while greatly improving computational performance. For more details, see the elPrep GitHub repository.

elPrep 4.0.0 introduces multiple new features allowing us to process the preparation steps defined by the GATK Best Practices for variant calling.

New features include:

added base quality score recalibration (BQSR)
added optical duplicate marking
added metrics (MultiQC compatible)
support for SAM File Format version 1.6
support for FASTA and VCF files
support for elPrep-specific elsites and elfasta formats
split/filter/merge (sfm) mode now implemented in Go instead of Python
added --log-path option to all tools
various API and performance improvements
changed license to the GNU Affero General Public License version 3 as published by the Free Software Foundation, with Additional Terms
updated demos

Our benchmarks show that elPrep 4.0.0 executes the four-step pipeline from the GATK Best Practices (sorting, duplicate marking, base quality score recalibration, and applying BQSR) up to 12x faster for WES data and 7.5x faster for WGS data, while utilising similar or fewer compute resources than Picard/GATK4.

Example runtime, RAM use, and disk use for 50x WGS Illumina Platinum Genome NA12878 aligned against hg38. elPrep combines the execution of the 4 pipeline steps for efficient parallel execution.

We are looking forward to your feedback and suggestions.

Thanks a lot!

Kind regards,

Charlotte Herzeel, Exascience Life Lab, Imec, Belgium

sam bam bqsr • 5.3k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 6.1 years ago by Charlotte.Herzeel ▴ 150

2

Hi, this is a great tool! I feel you forgot to mention elPrep's modularity by design. Adding/removing filters to our elPrep call to suit our pipeline needs is done in a breeze. This allows us to use it in all sorts of NGS pipelines, and not just the GATK's Best Practices for variant calling. I also really like that -- with a bit of work -- it's not extremely difficult to add new filters to suit our needs. Plus, you guys have always been very responsive to such requests. This is a very efficient and valuable tool for the community. Thanks!

ADD REPLY • link 6.1 years ago by Leonor Palmeira 3.9k

1

Thanks! elPrep is indeed designed as a modular plug-in architecture where the implementation of SAM/BAM tools is separated from the engine that parallelises and merges their execution. We have extensive documentation and very much welcome contributions and suggestions for extending elPrep to support different sequencing pipelines!
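To make the idea concrete, here is a minimal sketch of that separation between filters and the engine that runs them. It is an illustration only, not elPrep's actual API (elPrep itself is written in Go): the names `drop_unmapped`, `add_read_group`, and `run_pipeline` are hypothetical, reads are modelled as plain dictionaries, and the parallelisation is left out.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# A read is modelled as a plain dict. A filter returns the (possibly modified)
# read, or None to drop it. All names here are hypothetical, not elPrep's API.
Read = dict
Filter = Callable[[Read], Optional[Read]]


def drop_unmapped(read: Read) -> Optional[Read]:
    """Drop reads that have the SAM 'unmapped' flag bit (0x4) set."""
    return None if read["flag"] & 0x4 else read


def add_read_group(rg: str) -> Filter:
    """Build a filter that tags every read with the given read group."""
    def apply(read: Read) -> Optional[Read]:
        return {**read, "RG": rg}
    return apply


def run_pipeline(reads: Iterable[Read], filters: List[Filter]) -> Iterator[Read]:
    """The 'engine': applies every filter to each read in a single pass."""
    for read in reads:
        for f in filters:
            read = f(read)
            if read is None:
                break  # a filter dropped this read; skip the remaining filters
        else:
            yield read


reads = [{"qname": "r1", "flag": 0}, {"qname": "r2", "flag": 4}]
kept = list(run_pipeline(reads, [drop_unmapped, add_read_group("sample1")]))
print(kept)
```

Adding or removing a step then only means changing the list of filters passed to the engine; the engine itself stays untouched.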

ADD REPLY • link 6.1 years ago by Charlotte.Herzeel ▴ 150

0

Thanks for the API documentation link!

ADD REPLY • link 6.1 years ago by Leonor Palmeira 3.9k

0

Hi, I tried this some time ago and found it made significant assumptions about read names, i.e. data from the SRA or non-Illumina sequencers could not be processed. Have these requirements been relaxed in the meantime?

ADD REPLY • link 6.1 years ago by colindaven 7.0k

1

Hi, we only make assumptions about the read names (QNAME) for optical duplicate marking, as they have to encode the tile + coordinates. Is this what you mean? If not, could you provide more details, e.g. the error message you get? Thanks!

ADD REPLY • link 6.1 years ago by Charlotte.Herzeel ▴ 150

1

If one does not fastq-dump data from SRA with the -F or --origfmt option, then one ends up with fastq headers that replace the standard Illumina headers with something that looks like this:

@SRR7716298.4 4 length=100
CTGCAATAAGAGCTCGATGTCATTATGTTAAGAAAAAATGGCTCGGAGGTATGGGAACGAAGTGGTATACTACAGAAACGAGACTTCGTAAGTTCAGGTA
+SRR7716298.4 4 length=100
AAAFFJFJJJJAJ<F-F7FFF<-77-7<-7----FF<<77F7AJAJ7JJJJF7AAA<J<-7-<AA-A77F7-AJJ-<A-AJFJJ--<F7AAA-<7A-F77

I believe that is what @colindaven is referring to. Then there are probably headers from other technologies that don't follow the Illumina format.
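As a rough illustration of why such names carry no positional information, here is a minimal sketch of recovering tile and x/y coordinates from a read name. It assumes the common colon-separated Illumina layouts with 5 or 7 fields, where the last three fields are tile, x, and y; `parse_tile_location` is a hypothetical helper, not code from elPrep or Picard, and it does not handle suffixes such as `#0/1`.

```python
from typing import NamedTuple, Optional


class TileLocation(NamedTuple):
    tile: int
    x: int
    y: int


def parse_tile_location(qname: str) -> Optional[TileLocation]:
    """Return tile/x/y from an Illumina-style read name, or None if absent.

    Assumption: names have 5 or 7 colon-separated fields and the last
    three are tile, x, and y.
    """
    fields = qname.split(":")
    if len(fields) not in (5, 7):
        return None
    try:
        tile, x, y = (int(f) for f in fields[-3:])
    except ValueError:
        return None  # the trailing fields are not plain integers
    return TileLocation(tile, x, y)


# Made-up Illumina-style name: tile and coordinates can be recovered.
print(parse_tile_location("A00123:45:HXXXXDSXX:1:1101:1234:5678"))
# SRA-rewritten name from the example above: nothing to recover.
print(parse_tile_location("SRR7716298.4"))
```

For a name like `SRR7716298.4` the function returns `None`, so any step that depends on physical location on the flowcell has nothing to work with.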

ADD REPLY • link 6.1 years ago by GenoMax 147k

2

Yes, you are right. We have seen the same problem. elPrep currently only supports the Illumina format for optical duplicate marking, which is what GATK4 also supports by default. If you would like us to support other formats, please submit an issue on our GitHub repository so we can discuss this in more detail. Thanks a lot.

ADD REPLY • link 6.1 years ago by Charlotte.Herzeel ▴ 150

1

(Edited: solved) Looks like elPrep works in all other cases where there is no optical duplication.

There is a lot of data in SRA where one cannot recover the original read formatting, even if these were originally produced on that instrument.

ADD REPLY • link 6.1 years ago by Istvan Albert 102k

1

Nonetheless, it's a nice tool and it's great people are trying to speed up bioinformatics infrastructure akin to what is going on in the commercial and semi-commercial world with DRAGEN, MPEG-G and so on. So I will test a bit on Illumina X10 and NextSeq data I have and give feedback on any bugs I encounter.

ADD REPLY • link 6.1 years ago by colindaven 7.0k

1

Hi,

I am a bit confused about your remark. We tested elPrep a lot, including on data from SRA archives, but apart from optical duplicate marking, we haven’t encountered any issues because of QNAME fields. When elPrep is not able to recover tile information from the QNAME fields, it will skip optical duplicate marking and log a warning. Any other commands in the elPrep call should continue executing without problems.

There are two other places where elPrep code refers to QNAME fields. One is when sorting reads by queryname. The other is for correlating the two ends of a pair during duplicate marking, and for resolving ties when duplicates have the same phred score. To the best of our knowledge, we are in both cases faithfully reproducing the behaviour of Picard and GATK. We think that even for optical duplicate marking, if Picard sees QNAME fields without tile information, it will also not be able to properly mark optical duplicates.
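To illustrate what resolving ties by QNAME can look like, here is a minimal sketch of picking the read to keep from a set of duplicates. The scoring rule (sum of base qualities of at least 15) and the tie-break direction (lexicographically smallest QNAME wins) are assumptions made for this example, and `Candidate`, `score`, and `pick_representative` are hypothetical names; this is not the exact logic of elPrep, Picard, or GATK.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    qname: str
    base_qualities: List[int]


def score(c: Candidate, min_quality: int = 15) -> int:
    """Sum of base qualities at or above min_quality (assumed scoring rule)."""
    return sum(q for q in c.base_qualities if q >= min_quality)


def pick_representative(duplicates: List[Candidate]) -> Candidate:
    """Keep the highest-scoring read; on equal scores, fall back to QNAME.

    Assumption: the lexicographically smallest QNAME wins a tie. The point
    is only that the choice is deterministic regardless of input order.
    """
    return min(duplicates, key=lambda c: (-score(c), c.qname))


dups = [
    Candidate("read_b", [30, 30, 10]),  # score 60 (the 10 is below the cutoff)
    Candidate("read_a", [30, 30, 12]),  # score 60, ties with read_b
    Candidate("read_c", [20, 20]),      # score 40
]
print(pick_representative(dups).qname)  # read_a
```

Because the tie-break depends only on the read names, shuffling the input does not change which read is kept, which is what makes results reproducible across tools and runs.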

We are primarily software engineers, so it is certainly possible we may be missing something. If you could clarify what the issue is you are referring to, we are very happy to make an attempt at fixing it.

Thanks a lot for your help.

ADD REPLY • link 6.1 years ago by Charlotte.Herzeel ▴ 150

1

I was simply reacting to your statement above where you say:

elPrep currently only supports the Illumina format,

That is a much more restrictive statement than the second statement that you make in your reply:

but apart from optical duplicate marking, we haven’t encountered any issues because of QNAME fields

If you support an Illumina-specific functionality (among the many others), that does not mean that the tool "only supports Illumina format". Frankly, there is not even such a thing as "Illumina format". It just happens that for the past few years the most popular instruments produced read names formatted in a certain way, but that is not really a format, nor did Illumina instruments always produce that format.

If the tool works fine on, say, PacBio data, other than optical marking (which would not even apply there anyway), then it is all good and it is a fair replacement for samtools.

ADD REPLY • link 6.1 years ago by Istvan Albert 102k

0

The original elPrep paper describes the sorting and duplicate marking implementations.

Is there a paper in preparation describing the BQSR implementation and the new features?

ADD REPLY • link 6.1 years ago by Leonor Palmeira 3.9k

score 1 · Answer 1 · 2018-12-11

1

6.0 years ago

Charlotte.Herzeel ▴ 150

We are happy to announce that a preprint of our new paper describing elPrep 4 is now available. See https://t.co/u6h12mQPhx

ADD COMMENT • link 6.0 years ago by Charlotte.Herzeel ▴ 150

1

The final version of our article was just published by PLOS One. See https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0209523

ADD REPLY • link 5.8 years ago by Charlotte.Herzeel ▴ 150

1

I read the paper. elPrep seems to be a very promising tool for post-alignment sequence data processing. I am going to try it out on WES somatic (paired and tumor-only) data. It is great that the GATK v4 best practices steps for post-alignment processing are included in a modular fashion in one single step. I am curious whether you are considering including indel realignment in elPrep. Some variant callers now include the indel realignment step as part of variant calling, but there are still some popular somatic variant callers that do not and that benefit from indel realignment prior to variant calling. Currently, indel realignment options are very limited. Thanks!

ADD REPLY • link 5.8 years ago by roysomak4 ▴ 40

0

Thanks a lot for trying elPrep! We will look into indel realignment, but I can't promise we will implement this soon.

ADD REPLY • link 5.8 years ago by Charlotte.Herzeel ▴ 150

0

Would you mind telling us which variant callers you are using that need indel realignment? Thanks.

ADD REPLY • link 5.8 years ago by Charlotte.Herzeel ▴ 150

score 0 · Answer 2 · 2019-06-04

0

5.5 years ago

Charlotte.Herzeel ▴ 150

Our article where we compare C++, Java, and Go for implementing elPrep has just been published by BMC Bioinformatics. This article describes the advantages and challenges we encountered in these languages when implementing a SAM/BAM tool and motivates why we ended up choosing Go for elPrep.

See: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2903-5

ADD COMMENT • link 5.5 years ago by Charlotte.Herzeel ▴ 150

0

We have published a follow-up article where we compare the ease of programming in all three programming languages, concluding that, in our opinion, it is much easier to implement a project such as elPrep in Go or Java rather than C++. This, in addition to its performance benefits, strengthened our motivation to use Go as an implementation language for elPrep.

See: https://journals.sagepub.com/doi/10.1177/1176934319869015

ADD REPLY • link 5.2 years ago by Charlotte.Herzeel ▴ 150

score 0 · Answer 3 · 2021-02-08

We are happy to announce the release of elPrep 5.0.0!

The major new feature of elPrep 5 is the addition of variant calling, which means that elPrep can now execute a full variant calling pipeline on its own, starting from an aligned BAM file and producing a VCF file. We follow the GATK HaplotypeCaller algorithm and produce identical results.

elPrep 5 is released as an open source project on GitHub:

https://github.com/exascience/elprep

Binaries can be downloaded via the following website:

https://www.imec-int.com/en/expertise/lifesciences/genomics/dna-sequence-analysis-software

We have also published a paper that explains the details of elPrep 5 and reports benchmark results, showing 8-16x speedup compared to GATK4:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244471

Please feel free to contact us with feedback or questions.