Jordi · 3.4 years ago
Hi,
I am building a clinical genetics germline variant detection secondary-analysis pipeline from scratch, mainly for WES and WGS data, following GATK's Best Practices guidelines.
The resulting pipeline involves the following tools:
Picard FastqToSam
Picard MarkIlluminaAdapters
Picard SamToFastq
BWA-MEM
Picard MergeBamAlignment
Picard MarkDuplicates
GATK BaseRecalibrator (based on the dbSNP common variants VCF file)
GATK ApplyBQSR
Picard ValidateSamFile
GATK HaplotypeCaller
GATK CNNScoreVariants
GATK FilterVariantTranches
GATK Funcotator
I will extend the pipeline to include SV/CNV calling in the near future; however, I wanted input from the community on whether any of the steps listed here is redundant and, if so, why. All of these steps are computationally intensive and take a significant amount of time to complete on WGS data.
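To make the data flow between the listed tools explicit, here is a minimal sketch of the step order with a one-line note on what each step consumes and produces. This is only an illustration of the sequence described above; it does not include the actual command-line arguments, which depend on your reference bundle and sample naming:

```python
# Ordered sketch of the per-sample steps listed above.
# The descriptions summarize each tool's role in the data flow
# (uBAM -> FASTQ -> aligned BAM -> recalibrated BAM -> VCF).
STEPS = [
    ("picard FastqToSam", "FASTQ -> unmapped BAM (uBAM) with read-group metadata"),
    ("picard MarkIlluminaAdapters", "flag adapter sequence in the uBAM"),
    ("picard SamToFastq", "convert back to FASTQ for alignment"),
    ("bwa mem", "align reads against the reference"),
    ("picard MergeBamAlignment", "merge the aligned BAM with the uBAM metadata"),
    ("picard MarkDuplicates", "mark PCR/optical duplicates"),
    ("gatk BaseRecalibrator", "model base-quality errors from known sites (dbSNP)"),
    ("gatk ApplyBQSR", "apply the recalibration table to the BAM"),
    ("picard ValidateSamFile", "sanity-check the final analysis-ready BAM"),
    ("gatk HaplotypeCaller", "call germline SNVs/indels into a VCF"),
    ("gatk CNNScoreVariants", "annotate variants with a CNN-based score"),
    ("gatk FilterVariantTranches", "filter variants on the CNN score tranches"),
    ("gatk Funcotator", "functionally annotate the filtered VCF"),
]

for i, (tool, purpose) in enumerate(STEPS, 1):
    print(f"{i:2d}. {tool:30s} {purpose}")
```

Laying the steps out this way also makes it easier to see which ones are candidates for dropping: for example, the FastqToSam/MarkIlluminaAdapters/SamToFastq/MergeBamAlignment detour exists only to preserve read-group metadata and adapter marking through alignment.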
Why reinvent the wheel? There are dozens of these kinds of pipelines already available, for example sarek from nf-core (Nextflow), which builds upon the GATK Best Practices: https://github.com/nf-core/sarek
Do you really want to build these very common things from scratch?
Not really re-inventing anything. We have been using an old pipeline for years now. It is time to update to the latest versions and recommendations. We want to have more control over the workflow and be able to tweak it, if necessary.
Ready-made pipelines often become outdated quickly, and managing versions and libraries can be a problem.
I will look into it, though.
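One practical payoff of keeping control over a hand-rolled workflow is cheap resumability: skip a step if its output already exists, instead of rerunning hours of WGS processing after a failure. Below is a minimal sketch of such a step runner; the function name and the commented example paths are hypothetical placeholders, not part of any real pipeline:

```python
import subprocess
from pathlib import Path


def run_step(name: str, cmd: list[str], output: str) -> bool:
    """Run one pipeline step unless its output already exists (crude resume).

    Returns True if the step was executed, False if it was skipped.
    """
    if Path(output).exists():
        print(f"[skip] {name}: {output} already present")
        return False
    print(f"[run ] {name}: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)  # raise if the tool exits non-zero
    return True


# Hypothetical usage (paths and arguments are placeholders):
# run_step("MarkDuplicates",
#          ["picard", "MarkDuplicates", "I=merged.bam",
#           "O=dedup.bam", "M=dup_metrics.txt"],
#          "dedup.bam")
```

Workflow engines like Nextflow give you this (and caching, containers, and scheduling) for free, which is part of the argument for starting from something like sarek even if you end up customizing it.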
nf-core pipelines are actively maintained, and you can either use the provided container images or build your own, so software versions are not an issue. I do see your point about wanting control, though. Maybe you can simply use it as a template to draw inspiration from.