Question

Output checks during Snakemake pipeline

3 months ago

hpapoli ▴ 150

Hello,

I've seen Snakemake pipelines, for example for variant calling, that start with FASTQ files, run a quality check, trim the reads, and go all the way to producing a PCA plot.

What I have trouble understanding is how one checks the outputs of the various steps that need to be assessed before deciding on the next step. For example, after the FASTQ quality check, one needs to visually assess the read quality, decide on filtering, check the quality again, and only then proceed to mapping. The situation is similar after mapping and again after variant calling.

To me, it seems best to have multiple workflows (for quality check, for mapping, for variant calling, etc.) so the output of each step can be checked first.

Of course, it would be wonderful to have one file for a pipeline from start to finish, but I'm wondering if I'm missing something: for example, is there a way to do these checks while keeping it all in one workflow? How do you usually deal with this?

Thanks so much!

workflow-management snakemake • 456 views

score 2 · Answer 1 · 2024-09-02

3 months ago

Pierre Lindenbaum 164k

I think with Snakemake you can specify the name of a target rule on the command line, i.e. "snakemake targetrulename". So you can call the target for the QC, check the results, remove some samples if needed, and re-launch the whole workflow.
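To illustrate why requesting a QC target stops the run before mapping, here is a minimal sketch in plain Python. It is a toy model of dependency resolution, not Snakemake's implementation, and the rule names and dependencies are hypothetical.

```python
# Toy model of target selection; NOT Snakemake's actual implementation.
# Rule names and their dependencies are hypothetical.
RULES = {
    "fastqc": [],
    "multiqc": ["fastqc"],
    "trim": [],
    "map": ["trim"],
    "call_variants": ["map"],
    "pca": ["call_variants"],
}


def rules_needed(target, rules=RULES):
    """Return the rules required to build `target`, dependencies first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for dep in rules[name]:
            visit(dep)
        ordered.append(name)

    visit(target)
    return ordered


# Asking for the QC target never touches mapping or variant calling.
print(rules_needed("multiqc"))  # ['fastqc', 'multiqc']
# Asking for the final target pulls in everything upstream of it.
print(rules_needed("pca"))      # ['trim', 'map', 'call_variants', 'pca']
```

With a real Snakefile the equivalent would be something like "snakemake --cores 1 multiqc", then a review of the report, then "snakemake --cores 1 pca", where both rule names are again hypothetical.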

You can also make the QC reports of each step output files of your workflow, and use these to decide which samples to remove afterward.

But one of the points of using a workflow manager is to automate a process and ensure reproducibility. The more manual filtering steps you introduce, the less your analysis will fulfill these two criteria.
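One way to keep sample removal reproducible is to record the decision as explicit thresholds applied to a QC summary table. The following is a minimal sketch under that assumption; the column names, threshold values, and sample data are all hypothetical, and in practice the table would be a file produced by a QC rule.

```python
import csv
import io

# Hypothetical QC summary; in a real workflow this would be a file on disk.
QC_TSV = """sample\tmean_quality\tpct_duplication
S1\t35.2\t12.0
S2\t21.4\t48.5
S3\t33.9\t15.3
"""

# Hypothetical thresholds, kept in one place so the decision is documented.
THRESHOLDS = {"min_mean_quality": 30.0, "max_pct_duplication": 40.0}


def passing_samples(handle, thresholds):
    """Return the names of samples whose QC metrics satisfy every threshold."""
    keep = []
    for row in csv.DictReader(handle, delimiter="\t"):
        quality_ok = float(row["mean_quality"]) >= thresholds["min_mean_quality"]
        dup_ok = float(row["pct_duplication"]) <= thresholds["max_pct_duplication"]
        if quality_ok and dup_ok:
            keep.append(row["sample"])
    return keep


print(passing_samples(io.StringIO(QC_TSV), THRESHOLDS))  # ['S1', 'S3']
```

The thresholds could live in the workflow's config file, so that changing them is a tracked edit rather than an undocumented manual step.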

Reply • 3 months ago by raphael.B ▴ 520

I understand, but most of the time we don't remove samples. We need to filter them, re-run the quality check, and repeat this iteratively until we get clean data.

Reply • 3 months ago by hpapoli ▴ 150

score 1 · Answer 2 · 2024-09-02

While our Snakemake pipeline can be fully automated, we decided that data quality assessment is an important business decision. Therefore, we put manual effort into reviewing QC reports before deploying downstream analytical workflows. I wrote a bit more about our approach in this AWS blog post a few years ago (see figure 2): https://aws.amazon.com/blogs/startups/how-stoke-therapeutics-is-turbo-charging-drug-discovery-using-snakemake-and-aws/

Human capital is significantly more expensive than computing costs, so we automate as much as possible. The manual review happens in the parts of the pipeline where a sample needs to be processed together with other samples. For instance, we run STAR's first pass regardless of the sequencing quality. We then pause the pipeline and review the QC reports to decide whether a sample is good enough to move forward to STAR's second pass.
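A pause of this kind can be modelled as a reviewer-maintained decision table that the downstream step reads before it runs. The sketch below is an assumption about how such a gate might look, not a description of the pipeline in the blog post; the file layout, column names, and sample IDs are hypothetical.

```python
import csv
import io


def approved_samples(handle):
    """Return samples marked 'pass' in a reviewer-maintained decision table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["sample"] for row in reader
            if row["decision"].strip().lower() == "pass"]


def second_pass_targets(all_samples, handle):
    """Split samples into those cleared for the second pass and those held back.

    Samples with no recorded decision are held back rather than assumed good.
    """
    approved = set(approved_samples(handle))
    go = [s for s in all_samples if s in approved]
    hold = [s for s in all_samples if s not in approved]
    return go, hold


# Hypothetical review decisions; S4 has not been reviewed yet.
DECISIONS = "sample\tdecision\nS1\tpass\nS2\tfail\nS3\tpass\n"
go, hold = second_pass_targets(["S1", "S2", "S3", "S4"], io.StringIO(DECISIONS))
print(go)    # ['S1', 'S3']
print(hold)  # ['S2', 'S4']
```

Keeping the decision table under version control alongside the workflow leaves a record of who approved what, which recovers some of the reproducibility that a manual step otherwise costs.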