Question

bcftools mpileup parallel run in BASH - creating too many files

0

Entering edit mode

2.8 years ago

Jamie ▴ 10

Hi, I have been modifying this script and it runs, allowing me to produce the desired output. However, I also get unwanted output as well. For example, the script seems to take the ends of files in the current directory (.txt, .bcf, .bam) and combines them into files ending .txt.bam or .bam.bcf. So the problem is that it creates a large number of excess files which are not needed.

#!/bin/bash

set - eu


genome=/storage1/ref_genome/GRCh38.dxv.fa
export genome


function parallel_call {
    bcftools mpileup \
        --fasta-ref ${genome} \
        --regions $2 \
        --output-type u \
        $1 | \
    bcftools call --multiallelic-caller \
              --variants-only \
              --output-type u - > ${1/.bam/}.$2.bcf
}

export -f parallel_call


function bam_chromosomes {
        samtools idxstats $1 | cut -f 1 | grep -w -f search_file.txt
}

export -f bam_chromosomes



chrom_set=`bam_chromosomes test_file.bam`
parallel --verbose -j 90% parallel_call test_file.bam ::: ${chrom_set}

I have tried specifying to grep the exact contigs I want (chr1,chr2,chr3 ... chr22) but am not sure how to get around this problem. Any ideas would be great. The original code was found at: https://www.danielecook.com/using-gnu-parallel-for-bioinformatics/

BASH BCFtools parallel • 1.5k views

ADD COMMENT • link 2.8 years ago by Jamie ▴ 10

0

Entering edit mode

I'm not sure parallel is the right tool here. While it's powerful you should have look at Make and the workflow managers like snakemake or nextflow

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

see Is it possible to run variant calling software in parallel, are there any Shell and Python/R code examples available? for an example with make

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

cross-posted: https://stackoverflow.com/questions/71307235

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

score 0 · Answer 1 · 2022-03-01

0

Entering edit mode

2.8 years ago

Istvan Albert 102k

You are instructing the tool to generate all possible combinations:

parallel echo {} ::: A B C ::: 1 2 3

prints:

A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3

then inside the function, you are saving into each of your combinations. Hence the many files you observe.

So the output is what you instructed it to do. It is not clear from your description what you want. Would you rather have it skip some chromosomes?

If you wish to merge the results then you will need to add another step that does just that with say bcftools concat that concatenates the same samples for different chromosomes. Then you can delete the intermediate files.

PS: note that your invocation seems incorrect, you are including test_bam.bam in both inputs and that probably you meant to write :::: bam_chromosomes to get the input from a file.

ADD COMMENT • link 2.8 years ago by Istvan Albert 102k

0

Entering edit mode

Okay I see, I'll merge only desired results using bcftools concat and remove the intermediate files. Thanks, I'll edit the test_bam.bam inputs.

One of the problems (which is not obvious from here) was the contents of search_file.txt. I had two extra spaces at the end of the file and for some reason, when running the following command:

samtools idxstats test_bam.bam | cut -f 1 | grep -w -f search_file.txt

The output has a * at the end of the list.

Correcting this error and encasing all instances of $1 and $2 in "" seems to have the desired result.

ADD REPLY • link 2.8 years ago by Jamie ▴ 10