Hi, I have been modifying this script and it runs, allowing me to produce the desired output. However, I also get unwanted output as well. For example, the script seems to take the ends of files in the current directory (.txt, .bcf, .bam) and combines them into files ending .txt.bam or .bam.bcf. So the problem is that it creates a large number of excess files which are not needed.
#!/bin/bash
set - eu
genome=/storage1/ref_genome/GRCh38.dxv.fa
export genome
function parallel_call {
bcftools mpileup \
--fasta-ref ${genome} \
--regions $2 \
--output-type u \
$1 | \
bcftools call --multiallelic-caller \
--variants-only \
--output-type u - > ${1/.bam/}.$2.bcf
}
export -f parallel_call
function bam_chromosomes {
samtools idxstats $1 | cut -f 1 | grep -w -f search_file.txt
}
export -f bam_chromosomes
chrom_set=`bam_chromosomes test_file.bam`
parallel --verbose -j 90% parallel_call test_file.bam ::: ${chrom_set}
I have tried specifying to grep the exact contigs I want (chr1,chr2,chr3 ... chr22) but am not sure how to get around this problem. Any ideas would be great. The original code was found at: https://www.danielecook.com/using-gnu-parallel-for-bioinformatics/
I'm not sure parallel is the right tool here. While it's powerful you should have look at
Make
and the workflow managers likesnakemake
ornextflow
see Is it possible to run variant calling software in parallel, are there any Shell and Python/R code examples available? for an example with
make
cross-posted: https://stackoverflow.com/questions/71307235