How to parallelize bcftools mpileup with GNU parallel?
4.8 years ago
beausoleilmo ▴ 600

In this thread I saw a line that would run on multiple threads on a computer:

parallel 'bcftools mpileup -Ou -f ref.fa -r {} in.bam | bcftools call -m -v -Oz -o {}.vcf.gz' ::: {1..22} X Y

Using the -r flag ("-r, --regions chr|chr:pos|chr:from-to|chr:from-[,…]") it is possible to specify a region to look at. But my reference genome is not separated into chromosomes; it is assembled only to scaffold level (see the list here).

I get that the command divides the task between the 24 chromosomes ({1..22} X Y), but how would it be done if I don't have that for my genome? Also, I'm not sure I understand what the {}.vcf.gz, or simply the {}, means in the output file name. Should I put it in the <output.bcf>? Finally, the --jobs flag in parallel determines the number of simultaneous jobs, but since it's not set here, how do I know how many cores it's using (I have access to 16 cores max)?

This is how I think it should look, but how do I specify the division of the task (is it better to list the scaffolds, even when there are 1000s of them)? Do I need to put a {} in the output file?

parallel 'bcftools mpileup -C 50 -E -f <ref.genome> --regions {} -b <bam_list>  > <output.bcf>' ::: {SCAFFOLDS??}
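For intuition about the placeholder: GNU parallel substitutes each argument after ::: into {} and runs one command per argument, up to -j jobs at once (by default, one job per CPU core). A plain loop can print the same command lines parallel would execute, minus the parallelism (file names here are just the ones from the example above):

```shell
# Each {} is replaced by one item from the ::: list; this loop only
# prints the commands GNU parallel would run, one per region.
for region in 1 2 X Y; do
  echo "bcftools mpileup -Ou -f ref.fa -r $region in.bam | bcftools call -m -v -Oz -o $region.vcf.gz"
done
```

So {}.vcf.gz expands to 1.vcf.gz, 2.vcf.gz, X.vcf.gz, and so on: one output file per region, to be concatenated afterwards.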
same comment as previously: C: Running BWA mem in a for loop

4.8 years ago
ATpoint 85k

The trick here is to use random access (via the BAM index) to pull out only the reads you want from the BAM file in order to save time. This code snippet will parallelize mpileup per scaffold, given you have a BED file with the chr-start-end. I only did minimal testing, and the script looks a bit bulky, but it should work.

The idea is the following: samtools view uses the region information to extract only the overlapping reads and sends them to mpileup. Therefore mpileup processes exactly these reads, in your case the contigs one by one, parallelized via GNU parallel. Eventually, the intermediate BCFs are concatenated.
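One detail worth checking against your own BED file: BED coordinates are 0-based half-open, while samtools region strings are 1-based inclusive, so a strict conversion adds 1 to the start column. A self-contained sketch of that conversion, using two hypothetical scaffold lines:

```shell
# Hypothetical 3-column BED input (chrom  start  end); the $2+1 turns
# 0-based BED starts into 1-based samtools region coordinates.
printf 'JH739887\t0\t30495534\nJH739888\t0\t29527584\n' \
  | awk '{print $1":"$2+1"-"$3}'
```

For whole contigs starting at 0 the off-by-one is usually harmless, but the +1 variant is the safe one.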

YOUR_BAM="your.bam"
CONTIGS="your_file_with_the_contigs.bed"
NUMBER_OF_JOBS=8  # no spaces around "=" in shell assignments
## Index bam file:
samtools index ${YOUR_BAM}

## A function for bcftools where samtools view extracts only the indicated regions:
function MPILEUP {

  BAM=$1 
  REGION=$2
  samtools view -bu -r ${REGION} $BAM \
  | bcftools mpileup (options...) -o output_fromBAM_${BAM%.bam}_region_${REGION}.bcf -O b -

}; export -f MPILEUP

cat $CONTIGS | awk '{print $1":"$2"-"$3}' | parallel -j ${NUMBER_OF_JOBS} "MPILEUP ${YOUR_BAM} {}"

## Concatenate everything:
bcftools concat -O b output_fromBAM_${YOUR_BAM%.bam}_region_*.bcf > concat_fromBAM_${YOUR_BAM%.bam}.bcf

## remove temporary files:
rm output_fromBAM_${YOUR_BAM%.bam}_region_*.bcf

Here NUMBER_OF_JOBS would be 8, so 8 parallel jobs working on the task. I would check with top whether your hard drive can handle this in terms of I/O bottleneck. This works well on an HPC; not so sure on a workstation with an HDD. Also, since mpileup reads the BAM from stdin, the sample name in the 10th column of the VCF/BCF will be -, so you would need to check whether mpileup can accept an external sample name via the command line, or you'll have to awk it in manually.


Hi, sorry to re-open a relatively old post, but this looks like exactly what I need. How would I alter this script to pass 2 BAM files into bcftools mpileup? Thank you so much for your help.


Just a quick note for anyone using this in the future: the -r flag of samtools view specifies the read group (at least in samtools v1.13), not the region, so it looks like you will need to use samtools view -bu $BAM ${REGION} instead of samtools view -bu -r ${REGION} $BAM.

4.8 years ago
beausoleilmo ▴ 600

I think that this would be a solution:

I started with a file that contains the scaffold name and lengths:

JH739887    30495534
JH739888    29527584
JH739889    22321128
[...]

I then added a column with "1" in it to make it the "start" position. Since the scaffolds are not placed on the genome, they all start at 1 (at least from what I understand):

awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes

Remove all the spaces to create a tab-delimited file:

tr ' ' '\t' < chrom.start.stop.sizes > chrom.start.stop_tabs.sizes

The file should look like this:

JH739887    1   30495534
JH739888    1   29527584
JH739889    1   22321128
[...]
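Both steps can be checked on a single demo line (awk rebuilds the record with a space as the output separator, which is why the tr pass is needed):

```shell
# One demo line in chrom.sizes format: name <tab> length.
# awk inserts "1" after field 1; tr restores tab delimiters.
printf 'JH739887\t30495534\n' | awk '$1 = $1 FS "1"' | tr ' ' '\t'
```

An equivalent single step, assuming a clean two-column input, would be awk -v OFS='\t' '{print $1, 1, $2}' chrom.sizes > chrom.start.stop_tabs.sizes.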

Then split (gsplit, because I'm on a Mac!) the file so that each chunk contains about 2724 scaffolds (the total number of scaffolds is 27239, and 27239/10 ≈ 2724; with l/10 the exact per-file counts vary a little, as shown below).

gsplit --numeric-suffixes=1 -n l/10 --additional-suffix='.txt' chrom.start.stop_tabs.sizes scaff

The first file should have about 2724 lines, each corresponding to a scaffold, and look like this:

JH739887    1   30495534
JH739888    1   29527584
JH739889    1   22321128
[...]

When I used wc -l on the 10 files, I got:

2501 + 2618 + 2618 + 2749 + 2792 + 2792 + 2792 + 2793 + 2792 + 2792 = 27239 total
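The -n l/10 mode splits by lines into 10 chunks without breaking any line, which is why the per-file counts vary slightly. A self-contained check with a dummy 27239-line file (GNU split; gsplit on macOS):

```shell
# Work in a scratch directory with a stand-in for the scaffold list.
tmp=$(mktemp -d) && cd "$tmp"
seq 27239 > chrom.demo
split --numeric-suffixes=1 -n l/10 --additional-suffix='.txt' chrom.demo scaff
cat scaff*.txt | wc -l   # all 27239 lines are still there
ls scaff*.txt | wc -l    # 10 chunk files: scaff01.txt ... scaff10.txt
```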

Then, after the generation of bam files AND the samtools indexing of the bam files, use this code to make it parallel.

parallel -j 15 'bcftools mpileup -Ou -f ref.fa -R scaff$(printf "%02d" {}).txt -b bam_list.txt | bcftools call -m -v -Oz -o myfile{}.vcf.gz' ::: {1..10}

Some explanations:

-j 15: use 15 cores.
-R scaff$(printf "%02d" {}).txt: builds the file name with zero-padding (so it's going to be scaff01.txt, etc.).
{}: a placeholder in parallel, replaced by each item after the :::, here {1..10}.
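The zero-padding is needed because gsplit named the chunks scaff01.txt through scaff10.txt. A quick check of the substitution:

```shell
# printf "%02d" pads single-digit numbers with a leading zero.
for i in 1 5 10; do printf "scaff%02d.txt\n" "$i"; done
```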

Improvement for the future: I realized that the scaffolds were ordered by size, so the scaffolds at the end of the list will run VERY fast and the ones at the top of the list VERY slowly. Mixing the scaffolds across the files before gsplit would be nice. In the end, this might not be ideal, but at least it's a solution for people who want to experiment with this code!

UPDATE: To shuffle, do this:

Before the line:

awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes

Add:

shuf chrom.sizes > chrom.start.stop.sizes.shuffled

And change the file name accordingly. You could set a seed by looking at this post; I paste the function here:

get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)
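The point of the seeded source is reproducibility: the same seed yields the same shuffle order on every run (this assumes bash, for the process substitution, and an available openssl):

```shell
# Deterministic pseudo-random byte stream keyed on a seed.
get_seeded_random() {
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt </dev/zero 2>/dev/null
}

# Two shuffles with the same seed produce identical output.
a=$(shuf -i1-100 --random-source=<(get_seeded_random 42))
b=$(shuf -i1-100 --random-source=<(get_seeded_random 42))
[ "$a" = "$b" ] && echo "same seed, same order"
```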

Concatenating the files: at the end of the parallel command, there will be 10 different files that need to be put together. Here is a small script that'll do it:

for i in *.vcf.gz
do
  echo "$(basename "$i")"
done | sort -V >> vcf_list.txt

bcftools concat --file-list vcf_list.txt -Ob --output output.bcf
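The -V (version sort) matters here: a plain lexicographic sort would place myfile10.vcf.gz before myfile2.vcf.gz. A quick illustration with hypothetical file names:

```shell
# sort -V orders embedded numbers numerically, not character by character.
printf 'myfile10.vcf.gz\nmyfile2.vcf.gz\nmyfile1.vcf.gz\n' | sort -V
```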