I think that this would be a solution:
I started with a file that contains the scaffold name and lengths:
JH739887 30495534
JH739888 29527584
JH739889 22321128
[...]
I then added a column containing "1" to serve as the "start" position. Since the scaffolds are not placed on the genome, they all start at 1 (at least as I understand it):
awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes
Then replace the spaces with tabs to create a tab-delimited file:
tr ' ' '\t' < chrom.start.stop.sizes > chrom.start.stop_tabs.sizes
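As a side note, the two steps above (adding the start column and converting to tabs) could be combined into a single awk call by setting the output field separator. A minimal sketch on a toy chrom.sizes (the data here is made up for illustration):

```shell
# Toy chrom.sizes: scaffold name and length, space-separated (example data).
printf 'JH739887 30495534\nJH739888 29527584\n' > chrom.sizes

# Insert "1" as the start column and emit tab-delimited output directly,
# skipping the separate tr step.
awk -v OFS='\t' '{print $1, 1, $2}' chrom.sizes > chrom.start.stop_tabs.sizes

cat chrom.start.stop_tabs.sizes
```

This produces the same three-column, tab-delimited file in one pass.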
The file should look like this:
JH739887 1 30495534
JH739888 1 29527584
JH739889 1 22321128
[...]
Then split the file (gsplit, because I'm on a Mac!) so that each piece contains about 2724 scaffolds (there are 27239 scaffolds in total, and 27239/10 ~ 2724, so the last file will have one scaffold fewer than the others).
gsplit --numeric-suffixes=1 -n l/10 --additional-suffix='.txt' chrom.start.stop_tabs.sizes scaff
The first file should have about 2724 lines, one per scaffold, and look like this:
JH739887 1 30495534
JH739888 1 29527584
JH739889 1 22321128
[...]
When I ran wc -l on the 10 files, I got:
2501 + 2618 + 2618 + 2749 + 2792 + 2792 + 2792 + 2793 + 2792 + 2792 = 27239 total
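On Linux the same command works with plain split (gsplit is just the GNU coreutils version that Homebrew installs on macOS). A quick sanity check that the pieces add back up to the original, sketched on a toy file (27 lines standing in for the 27239 scaffolds):

```shell
# Toy input: 27 numbered lines (hypothetical stand-in for the scaffold list).
seq 1 27 > toy.sizes

# Same splitting scheme as above: 10 pieces by lines, numeric suffixes.
split --numeric-suffixes=1 -n l/10 --additional-suffix='.txt' toy.sizes toyscaff

# The pieces should contain every line of the original, none lost.
cat toyscaff*.txt | wc -l
```

If the total printed here differs from the original line count, something went wrong with the split.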
Then, after generating the BAM files AND indexing them with samtools, use this command to run the variant calling in parallel:
parallel -j 15 'bcftools mpileup -Ou -f ref.fa -R scaff$(printf "%02d" {}).txt -b bam_list.txt | bcftools call -m -v -Oz -o myfile{}.vcf.gz' ::: {1..10}
Some explanations:
-j 15: run up to 15 jobs at once (only 10 are launched here, one per scaffold file)
-R scaff$(printf "%02d" {}).txt: builds the region file name with zero padding (so scaff01.txt, scaff02.txt, etc.)
{}: a placeholder that parallel replaces with each value listed after the :::, here {1..10}.
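To preview which file names the zero-padded printf will generate (and hence which region files parallel will look for), a quick sketch of the same substitution in a plain loop:

```shell
# Print the region-file name for each of the 10 jobs, zero-padded to two
# digits, exactly as the $(printf "%02d" {}) in the parallel command does.
for i in $(seq 1 10); do
  printf 'scaff%02d.txt\n' "$i"
done
```

This should list scaff01.txt through scaff10.txt, matching the names gsplit produced.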
Improvement in the future:
I realized that the scaffolds were ordered by size, so the files at the end of the list will run VERY fast and the ones at the top VERY slowly. Shuffling the scaffolds across the files before gsplit would therefore balance the load better. In the end this might not be ideal, but at least it's a starting point for people who want to experiment with this code!
UPDATE: To shuffle, do this:
Before the line:
awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes
Add:
shuf chrom.sizes > chrom.start.stop.sizes.shuffled
And change the file name in the subsequent commands accordingly. You could also set a seed by following this post, from which I paste the function here:
get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}
shuf -i1-100 --random-source=<(get_seeded_random 42)
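Putting it together for the scaffold file: the same seed reproduces the same ordering, so the split is repeatable. A self-contained sketch (assuming openssl is installed; the chrom.sizes content is toy data):

```shell
# Deterministic pseudo-random byte stream keyed on a seed
# (recipe from the GNU coreutils manual).
get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

# Toy chrom.sizes (hypothetical data).
printf 'JH739887 30495534\nJH739888 29527584\nJH739889 22321128\n' > chrom.sizes

# Shuffle reproducibly: seed 42 always yields the same order.
shuf --random-source=<(get_seeded_random 42) chrom.sizes > chrom.sizes.shuffled
```

Running the shuf line twice with the same seed gives identical output, so a collaborator can reproduce the exact same file splits.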
Concatenating the files: at the end of the parallel command, there will be 10 different files that need to be put together. Here is a small script that'll do it:
for i in *.vcf.gz
do
  basename "$i"
done | sort -V > vcf_list.txt
bcftools concat --file-list vcf_list.txt -Ob --output output.bcf
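The -V (version sort) matters here because a plain lexicographic sort would place myfile10.vcf.gz right after myfile1.vcf.gz instead of last. A quick illustration with the hypothetical file names:

```shell
# Plain byte-wise sort orders the names lexicographically,
# so 10 sorts before 2.
printf 'myfile%d.vcf.gz\n' 1 2 10 | LC_ALL=C sort
# myfile1.vcf.gz
# myfile10.vcf.gz
# myfile2.vcf.gz

# Version sort keeps the numeric order the jobs were run in.
printf 'myfile%d.vcf.gz\n' 1 2 10 | sort -V
# myfile1.vcf.gz
# myfile2.vcf.gz
# myfile10.vcf.gz
```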
Same comment as in my previous answer: C: Running BWA mem in a for loop