Freebayes-parallel with large bam file - individual threads running for >6 days - clearly doing something wrong

Context: I'm trying to call variants on a sequencing project using pooled genotyping-by-sequencing. Pools consist of 94 samples each, alongside a number of individuals. Sequence data was demultiplexed and then aligned to a reference genome using hisat2, and the resultant BAMs were merged with samtools merge. The problem BAM is an amalgamation of 44 of these pools and 11 individuals (all tagged with their unique read groups). It's a bit of a monster at about 36GB, but it needs to be merged into one file because the pools represent different populations that I'm comparing.
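For reference, the merge step would look something like this sketch (file names hypothetical; each input BAM is assumed to already carry its own read group, e.g. set at alignment time with hisat2's --rg-id/--rg options):

# merge all pool and individual BAMs into one, then index it
samtools merge -@ 8 MERGED.bam pool_*.bam individual_*.bam
samtools index MERGED.bam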

I've got a couple of other merged BAMs clocking in at ~4GB and 4.8GB that ran without issue, so the size is clearly a problem here, but I think I'm also doing something else wrong, or at least not doing this the most efficient way.

The command I'm running for freebayes-parallel is:

freebayes-parallel ./BAM_target.regions 120 -0 -f /REFERENCE.fasta -b ./SAMPLES.bam -K --min-coverage 48 -G 5 -F 0.01 -C 2 --limit-coverage 300 > ./OUTPUT.vcf

Target regions were created using the recommended process of creating a coverage file and splitting it into chunks of roughly equal size (sketched after the list below). This caused some issues with high-coverage regions being split into chunks of 1bp, which freebayes does not like. So far I've:

  • tried adjusting the number of chunks lower to get rid of 1bp target regions
  • pre-filtering out everything below the minimum and maximum coverage using awk before making the target regions file
  • increasing the number of threads until I owe the sysadmin biscuits (did I manage to fill 540GB of memory on a machine causing it to stop responding? Yes, yes I did)
  • filtering the BAM against a GFF of repetitive regions using bedtools subtract (ineffective, and it cut off the BAM's EOF marker)
  • currently running an extremely slow bedtools subtract on the BAM using a bedfile derived from the GFF of repetitive regions, to see if that does it
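As promised above, here is a sketch of the region-generation step with the coverage pre-filter folded in. It assumes the coverage_to_regions.py helper from freebayes' scripts/ directory and the usual bamtools coverage output columns (chrom, position, depth); file names and the chunk count are placeholders:

# per-base depth -> keep only 48-300x positions -> ~1000 coverage-balanced chunks
bamtools coverage -in SAMPLES.bam \
  | awk '$3 >= 48 && $3 <= 300' \
  | coverage_to_regions.py REFERENCE.fasta.fai 1000 \
  > BAM_target.regions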

I've also run into problems with freebayes-parallel warning me that it's running out of file handles (ulimit -n = 65535, and I'm the only user on either of the machines I'm running the code on). I know what that means but not how to resolve it without throwing away 3+ days of work, and I'm still holding out hope it's going to complete somehow.

The CPU usage on both machines started off at a nice 70% with lots of progress, but it has dwindled to 2-3%, and the output files haven't been updated in 3-6 days. I've got a couple of iterations running on each machine, from different BAM files, so I don't think that should be the issue here. It's as if it hits some of the higher-coverage regions of the genome and just grinds to a near halt. It is still working, I think, but it's beyond slow.

Does anyone see anything I'm doing obviously wrong? Is this to be expected?


If you are using an alignment-based approach, try running on one chromosome at a time. In the worst case you can run all the different chromosomes separately and then merge the VCFs. And have you considered trying plain freebayes (not freebayes-parallel), just in case? I know these are very naive suggestions, but I wouldn't know what else to do. You already tried removing with awk everything below 48x and above 300x coverage... that would have been my bet, actually; it is a pity it didn't help.
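A minimal sketch of that per-chromosome suggestion, assuming freebayes accepts a bare chromosome name for -r/--region and using bcftools concat to rejoin the pieces in reference order (all file names hypothetical):

# one freebayes job per chromosome listed in the .fai index
cut -f1 REFERENCE.fasta.fai | while read -r CHR; do
    freebayes -f REFERENCE.fasta -r "$CHR" SAMPLES.bam > "chr_${CHR}.vcf"
done

# concatenate the per-chromosome VCFs in the same order as the .fai
bcftools concat $(cut -f1 REFERENCE.fasta.fai | sed 's/^/chr_/; s/$/.vcf/') -o combined.vcf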


Thanks. Sadly the reference genome is pretty new and not divided into real chromosomes; it consists of 258094 "chromosomes", though I guess I could try binning them by length, because some are only a few thousand bp while others are much, much longer. If push comes to shove you might be onto something there, so I'll give it a go, thanks.
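A rough sketch of that binning idea, assuming the contig lengths in the reference's .fai index; the ~5 Mbp per bin is an arbitrary choice, and each resulting BED file could then feed its own freebayes run via --targets:

# sort contigs by length, then greedily pack them into BED files of ~5 Mbp each
sort -k2,2nr REFERENCE.fasta.fai \
  | awk 'BEGIN { bin = 0 }
         { if (sum + $2 > 5000000) { close("bin_" bin ".bed"); bin++; sum = 0 }
           sum += $2
           print $1 "\t0\t" $2 > ("bin_" bin ".bed") }'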

Still, I've just finished running bedtools subtract on the BAM (I'm getting lost among my various tearful attempts to get this to work, but I think previously I ran it on the bedfile of target regions rather than on the BAM itself), and the resultant BAM is about 75% the size of the starting one at least (2.7GB).
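As an aside, if subtract keeps producing truncated output, a hedged alternative (not what was used here) is bedtools intersect -v, which drops reads overlapping the repeat BED and, given BAM input, writes proper BAM output with the EOF marker intact (file names hypothetical):

# keep only reads with no overlap in the repeat BED, then re-index
bedtools intersect -v -abam SAMPLES.bam -b repeats.bed > SAMPLES.norepeats.bam
samtools index SAMPLES.norepeats.bam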

I'll try creating the coverage file and using awk to trim it down to just the regions with sufficient coverage before splitting it into target regions, and see if that looks any more reasonable before proceeding. I don't know how freebayes handles --limit-coverage, and whether regions hugely in excess of that get skipped or subsetted or something; that might be what's tying it up, but even with --limit-coverage 300, for 55 samples any region with deep coverage is still a lot to handle.


Another attempt... you could try running freebayes on the separate BAMs, without merging. I did this with about 100 samples of approx. 200MB each (totalling about 22GB), and I had no issues.


Ah, I did that initially but was told not to. It's not really an option due to the way I need to call and identify variants between the populations: separating them out risks excluding alleles that are polymorphic between populations rather than within them.

More importantly, I have an answer to why the problem occurs (more or less). I don't know why specific regions take days to complete, but I do know why they cause the whole thing to grind to a halt. The part of the freebayes-parallel script that applies GNU parallel to freebayes is as follows:

(
#$command | head -100 | grep "^#" # generate header
# iterate over regions using gnu parallel to dispatch jobs
cat "$regionsfile" | parallel -k -j "$ncpus" "${command[@]}" --region {}
) | vcffirstheader \
    | vcfstreamsort -w 1000 | vcfuniq # remove duplicates at region edges

The important bit there is parallel -k, which means "keep order". The tl;dr is that parallel has a limited pool of names for the temporary files it uses, and with -k those names have to be recycled in job order as jobs complete. So if it's got 30 available names and the task on name 4 is stuck, then even though tasks 1-3 and 5-30 have all finished, it cannot initialise any more tasks: it has already rolled over names 1, 2, and 3, and it can't jump ahead to 5, because the next name it assigns has to be 4.
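A toy illustration of the ordering constraint (nothing freebayes-specific): job 1 sleeps the longest, and with -k no output appears until it finishes, even though jobs 2-5 are done almost immediately; drop the -k and the fast jobs' output streams out as they complete:

seq 1 5 | parallel -k -j 5 'sleep $((6 - {})); echo region {} done'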

Incidentally, thanks to fat fingers hitting the wrong key, I've just found out that I can identify which regions are sticking! When running top, hitting c expands the command column, which gives me something like the line below, meaning I know this slower task is running on region 620374:450-452:

freebayes -0 -f /scratch/tobyi/reference/Feov_v1.0.fasta -b ./3.bam -K --min-coverage 48 -G 5 -F 0.01 -C 2 --limit-coverage 300 --region 620374:450-452

The big question is whether -k is necessary for freebayes-parallel or not. Dropping it won't actually fix the problematic regions taking forever, but if it doesn't cause the whole VCF output to break, you could probably let the run churn through everything that doesn't take a week to analyse and come back to the problematic regions.

Edit: for anyone who finds this, there is a workaround:

You split the workstream: run freebayes with parallel but don't keep order, writing the many per-region VCF files to a designated tmp directory. This allows the whole thing to complete much faster, as slower regions no longer prevent faster regions from completing. (Example: if you've got 5000 regions, most of which complete in a few minutes but 10 of which take 12 hours, running with parallel -k may take ~5 days to complete. Without -k it'll likely take less than 24 hours, probably closer to 12.)

You might still have to wait around a while for those few to finish, but the whole process no longer gets repeatedly stopped on them. Depending on the size of the regions you may want to either exclude them or manually break them down into a number of smaller regions and rerun those, outputting to the same directory (a sketch of that splitting step follows at the end).

Once you've done this, you pipe the contents of the directory through the second part of the script, and if you've named the outputs after their target regions they'll be in order; it takes next to no time to combine them into one big VCF:

cat [target regions file] | parallel -j [threads] freebayes [your freebayes arguments here] \
    --region {} '>' /PATH/TO/TMP/{}.vcf

cat /PATH/TO/TMP/*.vcf | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq > [final vcf name].vcf

vcfstreamsort can only sort streamed-in VCF records that are within a certain window of each other, hence the need for -k in the freebayes-parallel script; however, once your many small files have been made they should already be in order, so they can be run straight through that pipe with no issue.
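And here is the splitting step mentioned above, as a hypothetical sketch: it chops one stuck region into fixed-size windows and reruns freebayes (with the same arguments as the original command) on each window, writing into the same tmp directory so the concatenation step above picks the pieces up. The region, window size, and file names are all placeholders:

REGION="contig123:100000-200000"   # hypothetical stuck region
STEP=10000                         # arbitrary window size
CHROM=${REGION%%:*}
START=${REGION#*:}; START=${START%-*}
END=${REGION##*-}

# rerun freebayes on each window of the stuck region
for (( s = START; s < END; s += STEP )); do
    e=$(( s + STEP < END ? s + STEP : END ))
    freebayes -0 -f /REFERENCE.fasta -b ./SAMPLES.bam -K --min-coverage 48 \
        -G 5 -F 0.01 -C 2 --limit-coverage 300 \
        --region "${CHROM}:${s}-${e}" > "/PATH/TO/TMP/${CHROM}:${s}-${e}.vcf"
done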

