How to solve Freebayes memory issue
Asked 4 months ago

Hello,

I am running a FreeBayes variant-calling pipeline. I have 1,440 BAM files from genotyping-by-sequencing (GBS) data, and the genome size is around 540 Mb. I am using the following SLURM job script to run the job:

#!/bin/bash
#SBATCH --comment=19900416
#SBATCH --qos=std
#SBATCH --time=10-00:00:00
#SBATCH --nodes=1
#SBATCH --mem=240G
#SBATCH --cpus-per-task=32
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Freebayes.sh
#SBATCH --mail-type=ALL

# module load
module load freebayes/1.3.6 parallel vcflib python/2.7.15 bcftools

/home/freebayes-parallel <(cat /home/regions.txt) 32 \
    --report-genotype-likelihood-max --genotype-qualities --strict-vcf \
    --ploidy 2 -f /home/genome.fasta --max-complex-gap 1 \
    --min-alternate-fraction 0.2 --min-alternate-count 0 --min-coverage 0 \
    --min-mapping-quality 1 --use-best-n-alleles 4 --min-base-quality 10 \
    --report-monomorphic --bam-list /fastq/BAM/BAM_RG/MarkDup/gVCF/Bam.list \
    > /home/Freebayes_raw.vcf

After running for a day and producing an 80 GB VCF file, I got an out-of-memory (OOM) error.

Then I increased the memory request to 320G, dropped the --genotype-qualities parameter, and added --throw-away-complex-obs. It again produced a similar-sized VCF file, took less time, and then threw the same out-of-memory error. The BAM list contains the paths of all 1,440 BAM files. I also made 100 kb chunks of the reference genome using the fasta_generate_regions.py script.

Can anyone please suggest how to overcome the OOM error? Should I increase the memory to 400G? Or can I make subsets of the BAM file list and merge all the VCFs at the end? I guess that approach is not recommended.

Tags: NGS • freebayes • SNP
Comment:

I don't know whether you can install packages on your cluster (e.g. using conda) or run Singularity images, but in general it is better to use software that is as up to date as practical. This does not guarantee that the problem will disappear, but developers are usually more likely to respond to issues in the current release. The current FreeBayes version on conda is 1.3.7 (no big difference), but it is built against the latest Python 3, htslib 1.20, etc.
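For reference, a minimal sketch of pulling a newer FreeBayes from Bioconda, assuming conda is available on your cluster (the environment name "fb137" is just an example):

```shell
# Create an isolated environment with the current Bioconda build of FreeBayes.
conda create -y -n fb137 -c conda-forge -c bioconda freebayes=1.3.7
conda activate fb137
freebayes --version   # sanity check that the new binary is on PATH
```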

Comment:

Please do not use bioinformatics as a tag unless your post is about the field of bioinformatics itself.

Reply:

Alright. I will keep that in mind.

Answer (inedraylig ▴ 70, 4 months ago):

How did you generate the regions file? FreeBayes may get stuck and use a lot of memory in regions of especially high coverage. It's a good idea to use the coverage_to_regions.py script to generate regions by coverage (see the manual and also https://github.com/freebayes/freebayes/issues/479), and you may further want to down-sample regions of extremely high coverage using the --max-coverage option (again, see the manual).
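A sketch of the coverage-based approach, following the example in the FreeBayes README (file names here are placeholders, and it assumes bamtools and the FreeBayes helper scripts are on your PATH — check the manual for the exact arguments):

```shell
# Build regions of roughly equal total coverage instead of equal length.
# coverage_to_regions.py reads per-base coverage from stdin and emits
# a balanced set of regions (the 2nd argument is the number of regions).
bamtools coverage -in one_representative.bam \
    | coverage_to_regions.py genome.fasta.fai 500 > regions_by_coverage.txt
```

Regions balanced by coverage keep any one freebayes-parallel worker from landing on a pathologically deep region and blowing up the RAM budget.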

Comment:
fasta_generate_regions.py FS_21001_01.fasta.fai 1000000 > regions.txt

Thank you for the suggestion. This is how I generated the region file. Is there anything wrong with this?

Answer (Darked89 4.7k, 4 months ago):

Get the tail (say, the last 100 lines) of your VCF file and compare it with your regions file. Hopefully this will give you an idea of where in the genome the problem lies: the problematic region should be either the one covering the last lines of the VCF or the one after it. Run FreeBayes with just the problematic region, maybe with one region of padding on each side, to confirm that the issue really is there. Get some coverage stats, or view a few BAMs in IGV. Apart from over-the-top coverage, look at the mapping qualities and mismatches; MAPQ 1 is very low. I have no idea how complete your genome is, but there is a good chance that reads from repetitive regions not included in your FASTA assembly are being mapped to the wrong regions, inflating coverage and creating noise (a bazillion mismatched bases, indels, etc.).
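The inspection steps above might look something like this (the paths and the example chromosome/coordinates are placeholders, not taken from the original post):

```shell
# 1. Last non-header record the truncated run managed to write
tail -n 100 Freebayes_raw.vcf | grep -v '^#' | tail -n 1 | cut -f1,2
# e.g. it prints:  chr07   12345678

# 2. Find that region (and its neighbours) in the regions file
grep -n '^chr07:' regions.txt

# 3. Re-run FreeBayes on just the suspect region to confirm it is the culprit
freebayes -f genome.fasta --region chr07:12300000-12400000 \
    --bam-list Bam.list > suspect_region.vcf
```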

BTW, how long are your reads?

Comment:

Thank you for the suggestion. Someone else before me used the same dataset (BAMs and reference genome) to successfully produce a VCF file with the following command:

freebayes --report-genotype-likelihood-max --genotype-qualities --strict-vcf \
    --ploidy 2 --fasta-reference FS_21001_01.fasta --max-complex-gap 1 \
    --min-alternate-fraction 0.2 --limit-coverage 200 --theta 0.01 \
    --min-alternate-count 0 --min-coverage 0 --min-mapping-quality 1 \
    --use-best-n-alleles 4 --min-base-quality 10 --report-monomorphic \
    --variant-input known-alleles.vcf.gz --only-use-input-alleles \
    --bam-list 3717_sorted.list --targets target_batch0.bed

Now I have modified the command with some less stringent parameters and used freebayes-parallel to speed up the process, but it did not work. The reads are 150 bp long.

Reply:

With 150 bp reads and a 540 Mb genome (assuming it is not a madly polymorphic, Ciona-like thing), you can easily raise the MAPQ cutoff to 30 to get rid of the more dubious mappings. Which brings me to another question: are your BAMs primary alignments only? Duplicates marked?
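If you want to shrink the input itself rather than only raising freebayes' --min-mapping-quality, a hypothetical per-BAM pre-filter with samtools could look like this (the file names are placeholders):

```shell
# Keep only primary, non-duplicate alignments with MAPQ >= 30.
# 0xD00 = 0x100 (secondary) + 0x400 (duplicate) + 0x800 (supplementary)
samtools view -b -q 30 -F 0xD00 sample.bam -o sample.filtered.bam
samtools index sample.filtered.bam
```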

Re freebayes-parallel: from what I gather, the 32 in your command (ncpus) is the number of regions processed in parallel, not, as I was hoping, the number of BAM files being read in parallel by different threads. Since the data for each region is stored separately, the RAM used in the worst case can be up to 32x the RAM used by your predecessor's command. Apart from reducing the input size (MAPQ, duplicates; not sure whether you are already doing this), there may be no simple way of getting a single VCF on a single compute node quickly.

But instead of running freebayes-parallel with ncpus=32 on a single node, you can split the regions file into, say, 100 chunks, submit 100 freebayes-parallel jobs with ncpus=2 (or 4) and far less RAM each, and then merge the 100 resulting VCFs.
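A rough sketch of that split-and-merge scheme (the SLURM resource numbers and file names are illustrative only, and bgzip/tabix/bcftools must be available for the merge step):

```shell
# Split the regions file into 100 equal chunks without breaking lines
split -n l/100 -d -a 3 regions.txt chunk_      # chunk_000 ... chunk_099

# One small freebayes-parallel job per chunk
for chunk in chunk_*; do
    sbatch --cpus-per-task=4 --mem=30G --time=2-00:00:00 --wrap \
        "freebayes-parallel $chunk 4 -f genome.fasta \
         --bam-list Bam.list > $chunk.vcf"
done

# Once all jobs have finished: compress, index, and concatenate in order
for v in chunk_*.vcf; do bgzip "$v" && tabix -p vcf "$v.gz"; done
bcftools concat chunk_*.vcf.gz -O z -o Freebayes_raw.vcf.gz
```

Because fasta_generate_regions.py emits regions in genome order and split names the chunks sequentially, the chunk_*.vcf.gz glob concatenates them back in the right order.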

I expect some jobs to fail for one reason or another, so I suggest switching to a workflow manager (e.g. Nextflow) so you will not need to restart everything from scratch.

Reply:

Hello, thanks a lot for the detailed suggestion. FYI, my genome is highly heterozygous. All the BAM files are correctly aligned, sorted, and duplicate-marked, with read groups added. I will try splitting the region file and then submitting the freebayes-parallel jobs; I think that is a very good suggestion. Thank you.
