I am a master’s student (new to bioinformatics) and I need advice on customizing a Freebayes script. For context, I am working with 213 paired end ddRADseq samples that I mapped against a de novo reference genome that was generated from 18 high-quality individuals using dDocent. Currently, I am attempting to call snps and genotypes using Freebayes with these two alterations: 1) I want to consider each individual as its own population and 2) I want to feed a VCF file of the 18 high-quality individuals - which have been separately processed, genotyped and filtered - as a guide for which sites/alleles to call in the remaining samples. I am using freebayes v1.3.1-dirty. Below summarizes what I have tried and where I am stuck:
For separate populations: I could not find a parameter to call each individual as its own population, so I created my own pop file listing each individual under a unique identifier so freebayes will place them into separate “populations” and then fed the file under the --populations flag as a workaround.
For snp guidance: To use a VCF file (bgzipped and tabixed) as a genotyping guide, I used both -@ and -l flags to “use variants reported in VCF file as input to the algorithm” and “only provide variant calls and genotype likelihoods for sites and alleles provided in the VCF input…” This is where I cannot get freebayes to successfully run.
The dDocent pipeline contains a section of snp calling code which saves on memory and checks for errors. Initially, I edited the freebayes line by replacing the original target file “mapped.bed” with -@ and -l and the gzipped VCF (lines 489 and 507 in dDocent) like so:
freebayes -b cat-RRG.bam -@ thinSNPs_adults2.recode.vcf.gz -l -v raw.$1.vcf -f ../reference.fasta -m 5 -q 5 -E 3 --min-repeat-entropy 1 -V --populations ../popmap -n 10 -F 0.1
Note: cat-RRG.bam is a concatenated file of all my samples’ BAM files created by dDocent and is what I have used in all my runs.
This completely froze up, wouldn’t stop running, and stayed at 0% genotyped until I ended it. After this, I wanted to back up and see if the -@ flag would work with my files, so I ran a separate “bare bones” freebayes script outside of dDocent and without any of the extra parameters:
freebayes -f reference.fasta -@ thinSNPs_adults2.recode.vcf.gz cat-RRG.bam > basic1SNPs.vcf
It began to call snps but was taking forever to finish (which I expected) so I ended the job. Upon inspection, it was just calling snps and genotypes at all contigs and all sites, so I am wondering if the -@ flag isn’t working with how I have it set up.
Additionally, I wanted to try using freebayes-parallel to run my script more efficiently:
freebayes-parallel <fasta_generate_regions.py reference.fasta="" 100000)="" 32="" -f="" reference.fasta="" -@="" thinSNPs_adults2.recode.vcf.gz="" -l="" cat-RRG.bam="" -m="" 5="" -q="" 5="" -E="" 3="" --min-repeat-entropy="" 1="" -V="" --populations="" popmap="" -n="" 10="" -F="" 0.1=""> basic2SNPs.vcf
It wouldn’t run and I got the following errors:
parallel: Warning: No more file handles.
parallel: Warning: Raising ulimit -n or /etc/security/limits.conf may help
Error: signal 11: freebayes(+0xbd2ab)[0x5650f7c792ab] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd5ec525040] /home/acoles/.conda/envs/acoles_env/bin/../lib/libstdc++.so.6(_ZSt28_Rb_tree_rebalance_for_erasePSt18_Rb_tree_node_baseRS_+0x26)[0x7fd5ecf9faf9] freebayes(+0x6811a)[0x5650f7c2411a] freebayes(+0x6a2b0)[0x5650f7c262b0] freebayes(+0xa81b)[0x5650f7bc681b] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd5ec507bf7] freebayes(+0x10ea2)[0x5650f7bccea2]
I have been learning more about what these errors mean and how different people have tried to address them, but it is a little out of my depth.
Before addressing the freebayes-parallel issue though, I first wanted to ask if I am outlining my freebayes code correctly since my barebones script isn’t utilizing the -@ flag. I greatly appreciate any insight or suggestions I could try.