Hello, I aim to identify SNPs from approximately 500 BAM files (non-human). I'm opting for bcftools since GATK, even with the Spark addition, takes a substantial 6 hours per sample. My objective is to generate a single VCF file encompassing all SNPs detected across the 500 samples. I'm considering two approaches:
1. Utilizing mpileup to process all BAM files simultaneously and subsequently calling SNPs. However, this method lacks parallelization, potentially resulting in a prolonged runtime.
2. Employing a parallelized approach by using mpileup on each file separately, allowing parallelization with a single thread for each run (so about 30 files simultaneously). Post-calling, I plan to merge the individual VCF files into one consolidated file. This approach may optimize the process, with the merging potentially outpacing the mpileup process.
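For concreteness, the second approach could be sketched roughly as below. This is only an illustration under assumptions: the `bams/` and `calls/` directories, `calls/list.txt` and `merged.vcf.gz` are hypothetical names, and the functions print the commands rather than running them so the sketch can be inspected without bcftools installed.

```shell
#!/usr/bin/env bash
# Sketch of approach 2: call each BAM separately, then merge the per-sample VCFs.
# Assumed (hypothetical) names: bams/, calls/, calls/list.txt, merged.vcf.gz.
REF=${REF:-my.ref.fasta}
JOBS=${JOBS:-30}   # number of single-threaded runs to keep going at once

# Print the single-sample pipeline for one BAM (printed, not executed).
per_sample_cmd() {
  local bam=$1
  local sample
  sample=$(basename "$bam" .bam)
  printf 'bcftools mpileup -Ou -f %s %s | bcftools call -mv -Oz -o calls/%s.vcf.gz && bcftools index calls/%s.vcf.gz\n' \
    "$REF" "$bam" "$sample" "$sample"
}

# Print the merge step; calls/list.txt would hold one per-sample VCF path per line.
merge_cmd() {
  printf 'bcftools merge -l calls/list.txt -Oz -o merged.vcf.gz\n'
}

# To actually run it (requires bcftools and xargs):
#   mkdir -p calls
#   for b in bams/*.bam; do per_sample_cmd "$b"; done | xargs -P "$JOBS" -I{} sh -c '{}'
#   ls calls/*.vcf.gz > calls/list.txt
#   merge_cmd | sh
```

One thing to keep in mind with this route: a variants-only VCF has no record at sites where a sample matches the reference, so after `bcftools merge` those samples show a missing genotype at that site. The `--missing-to-ref` option of `bcftools merge` fills them in as 0/0, but it cannot tell "reference" apart from "no coverage".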
Your insights on the most efficient strategy would be greatly appreciated.
How do you call with GATK? Do you use GVCF? How does it compare to bcftools? bcftools would take a huge amount of time.
Hello, and thank you for your response! Here is the command I used to invoke Spark with GATK HaplotypeCaller:
gatk HaplotypeCallerSpark -I myfile.bam -R my.ref.fasta -O out.vcf
Notably, I did not employ the -ERC GVCF option. In all my testing GATK was notably slower.
I think GATK would be faster in GVCF for 500 bams.
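For reference, the GVCF route being discussed would look roughly like the sketch below. The output names (`myfile.g.vcf.gz`, `cohort.g.vcf.gz`, `cohort.vcf.gz`) are placeholders, not from the thread, and the commands are only printed. Between the two steps the per-sample GVCFs have to be combined (GATK provides CombineGVCFs and GenomicsDBImport for this), which is left out here. Whether this ends up faster than bcftools for 500 samples is not something the sketch settles.

```shell
#!/usr/bin/env bash
# Sketch of the per-sample GVCF workflow with GATK (commands are printed, not run).
# Output file names are placeholders chosen for illustration.
REF=${REF:-my.ref.fasta}

# Per-sample step: HaplotypeCaller in GVCF mode (-ERC GVCF).
gvcf_cmd() {
  local bam=$1
  local sample
  sample=$(basename "$bam" .bam)
  printf 'gatk HaplotypeCaller -R %s -I %s -O %s.g.vcf.gz -ERC GVCF\n' "$REF" "$bam" "$sample"
}

# Joint step: genotype an already-combined GVCF ($1) into a final VCF ($2).
joint_cmd() {
  printf 'gatk GenotypeGVCFs -R %s -V %s -O %s\n' "$REF" "$1" "$2"
}

gvcf_cmd myfile.bam
joint_cmd cohort.g.vcf.gz cohort.vcf.gz
```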
Are you suggesting that the speed improvement applies specifically when compared to the bcftools gvcf output, or is it faster overall? Personally, I don't require the GVCF format for my analysis; I only need the SNPs. However, if the process is faster, I see no downside in using it.
Sorry, I forgot to answer this. bcftools with mpileup + call + filtering takes about an hour per sample on one thread. Since I have 30 threads, I can process 30 files per hour instead of one every 6 hours with Spark.
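A back-of-the-envelope version of that comparison, using only the numbers quoted in the thread (500 samples, 30 threads, about 1 hour per sample for bcftools, about 6 hours per sample for the Spark run). It assumes the Spark run occupies the machine for one sample at a time and ignores the time for the final merge:

```shell
#!/usr/bin/env bash
# Rough wall-clock estimate from the figures quoted in the thread.
SAMPLES=500
THREADS=30
BCFTOOLS_HOURS=1   # per sample, one thread
GATK_HOURS=6       # per sample, assumed to run one sample at a time

# Number of rounds of 30 parallel bcftools runs (ceiling division).
batches=$(( (SAMPLES + THREADS - 1) / THREADS ))
bcftools_total=$(( batches * BCFTOOLS_HOURS ))
gatk_total=$(( SAMPLES * GATK_HOURS ))

echo "bcftools, ${THREADS} samples at a time: ~${bcftools_total} h"
echo "GATK Spark, one sample at a time: ~${gatk_total} h"
```

That works out to about 17 hours against about 3000 hours, before merging.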
Sorry, it's still not clear: do you want to process the 500 BAMs in one invocation of bcftools (very slow), or do you want to process one BAM per bcftools run?
That is actually my question, and sorry for being unclear. Which is better: calling one BAM at a time and then merging into one VCF, or running mpileup on all 500 BAMs from the beginning?
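A middle ground that is not raised in the thread, offered only as a sketch: keep all 500 BAMs in one joint call, but split the work by region instead of by sample, then concatenate the pieces. It assumes indexed BAMs, a hypothetical `bams.txt` listing one BAM path per line, a `my.ref.fasta.fai` index from `samtools faidx`, and sequence names that are safe to use as file names. As above, the function prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: joint calling on all BAMs, parallelised per reference sequence.
# Assumed (hypothetical) names: bams.txt, calls/, concat.list, all.vcf.gz.
REF=${REF:-my.ref.fasta}
BAM_LIST=${BAM_LIST:-bams.txt}

# Print the joint mpileup|call pipeline restricted to one region.
region_cmd() {
  local region=$1
  printf 'bcftools mpileup -Ou -f %s -b %s -r %s | bcftools call -mv -Oz -o calls/%s.vcf.gz\n' \
    "$REF" "$BAM_LIST" "$region" "$region"
}

# To actually run it (requires bcftools, xargs, and the .fai index):
#   mkdir -p calls
#   cut -f1 "$REF.fai" | while read -r chr; do region_cmd "$chr"; done \
#     | xargs -P 30 -I{} sh -c '{}'
#   cut -f1 "$REF.fai" | sed 's|.*|calls/&.vcf.gz|' > concat.list   # keep reference order
#   bcftools concat -f concat.list -Oz -o all.vcf.gz
```

Because every region file contains all samples, the final step is a `bcftools concat` rather than a `bcftools merge`, and sites where a sample matches the reference get a real genotype instead of a missing one.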