I have genotyping-by-sequencing (GBS) data for 74 individuals. My Snakemake pipeline keeps shutting down after a few samples, I think because I run out of memory.
# This is my alignment step:
bwa-mem2 mem -t {resources.cpus} -R "@RG\\tID:$rg\\tSM:$rg" {input.assembly} {input.reads} {input.reads2} | samblaster -r | samtools view -b - > {output}
# It seems to get stuck at the samblaster step in particular:
samblaster:
Removed 12418392 of 17461973 (71.117%) total read ids as duplicates using 96572k memory in 28.680S CPU seconds and 31M59S(1919S) wall time.
[Mon Nov 18 16:02:21 2024]
Finished job 26.
7 of 260 steps (3%) done
Shutting down, this might take some time.
Exiting because a job execution failed.
Look above for error message
Complete log: .snakemake/log/2024-11-18T153020.863365.snakemake.log
WorkflowError:
At least one job did not complete successfully.
Question
I realize there is a high number of duplicates. I am planning to filter out unfit samples after a Qualimap check, based on the percentage of mapped reads. Is it really necessary to remove duplicates from GBS data?
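For context, the per-sample check I have in mind is something along these lines (the paths and the 4G heap size are placeholders, not my exact settings):
# run Qualimap's BAM QC on each aligned sample
qualimap bamqc -bam bam/{sample}.bam -outdir qc/{sample} --java-mem-size=4G
# then inspect the "number of mapped reads" percentage in qc/{sample}/genome_results.txt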
Thank you for your comment! I am currently using bwa-mem2, as shown in the alignment step above.
I guess it is more of a question about GBS data specifically. I know that with WGS it makes sense to remove duplicates, but does that also apply to GBS data, which is generated through restriction enzymes and PCR?
If I recall correctly, bwa-mem2 uses a lot more memory than regular bwa but is maybe a tad faster.
Just use the regular bwa mem; alignment speed should not be a factor here. I don't think you should remove duplicates either: duplicate removal is necessary when the duplicates are suspected to be of artificial origin. In GBS you will get lots of natural duplicates, because the data is fragmented at specific locations (the restriction sites), so reads from distinct molecules naturally start at the same positions.
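So a stripped-down version of your command, with plain bwa mem and no samblaster step, could look roughly like this (an untested sketch keeping your Snakemake placeholders; I swapped in a samtools sort so the BAM is ready for indexing and Qualimap downstream):
# plain bwa mem, no duplicate removal; sort the output instead of just converting it
bwa mem -t {resources.cpus} -R "@RG\\tID:$rg\\tSM:$rg" {input.assembly} {input.reads} {input.reads2} | samtools sort -@ {resources.cpus} -o {output} -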
Great! I appreciate the fast response :) Indeed, it seems to be unnecessary to remove duplicates from GBS data.
The main issue was the memory and CPU allocation in the bwa mem shell command, which was exhausting the memory available to the whole Snakefile. I adjusted this and it works now.
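For anyone who hits the same problem, the kind of adjustment I mean is declaring memory and threads on the heavy rule and capping them globally. A minimal sketch (the numbers, rule name, and file paths are illustrative, not my exact values):
# in the Snakefile: reserve resources on the alignment rule
rule align:
    input:
        assembly="ref/assembly.fa",
        reads="fastq/{sample}_R1.fastq.gz",
        reads2="fastq/{sample}_R2.fastq.gz"
    output:
        "bam/{sample}.bam"
    threads: 8
    resources:
        cpus=8,
        mem_mb=16000   # per-job memory reservation in MB
    shell:
        'rg={wildcards.sample}; '
        'bwa mem -t {resources.cpus} -R "@RG\\tID:$rg\\tSM:$rg" '
        '{input.assembly} {input.reads} {input.reads2} '
        '| samtools sort -@ {resources.cpus} -o {output} -'

# on the command line: give Snakemake global caps so concurrent jobs
# cannot oversubscribe the node
snakemake --cores 32 --resources mem_mb=64000 cpus=32
With this, Snakemake only schedules as many concurrent jobs as fit under the mem_mb and cpus caps, instead of letting every job assume it has the whole node.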