I have genotyping-by-sequencing (GBS) data for 74 individuals. My Snakemake pipeline keeps shutting down after a few samples, I think because I run out of memory.
# This is my alignment step:
bwa-mem2 mem -t {resources.cpus} -R "@RG\\tID:$rg\\tSM:$rg" {input.assembly} {input.reads} {input.reads2} | samblaster -r | samtools view -b - > {output}
# It seems to get stuck at the samblaster step in particular:
samblaster:
Removed 12418392 of 17461973 (71.117%) total read ids as duplicates using 96572k memory in 28.680S CPU seconds and 31M59S(1919S) wall time.
[Mon Nov 18 16:02:21 2024]
Finished job 26.
7 of 260 steps (3%) done
Shutting down, this might take some time.
Exiting because a job execution failed.
Look above for error message
Complete log: .snakemake/log/2024-11-18T153020.863365.snakemake.log
WorkflowError:
At least one job did not complete successfully.
Question
I realize there is a high number of duplicates. I am planning to filter out unfit samples after a Qualimap check, based on the percentage of mapped reads. Is it really necessary to remove duplicates from GBS data?
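For context, the per-sample check I have in mind is something along these lines (the paths and the 4G heap size are placeholders, not my exact settings):
# run Qualimap's BAM QC on each aligned sample
qualimap bamqc -bam bam/{sample}.bam -outdir qc/{sample} --java-mem-size=4G
# then inspect the "number of mapped reads" percentage in qc/{sample}/genome_results.txt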
Thank you for your comment! I am currently using bwa-mem2, as shown in the alignment step above.
I guess it is more of a question about GBS data specifically. I know that with WGS it makes sense to remove duplicates, but does that also apply to GBS data, which is generated through restriction enzymes and PCR?
If I recall correctly, bwa-mem2 uses a lot more memory than regular bwa but is maybe a tad faster.
Just use the regular bwa mem; alignment speed should not be a factor here. I don't think you should remove duplicates either: duplicate removal is necessary when the duplicates are suspected to be of artificial origin. In GBS you will get lots of natural duplicates, because the data is fragmented at specific locations (the restriction sites), so reads from distinct molecules naturally start at the same positions.
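So a stripped-down version of your command, with plain bwa mem and no samblaster step, could look roughly like this (an untested sketch keeping your Snakemake placeholders; I swapped in a samtools sort so the BAM is ready for indexing and Qualimap downstream):
# plain bwa mem, no duplicate removal; sort the output instead of just converting it
bwa mem -t {resources.cpus} -R "@RG\\tID:$rg\\tSM:$rg" {input.assembly} {input.reads} {input.reads2} | samtools sort -@ {resources.cpus} -o {output} -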
Great! I appreciate the fast response :) Indeed, it seems to be unnecessary to remove duplicates from GBS data.
The main issue was the memory and CPU allocation in the bwa mem shell command, which was exhausting the memory available to the whole Snakefile. I adjusted this and it works now.
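For anyone who hits the same problem, the kind of adjustment I mean is declaring memory and threads on the heavy rule and capping them globally. A minimal sketch (the numbers, rule name, and file paths are illustrative, not my exact values):
# in the Snakefile: reserve resources on the alignment rule
rule align:
    input:
        assembly="ref/assembly.fa",
        reads="fastq/{sample}_R1.fastq.gz",
        reads2="fastq/{sample}_R2.fastq.gz"
    output:
        "bam/{sample}.bam"
    threads: 8
    resources:
        cpus=8,
        mem_mb=16000   # per-job memory reservation in MB
    shell:
        'rg={wildcards.sample}; '
        'bwa mem -t {resources.cpus} -R "@RG\\tID:$rg\\tSM:$rg" '
        '{input.assembly} {input.reads} {input.reads2} '
        '| samtools sort -@ {resources.cpus} -o {output} -'

# on the command line: give Snakemake global caps so concurrent jobs
# cannot oversubscribe the node
snakemake --cores 32 --resources mem_mb=64000 cpus=32
With this, Snakemake only schedules as many concurrent jobs as fit under the mem_mb and cpus caps, instead of letting every job assume it has the whole node.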