Hi, I'd like to share a modular and transparent bash-based pipeline I've developed for pre-processing ddRADseq Illumina paired-end reads. It handles everything from adapter removal to demultiplexing and PCR duplicate filtering, using only standard tools such as cutadapt and seqtk plus shell scripting.
The pipeline performs:
Adapter trimming with quality filtering (cutadapt)
Demultiplexing based on inline barcodes (cutadapt again)
Restriction site filtering + rescue of partially matching reads
Pairwise read deduplication using custom logic and the degenerate base region (DBR), with seqtk + awk
Final read shortening
It is fully documented, lightweight, and designed for reproducibility. I created it for my own ddRAD projects, but I believe it might be useful for others working with RAD/GBS data too.
One of the main advantages is that it enables cleaner and more consistent input for downstream tools such as the STACKS pipeline, thanks to precise pre-processing and early duplicate removal. It helps avoid ambiguous or low-quality reads that can complicate locus assembly or genotype calling.
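To illustrate the duplicate-removal idea in rough terms, here is a minimal sketch. It is not the repository's code. It assumes uncompressed four-line FASTQ files with R1 and R2 in the same order, and that the DBR is still part of the read sequence, so that two pairs with identical R1 and R2 sequences also share the same DBR. The function name `dedup_pairs` is hypothetical.

```shell
# Hypothetical helper (not from the repository): print the names of read
# pairs to keep, retaining the first pair seen for each R1+R2 sequence
# combination. Assumes 4-line FASTQ records in identical order in both files.
dedup_pairs() {
    r1="$1"
    r2="$2"
    # paste joins line N of R1 with line N of R2, separated by a tab.
    paste "$r1" "$r2" | awk -F'\t' '
        # Header line: remember the read name without "@" and without comment.
        NR % 4 == 1 { split($1, h, " "); name = substr(h[1], 2) }
        # Sequence line: print the name only for a sequence pair not seen before.
        NR % 4 == 2 { if (!seen[$1 FS $2]++) print name }
    '
}
```

The resulting name list could then be handed to `seqtk subseq` to extract the retained reads from each file, for example `dedup_pairs R1.fq R2.fq > keep.lst` followed by `seqtk subseq R1.fq keep.lst > R1.dedup.fq`.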
GitHub repository: https://github.com/rafalwoycicki/ddRADseq_reads
The scripts are especially helpful for people who want to avoid complex pipeline wrappers and prefer clear, customizable shell workflows.
Feedback, suggestions, and test results are very welcome! Let me know if you'd like to discuss use cases or improvements.
Best regards, Rafał
I just had a brief look, some feedback:
Add safety measures such as pipefail etc.
zcat restricts this to Linux as macOS zcat behaves differently. Alternatively, gzip -cd would be more generic.
Paths should be set via an option such as --directory or via environment variables so the user never has to change source code. Same goes for activating conda environments. This should be done outside the scripts or via a dedicated script only for software and paths, like a config file. Basically, the user now has to go through all scripts and modify paths.

Thank you, ATpoint, for your feedback. I will try to add some safety measures, and thanks for the zcat info. I will try to list the dependencies. Thanks a lot! The variables can be set on the command line, including the directory:
Help message
Thanks for your help. Best, Rafał
Is it? Looks hardcoded to me: https://github.com/rafalwoycicki/ddRADseq_reads/blob/main/ddradseq_dedup.bash#L6
For the dependencies, you can do a simple conda yaml file, such as this: https://github.com/ATpoint/rnaseq_preprocess/blob/main/environment.yml
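For this pipeline, such a file might be created roughly as follows. This is only a sketch: the environment name `ddradseq` is hypothetical, no versions are pinned, and the list covers only the tools named in this thread, so the real dependency list may be longer.

```shell
# Write a minimal conda environment file (names and contents are assumptions).
cat > environment.yml <<'EOF'
name: ddradseq   # hypothetical environment name
channels:
  - conda-forge
  - bioconda
dependencies:
  - cutadapt
  - seqtk
EOF
```

The environment would then be created once with `conda env create -f environment.yml` and activated outside the pipeline scripts.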
Right! It is ready in ddradseq_pre.bash, but in ddradseq_dedup.bash setting variables from the command line is not working yet. The directory variable still needs to be set at the beginning of the script: <<directory="/...">>. Thanks!
ATpoint, the dependencies and hardcoding are taken care of for the moment. Thanks again!
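For readers following the thread, the suggestions discussed above (strict shell options, portable decompression, and a directory supplied from outside the script) could look roughly like this. It is a sketch under assumptions, not code from the repository: the environment variable `DDRAD_DIR` and the functions `parse_args` and `count_reads` are hypothetical names.

```shell
#!/usr/bin/env bash
# Stop on errors, on unset variables, and on failures inside pipelines.
set -euo pipefail

# Hypothetical default: take the directory from DDRAD_DIR if set,
# otherwise fall back to the current working directory.
directory="${DDRAD_DIR:-$PWD}"

# Let "--directory PATH" on the command line override the default,
# so nobody has to edit the script source.
parse_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --directory)
                if [ "$#" -lt 2 ]; then
                    echo "--directory needs a path" >&2
                    return 1
                fi
                directory="$2"
                shift 2
                ;;
            *)
                echo "Unknown option: $1" >&2
                return 1
                ;;
        esac
    done
}

# gzip -cd works the same way on Linux and macOS, unlike zcat.
# Counts reads in a gzipped four-line FASTQ file.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

A script built this way would call `parse_args "$@"` near the top and then refer only to `$directory`, so the same file runs unchanged on any machine.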