Question

Tool: Lightweight bash pipeline for ddRADseq read pre-processing, demultiplexing, and de-duplication using cutadapt

1

11 weeks ago

Rafal ▴ 70

Hi, I'd like to share a modular and transparent bash-based pipeline I've developed for pre-processing ddRADseq Illumina paired-end reads. It handles everything from adapter removal to demultiplexing and PCR duplicate filtering, all using standard tools like cutadapt, seqtk, and shell scripting.

The pipeline performs:

Adapter trimming with quality filtering (cutadapt)

Demultiplexing based on inline barcodes (cutadapt again)

Restriction site filtering + rescue of partially matching reads

Pairwise read deduplication using custom logic & DBR with seqtk + awk

Final read shortening
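To give a feel for the de-duplication step, here is a minimal sketch of pairwise duplicate removal with paste and awk. It is not the logic from the repository: it assumes plain (uncompressed) 4-line FASTQ files with mates in the same order, and it treats two pairs as duplicates only when both read sequences are identical. The function name `dedup_pairs` and the output naming are made up for illustration; a DBR could be appended to the key in the same way.

```shell
#!/usr/bin/env bash
# dedup_pairs R1.fq R2.fq OUT_PREFIX   (hypothetical helper, not from the repository)
# Keeps the first read pair seen for each unique (R1 sequence, R2 sequence) key.
# Assumes uncompressed 4-line FASTQ records with R1 and R2 in the same order.
dedup_pairs() {
  r1=$1
  r2=$2
  prefix=$3
  # paste joins line i of R1 with line i of R2, separated by a tab.
  paste "$r1" "$r2" |
    awk -F'\t' -v o1="${prefix}_1.fq" -v o2="${prefix}_2.fq" '
      # Collect the four lines of the current record for both mates.
      # Index 1 = header, 2 = sequence, 3 = plus line, 0 = quality.
      { a[NR % 4] = $1; b[NR % 4] = $2 }
      NR % 4 == 0 {
        key = a[2] FS b[2]
        if (!(key in seen)) {
          seen[key] = 1
          printf "%s\n%s\n%s\n%s\n", a[1], a[2], a[3], a[0] > o1
          printf "%s\n%s\n%s\n%s\n", b[1], b[2], b[3], b[0] > o2
        }
      }'
}
```

Called as `dedup_pairs sample_1.fq sample_2.fq sample_dedup`, this would write `sample_dedup_1.fq` and `sample_dedup_2.fq`. Keeping the first pair seen is an arbitrary choice in this sketch; a real pipeline might prefer the pair with the best quality.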

It is fully documented, lightweight, and designed for reproducibility. I created it for my own ddRAD projects, but I believe it might be useful for others working with RAD/GBS data too.

One of the main advantages is that it enables cleaner and more consistent input for downstream tools such as the STACKS pipeline, thanks to precise pre-processing and early duplicate removal. It helps avoid ambiguous or low-quality reads that can complicate locus assembly or genotype calling.

GitHub repository: https://github.com/rafalwoycicki/ddRADseq_reads

The scripts are especially helpful for people who want to avoid complex pipeline wrappers and prefer clear, customizable shell workflows.

Feedback, suggestions, and test results are very welcome! Let me know if you'd like to discuss use cases or improvements.

Best regards, Rafał

rad-seq ddRADseq demultiplexing deduplication • 1.4k views

2

I just had a brief look, some feedback:

I agree that workflow managers are not always strictly necessary, but if you write bash scripts, you should add some safety measures such as pipefail.
I see no indication of exactly which software is used, so for a new user it is unclear what to install. I saw cutadapt, ustacks and seqtk by briefly browsing the scripts. Install instructions or a conda environment file would be helpful. Note that using zcat restricts this to Linux, as macOS zcat behaves differently. Alternatively, gzip -cd would be more generic.
There are a lot (!) of hardcoded directory paths in these scripts. This should be implemented so that the path is provided either by a flag such as --directory or via environment variables, so the user never has to change source code. The same goes for activating conda environments. This should be done outside the scripts, or via a dedicated script only for software and paths, like a config file. Right now the user has to go through all the scripts and modify paths.
I see no indication of expected CPU, memory and maybe disk space requirements. Are these hardcoded or user-defined?
The readme would benefit from markdown code formatting or similar, rather than bold face, for the instructions.
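As a rough illustration of the first two points, a script header along these lines would cover the safety measures and the portable decompression. This is only a sketch: `count_reads` is a made-up helper for demonstration and does not exist in the repository.

```shell
#!/usr/bin/env bash
# Abort on errors (-e), on unset variables (-u), and when any stage
# of a pipeline fails rather than only the last one (pipefail).
set -euo pipefail

# count_reads FILE.fq.gz   (hypothetical helper for illustration)
# Decompresses with `gzip -cd` instead of `zcat`, so it behaves the same
# on Linux and macOS, then counts 4-line FASTQ records.
count_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

With pipefail set, a missing or corrupt input file makes the whole `gzip | awk` pipeline fail and stops the script, instead of silently reporting zero reads.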

ADD REPLY • link 11 weeks ago by ATpoint 88k

0

Thank you ATpoint for your feedback. I will try to add some safety measures, and thanks for the zcat info. I will try to list the dependencies. Thanks a lot! The variables can be set in command line, also directory:

Help message

usage() {
  echo "Usage: source ddradseq_pre.bash [options]"
  echo "Options:"
  echo "  --iftest <value>         Set test mode: 1 for testing one barcode (default), -0 for not testing"
  echo "  --qualada <value>        Quality for adapter trimming (default: 20,20)"
  echo "  --qualfil <value>        Quality for filtering (default: 20,20)"
  echo "  --errbar <value>         Allowed errors demultiplexing (default: 1)"
  echo "  --errfil <value>         Allowed errors cutsite filtering (default: 0)"
  echo "  --thr <value>            Number of threads (default: 10)"
  echo "  --lenada <value>         Min length after adapter removal (default: 140)"
  echo "  --lenfil <value>         Min length after cutsite filtering (default: 140)"
  echo "  --readsP5 <file>         Forward reads file (default: 1.fq.gz)"
  echo "  --readsP7 <file>         Reverse reads file (default: 2.fq.gz)"
  echo "  --barcodes <file>        Barcodes file (fasta file with barcodes) (default not set)"
  echo "  --directory <path>       Project output directory set to blank/nothing if locally (default)"
  echo "  --help                   Show this help message"
}

Thanks for your help. Best, Rafał

ADD REPLY • link updated 11 weeks ago by GenoMax 152k • written 11 weeks ago by Rafal ▴ 70

1

The variables can be set in command line, also directory

Is it? Looks hardcoded to me: https://github.com/rafalwoycicki/ddRADseq_reads/blob/main/ddradseq_dedup.bash#L6

For the dependencies, you can do a simple conda yaml file, such as this: https://github.com/ATpoint/rnaseq_preprocess/blob/main/environment.yml
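For this pipeline, such a file could be generated roughly as follows. This is a sketch under assumptions: the environment name `ddradseq` is made up, the package list only covers the tools mentioned in this thread (ustacks ships with the `stacks` package on bioconda), and versions are left unpinned because the ones used by the author are not stated.

```shell
# Write a hypothetical environment.yml into the current directory.
# The environment name and the unpinned package list are assumptions
# based on the tools named in this thread, not taken from the repository.
cat > environment.yml <<'EOF'
name: ddradseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - cutadapt
  - seqtk
  - stacks
EOF

# Then, outside the pipeline scripts:
#   conda env create -f environment.yml
#   conda activate ddradseq
```

Pinning versions in the file once they are known would make runs easier to reproduce.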

ADD REPLY • link 11 weeks ago by ATpoint 88k

0

Right! It is ready in ddradseq_pre.bash, but in ddradseq_dedup.bash the command-line variable setup is not working yet. The directory variable still needs to be set inside the script, at the beginning: <<directory="/...">>. Thanks!
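One possible way to drop the hardcoded assignment is sketched below: the directory is taken from a `--directory` flag, falls back to an environment variable, and otherwise defaults to the current directory. The function name `parse_directory` and the variable `DDRAD_DIRECTORY` are hypothetical and not part of the repository; other options are simply skipped here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parse_directory and DDRAD_DIRECTORY are illustrative
# names, not taken from ddradseq_dedup.bash.
parse_directory() {
  # Start from the environment variable, falling back to the current directory.
  directory="${DDRAD_DIRECTORY:-.}"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --directory)
        [ "$#" -ge 2 ] || { echo "--directory needs a value" >&2; return 1; }
        directory="$2"
        shift 2
        ;;
      --directory=*)
        directory="${1#--directory=}"
        shift
        ;;
      *)
        shift  # ignore arguments this sketch does not handle
        ;;
    esac
  done
  printf '%s\n' "$directory"
}
```

In a script this could be used as `directory=$(parse_directory "$@")`, so a user would run something like `bash ddradseq_dedup.bash --directory /path/to/project` without editing the source.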