High memory sorting of kmers - best way to use sort
3
2
Entering edit mode
4 weeks ago

Hello, I am looking for some advice

I am currently creating a kmer database and looking to merge, sort, and take the unique lines from 47 sample.txt.gz files, each about 16 GB. What would be the fastest way to do this? I am currently thinking of doing this:

zcat *.merged.kmers.txt.gz | sort --parallel=48 --buffer-size=1400G | uniq | gzip > all_unique_kmers.txt.gz

I have the option of running on Slurm with 48 CPUs across two nodes. I'm quite new to this, so any advice would be great, including the Slurm side of things.

uniq linux sort kmers • 485 views
ADD COMMENT
1
Entering edit mode

Please do not use bioinformatics as a tag unless your post is about the field of bioinformatics itself. For proper examples, please see Forum and News type posts under https://www.biostars.org/tag/bioinformatics/

I've removed the tag this time but please keep this in mind for future posts.

ADD REPLY
0
Entering edit mode

I am currently thinking of doing this

So, does it work? Since you are able to allocate 1.4 TB (if the number above is correct), it should work.

It will probably take a long time if you throw all the files together in a single job. It may be better to work with a smaller set of files (say 4 to 6) at a time and keep building up with each step, along these lines:
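A rough shell sketch of that batched approach (untested; the batch size of 6 files and the buffer size are only illustrative):

    # split the list of 47 inputs into batches of 6 file names (batch.aa, batch.ab, ...)
    ls *.merged.kmers.txt.gz | split -l 6 - batch.
    # sort and deduplicate each batch on its own
    for list in batch.??; do
        zcat $(cat "$list") | LC_ALL=C sort --parallel=48 --buffer-size=200G | uniq > "$list.sorted.txt"
    done
    # merging the pre-sorted batches is much cheaper than one giant sort over everything
    LC_ALL=C sort --merge batch.??.sorted.txt | uniq | gzip > all_unique_kmers.txt.gz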

ADD REPLY
2
Entering edit mode
4 weeks ago

A merge sort is ideal for high-memory, parallelized sorting. Look at the -T option of sort to specify a temporary directory for your Slurm jobs. Ideally you would do everything on one compute node to limit I/O overhead. For example:
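A minimal sbatch sketch of that setup (untested; the scratch path /local/scratch is an assumption, so point -T at whatever fast local disk your cluster provides, and adjust --mem to what a single node actually offers):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=48
    #SBATCH --mem=1400G

    # -T keeps sort's temporary spill files on fast local disk instead of the default $TMPDIR
    zcat *.merged.kmers.txt.gz \
        | LC_ALL=C sort -T /local/scratch --parallel=$SLURM_CPUS_PER_TASK --buffer-size=80% \
        | uniq | gzip > all_unique_kmers.txt.gz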

ADD COMMENT
1
Entering edit mode
4 weeks ago

As you have several gz files, you could sort them in parallel and then merge using sort --merge. Using Nextflow, that would be something like (not tested):

    workflow {
        // sort/deduplicate each input file in parallel, then merge the sorted outputs
        STEP1(Channel.fromPath(params.paths).splitText().map{ file(it.trim()) })
        STEP2(STEP1.output.collect())
    }

    process STEP1 {
        cpus 48
        input:
            path(f1)
        output:
            path("${f1.simpleName}.txt"), emit: output
        script:
        """
        set -o pipefail
        gunzip -c ${f1} | LC_ALL=C sort -T . --parallel=${task.cpus} --buffer-size=1400G | uniq > ${f1.simpleName}.txt
        """
    }

    process STEP2 {
        input:
            path("SORTED/*")
        output:
            path("all_unique_kmers.txt.gz"), emit: output
        script:
        """
        LC_ALL=C sort -T . --merge SORTED/*.txt | uniq | gzip > all_unique_kmers.txt.gz
        """
    }
ADD COMMENT
0
Entering edit mode

If you're using a BSD Unix (macOS etc.), its sort may include other options, such as --mergesort. It also includes --mmap, which can help speed up data ingress and processing.
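For instance, on already-decompressed input (untested; these long options exist in BSD sort but not in GNU sort):

    sort --mergesort --mmap *.merged.kmers.txt | uniq | gzip > all_unique_kmers.txt.gz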

ADD REPLY
1
Entering edit mode
4 weeks ago

If the number of unique kmers is not very large, i.e. you have lots of duplicates, you could skip the sorting altogether. In Python or a similar language (C would be best but probably not worth the pain), you could read the input line by line and update a set with each new kmer. At the end, print out the set entries, one per line.

You can do this in parallel for each input file and then merge the outputs with cat | sort | uniq, which at this point should be fast.
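A minimal shell sketch of that idea (untested), with an awk associative array standing in for the in-memory set:

    # hash-based deduplication per file; memory grows with the number of unique kmers, not input size
    for f in *.merged.kmers.txt.gz; do
        zcat "$f" | awk '!seen[$0]++' > "${f%.txt.gz}.uniq.txt" &
    done
    wait
    # the per-file outputs are small when duplication is high, so this final step is fast
    cat *.uniq.txt | LC_ALL=C sort | uniq | gzip > all_unique_kmers.txt.gz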

ADD COMMENT
