High memory sorting of kmers - best way to use sort
3
2
Entering edit mode
4 weeks ago

Hello, I am looking for some advice

I am currently creating a kmer database and looking to merge, sort, and take the unique lines from 47 sample.txt.gz files, each about 16 GB. What would be the fastest way to do this? I am currently thinking of doing this:

zcat *.merged.kmers.txt.gz | sort --parallel=48 --buffer-size=1400G | uniq | gzip > all_unique_kmers.txt.gz

I have the option of running on Slurm with 48 CPUs across two nodes. I'm quite new to this, so any advice would be great, including the Slurm side of things.

uniq linux sort kmers • 485 views
ADD COMMENT
1
Entering edit mode

Please do not use bioinformatics as a tag unless your post is about the field of bioinformatics itself. For proper examples, please see Forum and News type posts under https://www.biostars.org/tag/bioinformatics/

I've removed the tag this time but please keep this in mind for future posts.

ADD REPLY
0
Entering edit mode

I am currently thinking of doing this

So, does it work? Since you are able to allocate 1.4 TB (if the number above is correct), it should work.

It will probably take a long time if you throw all the files together in a single job. It may be better to work with a smaller set of files (say 4 to 6) at a time and keep building up with each step, along these lines:
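A rough shell sketch of that batched approach (untested; the batch size of 6 files and the buffer size are only illustrative):

    # split the list of 47 inputs into batches of 6 file names (batch.aa, batch.ab, ...)
    ls *.merged.kmers.txt.gz | split -l 6 - batch.
    # sort and deduplicate each batch on its own
    for list in batch.??; do
        zcat $(cat "$list") | LC_ALL=C sort --parallel=48 --buffer-size=200G | uniq > "$list.sorted.txt"
    done
    # merging the pre-sorted batches is much cheaper than one giant sort over everything
    LC_ALL=C sort --merge batch.??.sorted.txt | uniq | gzip > all_unique_kmers.txt.gz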

ADD REPLY
2
Entering edit mode
4 weeks ago

A merge sort is ideal for high-memory, parallelized sorting. Look at the -T option of sort to specify a temporary directory for your Slurm jobs. Ideally you would do everything on one compute node to limit I/O overhead. For example:
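A minimal sbatch sketch of that setup (untested; the scratch path /local/scratch is an assumption, so point -T at whatever fast local disk your cluster provides, and adjust --mem to what a single node actually offers):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=48
    #SBATCH --mem=1400G

    # -T keeps sort's temporary spill files on fast local disk instead of the default $TMPDIR
    zcat *.merged.kmers.txt.gz \
        | LC_ALL=C sort -T /local/scratch --parallel=$SLURM_CPUS_PER_TASK --buffer-size=80% \
        | uniq | gzip > all_unique_kmers.txt.gz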

ADD COMMENT
1
Entering edit mode
4 weeks ago

As you have several gz files, you could sort them in parallel and then merge using sort --merge. Using Nextflow, that would be something like (not tested):

    workflow {
        // sort/deduplicate each input file in parallel, then merge the sorted outputs
        STEP1(Channel.fromPath(params.paths).splitText().map{ file(it.trim()) })
        STEP2(STEP1.output.collect())
    }

    process STEP1 {
        cpus 48
        input:
            path(f1)
        output:
            path("${f1.simpleName}.txt"), emit: output
        script:
        """
        set -o pipefail
        gunzip -c ${f1} | LC_ALL=C sort -T . --parallel=${task.cpus} --buffer-size=1400G | uniq > ${f1.simpleName}.txt
        """
    }

    process STEP2 {
        input:
            path("SORTED/*")
        output:
            path("all_unique_kmers.txt.gz"), emit: output
        script:
        """
        LC_ALL=C sort -T . --merge SORTED/*.txt | uniq | gzip > all_unique_kmers.txt.gz
        """
    }
ADD COMMENT
0
Entering edit mode

If you're using a BSD Unix (macOS etc.), its sort may include other options, such as --mergesort. It also includes --mmap, which can help speed up data ingress and processing.
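For instance, on already-decompressed input (untested; these long options exist in BSD sort but not in GNU sort):

    sort --mergesort --mmap *.merged.kmers.txt | uniq | gzip > all_unique_kmers.txt.gz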

ADD REPLY
1
Entering edit mode
4 weeks ago

If the number of unique kmers is not very large, i.e. you have lots of duplicates, you could skip the sorting altogether. In Python or a similar language (C would be best but probably not worth the pain), you could read the input line by line and update a set with each new kmer. At the end, print out the set entries, one per line.

You can do this in parallel for each input file and then merge the outputs with cat | sort | uniq, which at this point should be fast.
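A minimal shell sketch of that idea (untested), with an awk associative array standing in for the in-memory set:

    # hash-based deduplication per file; memory grows with the number of unique kmers, not input size
    for f in *.merged.kmers.txt.gz; do
        zcat "$f" | awk '!seen[$0]++' > "${f%.txt.gz}.uniq.txt" &
    done
    wait
    # the per-file outputs are small when duplication is high, so this final step is fast
    cat *.uniq.txt | LC_ALL=C sort | uniq | gzip > all_unique_kmers.txt.gz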

ADD COMMENT
