Hello, I am looking for some advice
I am currently building a k-mer database and need to merge, sort, and take the unique lines from 47 sample.txt.gz files, each about 16 GB. What would be the fastest way to do this? My current plan is:
zcat *.merged.kmers.txt.gz | sort --parallel=48 --buffer-size=1400G | uniq | gzip > all_unique_kmers.txt.gz
I have the option of running this on a Slurm cluster with 48 CPUs across two nodes. I'm quite new to this, so any advice would be great, including on the Slurm side of things.
Please do not use bioinformatics as a tag unless your post is about the field of bioinformatics itself. For proper examples, please see Forum and News type posts under https://www.biostars.org/tag/bioinformatics/. I've removed the tag this time, but please keep this in mind for future posts.
So does it work? Since you are able to allocate 1.4 TB (if the number above is correct) it should work.
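For the Slurm side, something along these lines would be a reasonable starting point. This is an untested sketch: the partition name, memory, and walltime are placeholders you would adjust for your cluster, and note that a single sort process can only use the cores of one node, so I would request one node rather than spreading the job over two. LC_ALL=C makes comparisons byte-wise (faster), -T points sort's temporary spill files at scratch space, and sort -u replaces the separate uniq step.

#!/bin/bash
#SBATCH --job-name=kmer_merge
#SBATCH --nodes=1                    # sort is a single process; it cannot span nodes
#SBATCH --cpus-per-task=24           # whatever one node actually offers
#SBATCH --mem=1500G                  # placeholder; must exceed --buffer-size plus overhead
#SBATCH --time=48:00:00              # placeholder walltime
#SBATCH --partition=bigmem           # placeholder partition name

# one-shot version of your pipeline; -T should point at fast local scratch
zcat *.merged.kmers.txt.gz \
  | LC_ALL=C sort -u --parallel=${SLURM_CPUS_PER_TASK} --buffer-size=1400G -T /tmp \
  | gzip > all_unique_kmers.txt.gz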
It will probably take a long time, though, if you throw all the files together in a single job. It may be better to work with a smaller set of files (say 4 to 6) at a time and keep building up with each step, as sketched below.
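Untested, but the incremental idea could look roughly like this. Batch size, thread counts, and buffer sizes are placeholders, and it assumes filenames without spaces. Each batch is sorted and de-duplicated on its own, so the final pass only has to handle the already-reduced per-batch output.

# 1. split the 47 inputs into batches of ~6 files each (creates batch_aa, batch_ab, ...)
ls *.merged.kmers.txt.gz | split -l 6 - batch_

# 2. sort and de-duplicate each batch independently (these could be separate Slurm jobs)
for b in batch_??; do
    zcat $(cat "$b") | LC_ALL=C sort -u --parallel=8 --buffer-size=100G \
        | gzip > "$b.sorted.gz"
done

# 3. final pass over the much smaller per-batch results
#    (sort -m on decompressed streams would avoid a full re-sort here,
#     but a plain sort -u on the reduced data is simpler)
zcat batch_??.sorted.gz | LC_ALL=C sort -u --parallel=48 --buffer-size=1400G \
    | gzip > all_unique_kmers.txt.gz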