8 hours ago
FJCF
Hi everyone, I need to sort a big multi-FASTA file (around 160 GB) with seqkit sort, and I've noticed that the sorted output is larger than the input. Does anybody know why this could happen? I used seqkit sort -N.
Thanks in advance!
Does the input file contain long single-line sequences, while the sorted output is wrapped at a fixed line width?
I've checked, and both files have multi-line sequences wrapped at a fixed width, with this structure:
Please paste the output of 'seqkit stats' and 'seqkit sum' for the two files.
If the line width is the same in both files, the number of lines should be the same, right?
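One way to check this directly is to tally where the bytes in each file actually go. The following is only a rough sketch in Python, not something seqkit provides: the helper name `fasta_byte_breakdown` is made up for illustration, and it assumes an uncompressed FASTA file where every line starting with ">" is a header. It counts header bytes, sequence bytes, and line-terminator bytes separately, and records the distribution of sequence line widths.

```python
from collections import Counter


def fasta_byte_breakdown(path):
    """Tally the bytes of a FASTA file by kind (hypothetical helper)."""
    stats = Counter()
    widths = Counter()  # sequence line width -> number of lines
    with open(path, "rb") as handle:
        for raw in handle:
            # Separate the line terminator ("\n" or "\r\n") from the content.
            body = raw.rstrip(b"\r\n")
            stats["newline_bytes"] += len(raw) - len(body)
            if body.startswith(b">"):
                stats["records"] += 1
                stats["header_bytes"] += len(body)
            else:
                stats["sequence_lines"] += 1
                stats["sequence_bytes"] += len(body)
                widths[len(body)] += 1
    stats["total_bytes"] = (
        stats["newline_bytes"] + stats["header_bytes"] + stats["sequence_bytes"]
    )
    return dict(stats), dict(widths)
```

Running it on both files and comparing the two results shows whether the extra size comes from more sequence lines, different line endings, or longer headers. If "sequence_bytes" and "records" match, the sequence content has the same total length and only the layout differs.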
Never use file size as a criterion for QC or comparison, except in a qualitative way, e.g. is the file present, and is it zero bytes or does it contain data?
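To compare content instead of size, something along these lines could work. This is a minimal sketch and makes no claim about how 'seqkit sum' computes its digest: the names `iter_fasta` and `fasta_fingerprint` are made up for illustration, and it assumes uncompressed FASTA. Each record is hashed from its header and its unwrapped sequence, and the per-record hashes are added together, so the result does not depend on record order, line width, or line endings.

```python
import hashlib


def iter_fasta(path):
    """Yield (header, sequence) byte strings, one record at a time."""
    header, chunks = None, []
    with open(path, "rb") as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(b">"):
                if header is not None:
                    yield header, b"".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, b"".join(chunks)


def fasta_fingerprint(path):
    """Return (records, residues, order-independent digest) for a FASTA file."""
    n_records = n_residues = combined = 0
    for header, seq in iter_fasta(path):
        digest = hashlib.sha256(header + b"\n" + seq).digest()
        # Addition is commutative, so record order does not change the result.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        n_records += 1
        n_residues += len(seq)
    return n_records, n_residues, combined
```

If `fasta_fingerprint("input.fa") == fasta_fingerprint("sorted.fa")` (file names are placeholders), the two files hold the same records and the size difference comes from layout only. Only one record is held in memory at a time, but a full pass over a 160 GB file will still take a while.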
Perhaps this also applies in your case: https://askubuntu.com/questions/796947/why-is-my-sorted-file-bigger
For questions about or bugs in a specific tool, asking the author is also a good option: https://github.com/shenwei356/seqkit/issues