I am relatively new to bioinformatics, though I have been doing scientific programming for a few years now.
The following example is illustrative of a recurring situation:
% ls -1
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr20.txt
chr21.txt
chr22.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrx.txt
chry.txt
Note how these file names, by default, don't get listed in the normal numeric ordering. To put it differently, their lexicographic and numeric orderings do not coincide.
My temptation (which may be a sign of "professional deformation") is to rename those files to something like this:
% ls -1
chr01.txt
chr02.txt
chr03.txt
chr04.txt
chr05.txt
chr06.txt
chr07.txt
chr08.txt
chr09.txt
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr20.txt
chr21.txt
chr22.txt
chrx.txt
chry.txt
...or maybe even this:
% ls -1
chr01.txt
chr02.txt
chr03.txt
chr04.txt
chr05.txt
chr06.txt
chr07.txt
chr08.txt
chr09.txt
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr20.txt
chr21.txt
chr22.txt
chr_x.txt
chr_y.txt
...so that the names sort naturally in numeric order and (much less importantly) line up when printed in a column.
Putting aside the fact that much in-house bioinformatics code out there is already dependent on the 1, 2, 3-type numbering, would it be an abomination in the eyes of most bioinformaticians to use 01, 02, 03, etc. instead of 1, 2, 3, etc. to number human chromosomes?
It may be tempting, but don't rely on the file system to order things for you. Use the version-sort option in GNU tools, for instance ls -v or sort -V (note the capital V for sort), to sort file names that have prefixes that you want to order "naturally". See: https://www.gnu.org/software/coreutils/manual/html_node/Details-about-version-sort.html
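As a minimal sketch of the version-sort approach, assuming GNU coreutils is installed: the helper name natural_sort below is hypothetical and only wraps sort -V, and the file names are a small subset of those in the question.

```shell
# natural_sort (hypothetical helper): read names on stdin and print them
# in version ("natural") order using GNU sort's -V option.
# LC_ALL=C keeps the ordering independent of the user's locale.
natural_sort() {
    LC_ALL=C sort -V
}

# Names arrive in an arbitrary order; no renaming or zero-padding is needed.
printf '%s\n' chr10.txt chr1.txt chrx.txt chr2.txt | natural_sort
# prints:
# chr1.txt
# chr2.txt
# chr10.txt
# chrx.txt
```

With this, the original 1, 2, 3-style names can stay as they are, and any existing code that depends on them keeps working; only the listing step changes.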