Update 2: Digging deeper showed that, starting with Ensembl Release 104, gene records in the GTF files no longer appear in ascending order (in Releases 103 and earlier, they did). Here's what I found out: Ensembl Release 104 and newer GTF files no longer have genes sorted by position.
Update: It seems that these files simply do not necessarily list genes in the order of their chromosomal position, and that the order can change from release to release. So you cannot rely on the order staying the same, but have to enforce a consistent sort order yourself.
Long story short:
The GTF files from Ensembl Releases 105 and 106 are not sorted properly, for some reason. If you need them sorted, just avoid those versions (getting them sorted does not seem to be possible in a sane and sound way, I tried). Namely, these files are broken:
- https://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
- https://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz
Releases 104, 107 and 108 seem to be properly sorted.
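For anyone who wants to check their own files, here is a minimal Python sketch of the kind of test I mean. It only looks at `gene` records and reports the first one whose start coordinate is smaller than that of the previous gene on the same sequence. The function names are made up for illustration, and the check assumes the usual tab-separated, 9-column GTF layout; it says nothing about the order of the sequences themselves or of the features within a gene.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed GTF file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def first_unsorted_gene(lines):
    """Return (line_number, seqname, start) for the first gene record that
    starts before the previous gene on the same sequence, or None if the
    gene records are in ascending start order on every sequence."""
    last_start = {}  # seqname -> start of the most recent gene seen there
    for line_number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene records are compared
        seqname, start = fields[0], int(fields[3])
        if start < last_start.get(seqname, 0):
            return line_number, seqname, start
        last_start[seqname] = start
    return None


# Usage (the file name is only an example):
# with open_text("Homo_sapiens.GRCh38.105.gtf.gz") as handle:
#     print(first_unsorted_gene(handle))
```

A result of `None` means no out-of-order gene was found; anything else points at the first offending line.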
I have found no indication of anything relating to this, neither through a reasonable web search nor by manually looking through the Ensembl release notes and the Ensembl production code repository on GitHub (which I think contains the code that produces those files). As the latter repo does not allow the creation of issues (at least not for me), and as I did not find any way to file a bug report on the Ensembl website either, I am documenting this problem here for others to find. This has caused me multiple days of hunting down weird behaviour in a tool, which eventually turned out to rely on the input GTFs being sorted.
I often use the following for GTF/GFF sorting:
First sort on the sequence name (k1) in natural sorting mode (V; this sorts like seq1, seq2, seq10), then on the start coordinate (k4), then in reverse on the stop coordinate (k5) to get gene etc. above CDS and UTR within each gene, and finally in reverse on the feature (k3) to get gene above CDS if their coordinates are equal.
This captures more cases than the 'default' one, but keep in mind that not everything ends up sorted properly.
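The same key order can also be sketched in Python, for cases where GNU sort is not at hand. This is only an illustration of the ordering described above, not the original command: `natural_key` and `sort_gtf_lines` are made-up names, and the natural ordering here only approximates what sort's V mode does.

```python
import re


def natural_key(text):
    """Split text into digit and non-digit runs so that seq2 sorts before seq10.

    Each run is tagged so that numbers and strings are never compared directly.
    """
    parts = [part for part in re.split(r"(\d+)", text) if part]
    return [(0, int(part)) if part.isdigit() else (1, part) for part in parts]


def sort_gtf_lines(lines):
    """Return the comment lines first, followed by the records ordered by
    sequence name (natural), start (ascending), end (descending) and
    feature (reverse alphabetical)."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]

    # Python's sort is stable, so the least significant key is applied first.
    records.sort(key=lambda line: line.split("\t")[2], reverse=True)

    def position_key(line):
        fields = line.split("\t")
        # Negating the end coordinate puts longer features first at equal start.
        return natural_key(fields[0]), int(fields[3]), -int(fields[4])

    records.sort(key=position_key)
    return header + records
```

Doing it in two passes avoids having to invert a string key: the feature column is sorted in reverse first, and the positional sort then keeps that order for records with identical coordinates.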
Also a +1 for the AGAT approach :-)
*cough cough*
CGAT will also do this:
Thanks for the response. The problem actually isn't in the sub-ordering within genes, but that genes were completely out of order, and only in these two versions. With the file format so loosely specified, consistency is important as tools will often implicitly rely on it. But thanks for all the sorting suggestions, some of them I hadn't seen yet (I did try to do the sorting myself for consistency, but couldn't get it to sort correctly with standard sort).