Odd. WGS Extract provides the buttons for sorting and indexing once a file is opened (if it is not already sorted and/or indexed). But a CRAM is normally coordinate sorted when it is created, so that is unlikely to be the issue.
Nebula has always used the 1000 Genomes (1K Genome) references: HS38D1, HS38DH, etc. Which one has varied over time. These are different from the HG38 reference, especially on the Y chromosome. WGS Extract identifies the reference from the header and reports it in the stats. People are very loose about, and even ignorant of, the reference model issues. It is a problem that needs to be fixed in the BAM format, at the very least by requiring the M5 field at all times. See https://bit.ly/34CO0vj for more information.
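If you want to see for yourself what the header says about the reference, here is a minimal Python sketch that pulls the SN, LN and M5 tags out of the @SQ lines of a SAM-style header (for example, the text printed by "samtools view -H"). The function names are made up for this post, and this is only an illustration, not how WGS Extract does it internally.

```python
def parse_sq_lines(header_text):
    """Return one dict per @SQ header line, keyed by the two-letter tag."""
    sequences = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = {}
        # Fields are tab separated and look like TAG:VALUE.
        # partition() splits on the first colon only, so values that
        # contain colons themselves (e.g. UR:file:/path) stay intact.
        for item in line.split("\t")[1:]:
            tag, _, value = item.partition(":")
            fields[tag] = value
        sequences.append(fields)
    return sequences


def missing_m5(sequences):
    """Names of sequences whose @SQ line carries no M5 tag."""
    return [sq.get("SN", "?") for sq in sequences if "M5" not in sq]
```

Comparing the LN and M5 values of chrY (and of any decoy or HLA sequences) against a candidate reference is usually enough to tell the HS38 variants and HG38 apart.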
As mentioned, SAMTOOLS does not require the reference for a CRAM to be specified. If none is given, it will look up the M5 (MD5) signature for each sequence in the EBI online database. This works because a CRAM is required to have the signatures in the header. Specifying the correct reference model is a convenience and a slight time and space saver. The lookup can also be more accurate if a standard reference model was not used, because the signature identifies each sequence individually.
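For anyone curious what that signature actually is: as I understand the SAM specification, the M5 value is the MD5 digest of the sequence after removing every character outside the printable range '!' to '~' (so line breaks and spaces go away) and converting it to upper case. Below is a minimal Python sketch of that calculation. The function names are my own, and the FASTA reader assumes a plain, uncompressed text file.

```python
import hashlib


def _clean(text):
    # Keep only characters in the range '!'..'~', then upper-case them.
    return "".join(ch for ch in text if "!" <= ch <= "~").upper()


def sequence_m5(sequence):
    """M5-style digest of one sequence given as a string."""
    return hashlib.md5(_clean(sequence).encode("ascii")).hexdigest()


def fasta_m5(path):
    """Yield (sequence_name, m5) for each record of a plain-text FASTA."""
    name, digest = None, None
    with open(path, "r", encoding="ascii") as handle:
        for line in handle:
            if line.startswith(">"):
                if name is not None:
                    yield name, digest.hexdigest()
                # The name is the first word after '>'.
                name = (line[1:].split() or [""])[0]
                digest = hashlib.md5()
            elif name is not None:
                # Feed the digest line by line so a whole chromosome
                # never has to sit in memory at once.
                digest.update(_clean(line).encode("ascii"))
    if name is not None:
        yield name, digest.hexdigest()
```

If the values produced for your reference FASTA match the M5 tags in the CRAM header, you have the right reference for those sequences.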
My gut feel is that the CRAM files are corrupted, as SAMTOOLS is not correctly parsing them either. WGS Extract uses SAMTOOLS internally to parse CRAM files. 30x WGS CRAM / BAM files are very large and can be difficult to download correctly -- especially over a slow, wireless link like a mobile phone connection. (This is sometimes the only connection people have to their desktop. I have run into this many times while helping people debug issues with WGS Extract when it tries to download files like reference genomes directly.) If you have SAMTOOLS installed, you should also have a program "htsfile". You may want to use that to more simply look at the file header and format, and see what it reports.
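If htsfile is not at hand, a rough first check can be done in a few lines of Python. The sketch below only looks at the first bytes (a CRAM starts with the four characters "CRAM" followed by major and minor version bytes; a BAM starts with the gzip magic bytes because it is BGZF compressed) and, for gzip-style files, at whether the file ends with the 28-byte empty BGZF block that marks the end of a complete BAM. It does not check the CRAM end-of-file container, and the function name and return strings are my own invention.

```python
import os

# The empty BGZF block that terminates a complete BGZF/BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200"
    "1b00" "0300" "00000000" "00000000"
)


def describe_file(path):
    """Very rough format sniffing based on magic bytes only."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(6)
        tail = b""
        if size >= len(BGZF_EOF):
            handle.seek(-len(BGZF_EOF), os.SEEK_END)
            tail = handle.read()
    if len(head) == 6 and head[:4] == b"CRAM":
        return "cram %d.%d" % (head[4], head[5])
    if head[:2] == b"\x1f\x8b":
        if tail == BGZF_EOF:
            return "bgzf-complete"
        # Either a truncated BGZF/BAM download or an ordinary gzip file.
        return "gzip-or-truncated"
    return "unknown"
```

A file that comes back as "unknown", or a BAM that comes back as "gzip-or-truncated", is a good hint that the download did not complete or that something rewrote the file along the way. Comparing the file size with what the provider reports is another cheap check.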
Note that CRAM is a highly specialized compression format. It is more susceptible to small, even single-bit, errors than other compression formats. This is because it is not just compressing a text file but transforming it before compressing. Some OS's (like Apple MacOS) try to uncompress files automatically so they can look internally at the content. Safari does this by default if it thinks it recognizes the format. This is a constant problem when downloading vcf.gz files that are actually compressed in BGZF format by BGZIP and not GZIP. The CRAM compression format should not be recognized but ....
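To tell whether a downloaded .gz file is still BGZF or has been rewritten as ordinary gzip (or unpacked entirely), you can inspect the gzip header. A BGZF block is a gzip member with the "extra field" flag set and a "BC" subfield of length 2 in that extra field. Here is a minimal Python sketch of that check; the function name is made up, and it only examines the first block of the file.

```python
import struct


def is_bgzf(first_bytes):
    """True if the bytes start with a BGZF block header."""
    # gzip magic (1f 8b) plus compression method 8 (deflate).
    if len(first_bytes) < 12 or first_bytes[:3] != b"\x1f\x8b\x08":
        return False
    flags = first_bytes[3]
    if not flags & 0x04:  # FEXTRA bit must be set for BGZF
        return False
    (xlen,) = struct.unpack("<H", first_bytes[10:12])
    extra = first_bytes[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```

Reading the first 64 bytes or so of the file in binary mode and passing them in is enough. If the result is False for a file that was supposed to come from BGZIP, the browser or OS has probably touched it, and tools that need a BGZF index will refuse to work with it.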
It has been 20 months since your post, but I would still be interested to know whether it was resolved and what the problem was.
Are you providing the correct reference file?
Thanks GenoMax. I'm not providing any, out of sheer ignorance.
I wouldn't even know what the correct reference file would be, where I might get it, or how I might make samtools aware of it.
You will need to find out from Nebula which reference file they used to create the CRAM. Manipulating the CRAM file will require that reference. See: https://www.htslib.org/workflow/cram.html
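Once you have the reference FASTA, samtools is made aware of it with the -T (--reference) option of "samtools view". Here is a small Python sketch that builds and runs such commands. The helper names and the file names in the usage comment are placeholders of mine, not anything Nebula or samtools prescribes.

```python
import shutil
import subprocess


def cram_header_command(cram_path, reference_fasta=None):
    """Command that prints only the header of a CRAM."""
    cmd = ["samtools", "view", "-H"]
    if reference_fasta:
        cmd += ["-T", reference_fasta]
    cmd.append(cram_path)
    return cmd


def cram_to_bam_command(cram_path, reference_fasta, bam_path):
    """Command that decodes a CRAM to BAM using an explicit reference."""
    return ["samtools", "view", "-b", "-T", reference_fasta,
            "-o", bam_path, cram_path]


def run(cmd):
    """Run the command and return whatever it printed to stdout."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(cmd[0] + " was not found on the PATH")
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


# Hypothetical usage; replace the placeholder file names with your own:
# print(run(cram_header_command("sample.cram", "reference.fa")))
```

Printing the header first is a cheap way to confirm that the file parses at all before attempting a full conversion.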
Thanks GenoMax, much obliged, I shall see what they have to say about it.