Noob question. Samtools says my CRAM from Nebula is wonky, what can I do?
2
2
Entering edit mode
2.2 years ago
cromcruach ▴ 70

Hi there,

I've run into a problem and I can't solve it with Google-fu

I got my CRAM from Nebula Genomics and was wanting to use WGSExtract on it but WGSExtract says that it isn't indexed or sorted but when I try to view or sort it in samtools it tells me:

Failed to populate reference for id 0
Unable to fetch reference #0 10000..39404
Failure to decode slice
[main_samview] truncated file.

I really don't know if it's something to do with my install of samtools not working or if I need to do something else to the Nebula Genomics CRAM file.

I tried it also on my brother's file from Nebula and samtools says the same thing to me so I don't think the file can be 'bad'

If it makes a difference I'm using Ubuntu WSL

Does anybody out there know what I can do about this problem? Thanks.

CRAM Nebula samtools • 2.5k views
ADD COMMENT
1
Entering edit mode

Are you providing correct reference file?

ADD REPLY
0
Entering edit mode

Thanks GenoMax I'm not providing any out of sheer ignorance.

I wouldn't even know what the correct reference fie would be, where I might get it, or how I might make samtools aware of it.

ADD REPLY
1
Entering edit mode

You will need to find out from Nebula which reference file they used to create the CRAM. Manipulations with CRAM file will need the reference. See: https://www.htslib.org/workflow/cram.html

ADD REPLY
0
Entering edit mode

Thanks GenoMax, much obliged, I shall see what they have to say about it.

ADD REPLY
1
Entering edit mode
23 months ago
Jan Kotrs ▴ 10

Hi, nebula uses MGI instruments to sequence. You could interrogate the .cram headers via samtools view -H your.cram

Mine (from nebula) states UR:/mnt/ssd/MegaBOLT_scheduler/reference/hg38.fa which after bit of googling matches the "MegaBOLT" user manual's reference:

MegaBOLT provides three references: hg19 (default), hg38, and hs37d5.
The storage diretory is : /mnt/ssd/MegaBOLT/reference/. Major files and their
explanation are shown in the table below..

Then below the listing there's a note:

The hg19, hg38, and their associated files can be downloaded from NCBI.

So if that is correct my reads are aligned using hg38 from NCBI, likely: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.26/ (but I can be wrong about the specific assembly build)

The referenced manual: https://en.mgi-tech.com/Uploads/detail/2020-03-05/5e60b822cb712.pdf

Good luck with your analysis!

ADD COMMENT
0
Entering edit mode

The CRAM headers should have have M5 tags per @SQ line with the checksum of the sequence. If you use the samtools misc/seq_cache_populate.pl script to expand a fasta file into a directory of M5-named filenames (one per sequence) then CRAM decoding is faster due to no longer needing to parse fasta every time, but also will be more bullet-proof to any shenanigans of people attempting to use custom edited references without adding proper data-provenance in the file headers.

See the samtools man page for details (ENVIRONMENT VARIABLES section)

ADD REPLY
0
Entering edit mode

Did anyone tried this approach so far with success? Scrating my head over my nebula files right now and don't want to perform WGS calling twice.

ADD REPLY
0
Entering edit mode
11 weeks ago
Randy H ▴ 110

Odd. WGS Extract provides the buttons for sorting and indexing once a file is opened (if it is not already sorted and/or indexed). BUT, a CRAM has to be coordinate sorted to be created. So that cannot be the issue.

Nebula has always used the 1K Genome references. HS38D1, HS38DH, etc. It has varied over time. These are different than the HG38 reference; especially on the Y chromosome. WGS Extract identifies the reference from the header and reports it in the stats. People are very loose and even ignorant to the reference model issues. It is a problem that needs to be fixed in the BAM format. At least maybe requiring the M5 field at all times. See https://bit.ly/34CO0vj for more information.

As mentioned, SAMTOOLS does not require the reference for a CRAM to be specified. It will look up the M5 signature for each sequence in the EBI online database. A CRAM is required to have the signatures in the header. Specifying the correct reference model is a convenience and slight time and space saver. It can also be more accurate if a standard reference model was not used -- it is the signature for each sequence individually.

My gut feel is the CRAM files are corrupted. As SAMTOOLS is not correctly parsing them either. WGS Extract is using SAMTOOLS internally to parse CRAM files. 30x WGS CRAM / BAM files are very large and can be difficult to download correctly -- especially if using a slow, wireless link like over a mobile phone connection. (This is sometimes the only connection people have to their desktop. I have run into this many times trying to help people debug issues with WGS Extract and it trying to download files like reference genomes directly.) If you have SAMTOOLS installed, you should also have a program "htsfile". You may want to use that to more simply look at the file header and format. And see what it reports.

Note that CRAM is a very unique, specialized compression format. It is more susceptible to single-bit, small errors than other compression formats. This is because it is not just compressing a text file but transforming it before compressing. Some OS's (like Apple MacOS) try to uncompress files automatically so they can look internally at the content. Safari does this by default if it thinks it recognizes the format. And is a constant problem with downloading vcf.gz files that are actually compressed in BGZF format by BGZIP and not GZIP. The CRAM compression format should not be recognized but ....

It has been 20 months since your post but I would still be interested to understand if it was resolved and what the problem was.

ADD COMMENT
0
Entering edit mode

This has nothing to do with single-bit errors or OSes checking in files. Like BAM, CRAM has checksums so it'll spot single-bit errors and report them as CRC failures. This error is basically as it states (albeit a bit tersely) - it cannot find the appropriate reference sequence.

If it cannot find them from the SQ M5 header tag it's possible that the machine is firewalled off and is unable to connect to the EBI reference servers. This is a slow way to do things anyway and it's advisable to use a local repository for references or to manually give it a fasta file (provided you know what it is, but try samtools view -H to look). Don't just guess though! That'll make things worse.

(I've no idea though about WGSExtract and whether it has options to control things like where the reference file is. If not, convert it first using samtools instead.)

ADD REPLY

Login before adding your answer.

Traffic: 2175 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6