Need to split a cram file into smaller pieces - how to extract and save cram format to a new file?
1
0
Entering edit mode
3 months ago
Ann ★ 2.4k

Hi!

I have a very large CRAM file - alignments to a genome assembly.

I want to extract the header and all the reads mapping to chr1 and save them to a new file, also in CRAM format.

I don't need the original reference sequence, and I'm not sure if I even have it. The provider calls the genome version "hg38" but does not specify the specific version - i.e., the "patch" - that they used.

Is there a way to do this using samtools? Like I said, I don't need the reference, just the alignments.

I tried the following, but the command never finished:

samtools view --output-fmt cram -o chr1.cram BIGFILE.cram chr1

The index file BIGFILE.cram.crai is in the same working directory.

Thank you for your help!

Ann

samtools CRAM • 642 views
ADD COMMENT
0
Entering edit mode

use threads ? AND the ref. You cannot do much things with cram without a REF.

samtools view -T ref.fa --threads 16 --output-fmt cram -o chr1.cram BIGFILE.cram chr1
ADD REPLY
1
Entering edit mode

This command looks like it worked for me:

samtools view --threads 16 -Cho chr1.cram BIGFILE.cram chr1 
ADD REPLY
0
Entering edit mode

Thank you! Unfortunately I am getting this error:

[E::cram_flush_thread] Call to cram_encode_container failed [E::validate_md5] SQ header M5 tag discrepancy for reference 'chr1' [E::validate_md5] SQ header M5 tag discrepancy for reference 'chr1' [E::validate_md5] Please use the correct reference, or consider using embed_ref=2

The provider says the genome assembly version is hg38, but it seems my hg38 doesn't match theirs. I can open the "cram" file just fine in my genome browser and the sequences seem fine.

Do you know what "embed_ref=2" means? (It doesn't appear to be explained in the output of samtools view --help)

ADD REPLY
0
Entering edit mode

this is not the correct reference. The reference of each chromosome is identified by a checksum and they are not the same here.

ADD REPLY
0
Entering edit mode

embed_ref=2 replaces the reference with an on-the-fly constructed consensus of the alignment records.

It can help to produce a CRAM file that doesn't have external dependencies and is useful when going from formats that don't use reference-based encoding (eg SAM or BAM), but it's not helpful where your input was a reference encoded CRAM and the reference doesn't match as that needs fixing first.

Note just checking alignments and stating that they appear fine doesn't mean they're correct! CRAM is encoding differences relative to the reference, but if the reference differs then any non-edit in the sequences will also change. If the MD5sums differ, then you need to figure out what reference the original file creator used. It may be you can just download it from the EBI refget server, but some people have a bad habbit of editing local references without thinking through the downstream consequences.

ADD REPLY
0
Entering edit mode
3 months ago
jkbonfield ★ 1.3k

You can do this efficiently using io_lib's cram_filter tool (https://github.com/jkbonfield/io_lib).

Note this functionality has been added to samtools cat and will appear in the next release (Samtools 1.21).

The advantage of this way is it doesn't decode the CRAM file as it simply reads the index to work out which disk blocks need copying. You can even manually replicate this yourself by zcat on the .crai to read the index and then crafting the necessary dd commands to add CRAM header, a series of containers, and footer. However that requires detailed knowledge of the format, hence the reason for cram_filter or cat.

ADD COMMENT

Login before adding your answer.

Traffic: 3010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6