Hi!
I have a very large CRAM file - alignments to a genome assembly.
I want to extract the header and all the reads mapping to chr1 and save them to a new file, also in CRAM format.
I don't need the original reference sequence, and I'm not sure if I even have it. The provider calls the genome version "hg38" but does not specify the specific version - i.e., the "patch" - that they used.
Is there a way to do this using samtools? Like I said, I don't need the reference, just the alignments.
I tried the following, but the command never finished:
samtools view --output-fmt cram -o chr1.cram BIGFILE.cram chr1
The index file BIGFILE.cram.crai is in the same working directory.
Thank you for your help!
Ann
use threads ? AND the ref. You cannot do much things with cram without a REF.
This command looks like it worked for me:
Thank you! Unfortunately I am getting this error:
The provider says the genome assembly version is hg38, but it seems my hg38 doesn't match theirs. I can open the "cram" file just fine in my genome browser and the sequences seem fine.
Do you know what "embed_ref=2" means? (It doesn't appear to be explained in the output of samtools view --help)
this is not the correct reference. The reference of each chromosome is identified by a checksum and they are not the same here.
embed_ref=2
replaces the reference with an on-the-fly constructed consensus of the alignment records.It can help to produce a CRAM file that doesn't have external dependencies and is useful when going from formats that don't use reference-based encoding (eg SAM or BAM), but it's not helpful where your input was a reference encoded CRAM and the reference doesn't match as that needs fixing first.
Note just checking alignments and stating that they appear fine doesn't mean they're correct! CRAM is encoding differences relative to the reference, but if the reference differs then any non-edit in the sequences will also change. If the MD5sums differ, then you need to figure out what reference the original file creator used. It may be you can just download it from the EBI refget server, but some people have a bad habbit of editing local references without thinking through the downstream consequences.