Question

How to extract a multiple sequence alignment file from a CRAM file?

1

Entering edit mode

26 days ago

Charles Plessy ★ 2.9k

Hello everyone,

I have a multiple query genomes aligned to a target genome stored in a CRAM file where each query genome is represented by a read group. I want to extract a multiple sequence alignment for a specific region, say for instance chr1:1,000,000-1,111,000. Usual output formats such as MAF, ALN or aligned FASTA would be fine.

I tried samtools view but it reports all sequences which overlap the window, which can be extremely long in case there are two highly related genomes in the dataset, and I did not find a way or a tool to clip the output to the region only. Also, I do not know how to convert from SAM/BAM/CRAM to aligned FASTA (that is, FASTA with gaps) or other multiple alignment formats.

I also tried samtools tview, and its option -p allows me to jump directly to the region of interest in interactive mode. Unfortunately I have not found a way to export its output in any useful format. In addition, there seems to be a bug that prevented the --reference option to provide the target genome sequence.

I know that storing genome alignments in CRAM format is unconventional, but this saves a lot of space and the ability index the multiple genome alignment and extract regions appears powerful… if the output format would be useful.

Do you know a solution?

Charles

samtools cram • 614 views

ADD COMMENT • link updated 25 days ago by cmdcolin ★ 4.2k • written 26 days ago by Charles Plessy ★ 2.9k

2

Entering edit mode

It sounds like you're trying to re-implement pangenomics using linear aligners.

Have you considered using minigraph-cactus and vg/odgi downstream instead ? I think you'll get more joy out of this for most downstream applications.

ADD REPLY • link 26 days ago by colindaven 7.3k

1

Entering edit mode

Thanks for the suggestion but I am working at large evolutionary distances (for instance aligning the human genome to 75 primate chromosome assemblies ranging from lemurs to great apes) and if I understand well, pangenome software are not designed for that. Also the pipeline I use (https://nf-co.re/pairgenomealign/) completes all the alignments in parallel within a hour, and to the best of my knowledge, cactus has the reputation of being very slow…

ADD REPLY • link 26 days ago by Charles Plessy ★ 2.9k

1

Entering edit mode

progressiveCactus is built for large evolutionary distances https://www.nature.com/articles/s41586-020-2871-y (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md)

it does mention running the alignment for like 2 months though so, intensive job... ("Both alignments were run on the AWS cloud over the course of 3 weeks for the avians and 2 months for the mammals")

they have also created an ecosystem for a data format called taffy that can store/index/decode multiple sequence alignments https://github.com/ComparativeGenomicsToolkit/taffy

ADD REPLY • link 25 days ago by cmdcolin ★ 4.2k

1

Entering edit mode

how to convert from SAM/BAM/CRAM to aligned FASTA

See if these work:

https://sourceforge.net/projects/sam2fasta/files/
https://github.com/virus-evolution/gofasta (meant for COVID size genomes)

ADD REPLY • link 26 days ago by GenoMax 150k

score 1 · Answer 1 · 2025-03-06

1

Entering edit mode

26 days ago

Pierre Lindenbaum 165k

I wrote sam2logo https://jvarkit.readthedocs.io/en/latest/SAM4WebLogo/

ADD COMMENT • link 26 days ago by Pierre Lindenbaum 165k

0

Entering edit mode

which can be extremely long in case there are two highly related genomes in the dataset

Will sam2logo handle this case?

ADD REPLY • link 26 days ago by GenoMax 150k

0

Entering edit mode

for 111,000 bp ? it should work

ADD REPLY • link 26 days ago by Pierre Lindenbaum 165k

0

Entering edit mode

Thanks Pierre and GenoMax for your suggestions. I could not manage to run sam2fasta but was able to run SAM4WebLogo and gofasta. Unfortunately, I had problems in both cases: SAM4WebLogo did not cut the sequences at the right position, and gofasta missed some sequences and did not output inserted bases. I attached a screenshot. The data is from an alignment that does not use public data, but I can share it in private if there is interest. Here are my command lines:

samtools tview --reference ref.bgzip.fa.gz aln.cram XSR:1111431-1111661  -p XSR:1111431

./jvarkit.jar sam4weblogo --reference ref.bgzip.fa.gz -r XSR:1111431-1111641 aln.cram > sam4weblogo.fasta

samtools view -h --reference ref.bgzip.fa.gz aln.cram XSR:1111431-1111641  | ./gofasta-linux-amd64 sam toMultiAlign -r ref.bgzip.fa.gz -w 80 --start 1111431 --end 1111641 > gofasta.fasta

Test of SAM4WebLogo and gofasta .

ADD REPLY • link 26 days ago by Charles Plessy ★ 2.9k

0

Entering edit mode

gofasta missed some sequences and did not output inserted bases

gofasta page notes the following so that is in line with your observation.

insertions relative to the reference are discarded from the output file so everything is the same (== reference) length.

ADD REPLY • link 25 days ago by GenoMax 150k