How to extract a multiple sequence alignment file from a CRAM file?
1
1
Entering edit mode
26 days ago
Charles Plessy ★ 2.9k

Hello everyone,

I have a multiple query genomes aligned to a target genome stored in a CRAM file where each query genome is represented by a read group. I want to extract a multiple sequence alignment for a specific region, say for instance chr1:1,000,000-1,111,000. Usual output formats such as MAF, ALN or aligned FASTA would be fine.

I tried samtools view but it reports all sequences which overlap the window, which can be extremely long in case there are two highly related genomes in the dataset, and I did not find a way or a tool to clip the output to the region only. Also, I do not know how to convert from SAM/BAM/CRAM to aligned FASTA (that is, FASTA with gaps) or other multiple alignment formats.

I also tried samtools tview, and its option -p allows me to jump directly to the region of interest in interactive mode. Unfortunately I have not found a way to export its output in any useful format. In addition, there seems to be a bug that prevented the --reference option to provide the target genome sequence.

I know that storing genome alignments in CRAM format is unconventional, but this saves a lot of space and the ability index the multiple genome alignment and extract regions appears powerful… if the output format would be useful.

Do you know a solution?

Charles

samtools cram • 614 views
ADD COMMENT
2
Entering edit mode

It sounds like you're trying to re-implement pangenomics using linear aligners.

Have you considered using minigraph-cactus and vg/odgi downstream instead ? I think you'll get more joy out of this for most downstream applications.

ADD REPLY
1
Entering edit mode

Thanks for the suggestion but I am working at large evolutionary distances (for instance aligning the human genome to 75 primate chromosome assemblies ranging from lemurs to great apes) and if I understand well, pangenome software are not designed for that. Also the pipeline I use (https://nf-co.re/pairgenomealign/) completes all the alignments in parallel within a hour, and to the best of my knowledge, cactus has the reputation of being very slow…

ADD REPLY
1
Entering edit mode

progressiveCactus is built for large evolutionary distances https://www.nature.com/articles/s41586-020-2871-y (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md)

it does mention running the alignment for like 2 months though so, intensive job... ("Both alignments were run on the AWS cloud over the course of 3 weeks for the avians and 2 months for the mammals")

they have also created an ecosystem for a data format called taffy that can store/index/decode multiple sequence alignments https://github.com/ComparativeGenomicsToolkit/taffy

ADD REPLY
1
Entering edit mode

how to convert from SAM/BAM/CRAM to aligned FASTA

See if these work:

https://sourceforge.net/projects/sam2fasta/files/
https://github.com/virus-evolution/gofasta (meant for COVID size genomes)

ADD REPLY
1
Entering edit mode
ADD COMMENT
0
Entering edit mode

which can be extremely long in case there are two highly related genomes in the dataset

Will sam2logo handle this case?

ADD REPLY
0
Entering edit mode

for 111,000 bp ? it should work

ADD REPLY
0
Entering edit mode

Thanks Pierre and GenoMax for your suggestions. I could not manage to run sam2fasta but was able to run SAM4WebLogo and gofasta. Unfortunately, I had problems in both cases: SAM4WebLogo did not cut the sequences at the right position, and gofasta missed some sequences and did not output inserted bases. I attached a screenshot. The data is from an alignment that does not use public data, but I can share it in private if there is interest. Here are my command lines:

samtools tview --reference ref.bgzip.fa.gz aln.cram XSR:1111431-1111661  -p XSR:1111431

./jvarkit.jar sam4weblogo --reference ref.bgzip.fa.gz -r XSR:1111431-1111641 aln.cram > sam4weblogo.fasta

samtools view -h --reference ref.bgzip.fa.gz aln.cram XSR:1111431-1111641  | ./gofasta-linux-amd64 sam toMultiAlign -r ref.bgzip.fa.gz -w 80 --start 1111431 --end 1111641 > gofasta.fasta

Test of SAM4WebLogo and gofasta.

ADD REPLY
0
Entering edit mode

gofasta missed some sequences and did not output inserted bases

gofasta page notes the following so that is in line with your observation.

insertions relative to the reference are discarded from the output file so everything is the same (== reference) length.

ADD REPLY

Login before adding your answer.

Traffic: 1682 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6