Hello everyone,
I have a multiple query genomes aligned to a target genome stored in a CRAM file where each query genome is represented by a read group. I want to extract a multiple sequence alignment for a specific region, say for instance chr1:1,000,000-1,111,000
. Usual output formats such as MAF, ALN or aligned FASTA would be fine.
I tried samtools view
but it reports all sequences which overlap the window, which can be extremely long in case there are two highly related genomes in the dataset, and I did not find a way or a tool to clip the output to the region only. Also, I do not know how to convert from SAM/BAM/CRAM to aligned FASTA (that is, FASTA with gaps) or other multiple alignment formats.
I also tried samtools tview
, and its option -p
allows me to jump directly to the region of interest in interactive mode. Unfortunately I have not found a way to export its output in any useful format. In addition, there seems to be a bug that prevented the --reference
option to provide the target genome sequence.
I know that storing genome alignments in CRAM format is unconventional, but this saves a lot of space and the ability index the multiple genome alignment and extract regions appears powerful… if the output format would be useful.
Do you know a solution?
Charles
It sounds like you're trying to re-implement pangenomics using linear aligners.
Have you considered using minigraph-cactus and vg/odgi downstream instead ? I think you'll get more joy out of this for most downstream applications.
Thanks for the suggestion but I am working at large evolutionary distances (for instance aligning the human genome to 75 primate chromosome assemblies ranging from lemurs to great apes) and if I understand well, pangenome software are not designed for that. Also the pipeline I use (https://nf-co.re/pairgenomealign/) completes all the alignments in parallel within a hour, and to the best of my knowledge, cactus has the reputation of being very slow…
progressiveCactus is built for large evolutionary distances https://www.nature.com/articles/s41586-020-2871-y (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md)
it does mention running the alignment for like 2 months though so, intensive job... ("Both alignments were run on the AWS cloud over the course of 3 weeks for the avians and 2 months for the mammals")
they have also created an ecosystem for a data format called taffy that can store/index/decode multiple sequence alignments https://github.com/ComparativeGenomicsToolkit/taffy
See if these work:
https://sourceforge.net/projects/sam2fasta/files/
https://github.com/virus-evolution/gofasta (meant for COVID size genomes)