Set `dupeMode` option as 'single'

Question

How can I convert a MAF of HAL (from cactus-hal2maf) to Multiple FASTA?

0

Entering edit mode

8 days ago

Mimi33 ▴ 10

(I edited some typo.)

Hi, I am planning to using GERP++ with a whole genome alignment. I want to get a multiple alignment FASTA file from my results from Cactus, a whole-genome aligner (https://github.com/ComparativeGenomicsToolkit/cactus/tree/master). I seems to have a correct HAL file with cactus from multiple genome FASTA files, then I convert it to MAF file with cactus-hal2maf.

singularity exec cactus.sif cactus-hal2maf ./jobStore ${hal} ${maf} \ --refGenome REF --noAncestors --chunkSize 1000000 --batchCores 4 \ --filterGapCausingDupes --refSequence ${chr}

NOTE: Although cactus has hal2fasta, but it only extracts original sequences without alignment information.

Next, I attempted to split the MAF file by chromosome using msa_view, but I encountered an error. Other programs I've tried have not been successful either.

The issue seems to be a mismatch between the reference sequences in my MAF file and the original genome:

msa_view cactus_alignment/Chromosome1.maf -f -G 1 --refseq Chromosome1.fasta > Chromosome1.maf.fasta

I received the following error:

Splitting 1 files by target sequence -- ignoring first argument dummy.bed
cactus_alignment/Chromosome1.maf
ERROR: character 'T' at position 108901 of reference sequence does not match character 'a' given in MAF file.

How can I convert my alignment to a multiple FASTA?

Thanks in advance for your help!

MAF cactus • 328 views

ADD COMMENT • link 3 days ago by Mimi33 ▴ 10

score 1 · Accepted Answer · 2024-09-26

I solved the problem by myself.

I set --dupemode single in cactus-hal2maf. all and consensus did not worked (I wonder it would caused by how the program treats self-alignment).
msa_view may prefer to read a chromosome FASTA from hal2fasta to an original one. (I don't know why.) Additionally, gerpcol can only read the oneline FASTA.
gerpcol cannot read extra characters which are not alphabets and "-" (ex. msa_view cannot read "-"). So I replaced * into - (I found gerpcol can read the MAF file if without -a option.)

Set `dupeMode` option as 'single'

The default option is as 'all'

cactus-hal2maf  ./js_${chr} ${hal} ${maf} \
 --refGenome REF --noAncestors --chunkSize 1000000 --batchCores 4 \
 --filterGapCausingDupes --refSequence ${chr} --dupeMode single \
 --rootGenome REF

Extract FASTA from the same hal

hal2fasta  --outFaPath ${chr_fasta} --sequence ${chr} ${hal} REF

Make FASTA per oneline

msa_view ${maf} -f -G 1 --refseq ${chr_fasta} --unmask | seqkit seq -w 0 > ${maf}.fasta

Standardize the gap character to "-"

`gerpcol` can only read alphabets and "-"

sed -e 's/\*/-/g' ${maf}.fasta > ${maf}.gap-replaced.fasta

Set dupeMode option as 'single'

The default option is as 'all'

Extract FASTA from the same hal

Make FASTA per oneline

Standardize the gap character to "-"

gerpcol can only read alphabets and "-"

Set `dupeMode` option as 'single'

`gerpcol` can only read alphabets and "-"