PanSN naming issue in pggb fastas
3 days ago
Whirlingdaf ▴ 60

I am having an issue with the PanSN naming for pggb fastas and would love thoughts!

I am running:

fastix -p "NP117#hap#" 117_Assembly.fasta > 117_shasta_fastix_output.fasta

But I then run into naming issues with:

pggb -i combined_aligned_contigs.fasta.gz -o pangenome

[pggb] warning: there are sequence names (like 'NP117#hap#178') that do not match the Pangenome Sequence Naming (PanSN).
[pggb] ERROR: -n/--n-haplotypes must be greater than or equal to 1 when the Pangenome Sequence Naming (PanSN) is not respected.

I'm unsure what the issue is, since the fastix example is:

fastix -p "gen#1#" genome.fa > genome_prefixed.fa

Any thoughts?

PanSN pggb • 444 views
2 days ago
cmdcolin ★ 4.0k

It looks like the naming scheme pggb wants has an integer between the two # characters, so 1 is allowed but hap is not.

the regex it checks appears to be ^([^#]+#)[0-9]+#[^#]+$ (from https://github.com/pangenome/pggb/blob/cc332526727a9cc99f4194ef47212e7c06175106/pggb#L312C20-L312C42)

Gemini explains the regex as follows:

This regex aims to match strings that:

  • Start with one or more characters other than "#", followed by a single "#".
  • Continue with one or more digits.
  • Are followed by another single "#".
  • End with one or more characters other than "#".

this confirms what https://github.com/pangenome/PanSN-spec says: the haplotype id is the middle part which is a number
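
To try the pattern quoted above against individual names, here is a minimal sketch using grep -E. The helper name check_pansn is made up for illustration and is not part of pggb. The name NP117#1#178 assumes haplotype number 1, which may not be the right number for your data.

```shell
# Pattern copied from the pggb check quoted above.
pansn_regex='^([^#]+#)[0-9]+#[^#]+$'

# Hypothetical helper (not part of pggb): prints "ok" if the name
# matches the pattern, "bad" otherwise.
check_pansn() {
  if printf '%s\n' "$1" | grep -Eq "$pansn_regex"; then
    echo "ok"
  else
    echo "bad"
  fi
}

check_pansn 'NP117#hap#178'   # prints "bad": "hap" is not an integer
check_pansn 'NP117#1#178'     # prints "ok": integer haplotype field
```

If that reading is right, changing the fastix prefix from "NP117#hap#" to something like "NP117#1#" should produce names that pass the check.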

2 days ago

Exactly what cmdcolin said.

You can check your fasta headers using

samtools faidx 117_Assembly.fasta

cat 117_Assembly.fasta.fai
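
Rather than reading the index by eye, you could filter it for names that fail the pattern. This is a rough sketch: list_non_pansn is a hypothetical helper name, the regex is the one quoted from pggb earlier in the thread, and the demo index is made-up data in the tab-separated .fai layout (name in the first column).

```shell
# Hypothetical helper: print sequence names from a .fai index
# (first tab-separated column) that do NOT match the PanSN pattern.
list_non_pansn() {
  cut -f1 "$1" | grep -Ev '^([^#]+#)[0-9]+#[^#]+$'
}

# Demo on a tiny made-up index: one bad name, one good name.
demo_fai=$(mktemp)
printf 'NP117#hap#1\t100\t13\t60\t61\nNP117#1#2\t50\t130\t60\t61\n' > "$demo_fai"
list_non_pansn "$demo_fai"   # prints NP117#hap#1
rm -f "$demo_fai"
```

On real data you would point it at the index of the file that actually goes into pggb, for example the .fai built from the fastix output. No output means every name matched.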