Hi,
This question is a possible duplicate of: Building a pangenome consensus from many individuals. I'm reposting it since I have run into a similar issue.
Background: We observe high variability in antibiotic resistance phenotype among different isolates of one bacterial species. To identify the genetic component underlying this variability, I'm running a GWAS analysis (using DBGWAS) on ~900 isolates with their genome sequences. The GWAS output gives me ~8000 short k-mer sequences (30-70 nt) that point towards genes/regions in the input draft genomes that could be responsible for the phenotype. Usually the next step is to manually filter the output and find significant candidate genes, but this is often quite difficult to do by hand. One workaround I have found is to align all 8000 k-mer sequences to the reference genome of the bacterium using bowtie. This works quite well, and I can visualise which genomic regions most of the k-mers are concentrated in.
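For reference, the "where are the k-mers concentrated" step can be sketched roughly as below. This is only a minimal sketch, assuming the bowtie alignments were written in SAM format; the function names (`kmer_hits_per_window`, `top_windows`) and the 1000 bp window size are illustrative choices of mine, not part of DBGWAS or bowtie.

```python
from collections import Counter


def kmer_hits_per_window(sam_lines, window=1000):
    """Count aligned k-mers per fixed-size window of each reference sequence.

    sam_lines: iterable of SAM-formatted text lines (header lines allowed).
    window:    window size in bp (hypothetical default, tune as needed).
    Returns a Counter keyed by (reference_name, window_start_0_based).
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        flag = int(fields[1])
        # FLAG bit 0x4 marks an unmapped query; ignore those k-mers.
        if flag & 4:
            continue
        ref_name = fields[2]
        pos = int(fields[3])  # SAM POS is 1-based, leftmost aligned base
        window_start = (pos - 1) // window * window
        counts[(ref_name, window_start)] += 1
    return counts


def top_windows(counts, n=10):
    """Return the n windows with the most aligned k-mers."""
    return counts.most_common(n)
```

With a real file this would be called as `kmer_hits_per_window(open("kmers.sam"))`, where `kmers.sam` is a placeholder name, and the top windows can then be checked against the annotation of the reference.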
Problem: Since the gene/region responsible for the phenotype wouldn't be present in all the isolates, I am having difficulty choosing a reference to align all the k-mers to. Right now I'm using Roary to build a pangenome out of all the draft genomes. As far as I understand, this would just result in a core gene set + accessory gene set (or pan-proteome); the problem is that I think this would lose all the SNP variation and the intergenic regions.
Is there a better tool to condense multiple genomes together, removing duplicate genes but maintaining variant genes, intergenic regions etc., and output a reference genome containing all the variant information?
Thank you!
Maybe you can build a pan-genome graph with Pandora.
This looks interesting. I understand it makes de Bruijn graphs for the pan-genome. But do you know what kind of format a pan-genome graph is? Would it be a .bam file or a multi-FASTA? I'm just wondering whether, practically, I could use bowtie to map k-mers to the pan-genome graph sequence file.
I have never tried this before, but looking at the documentation, the reference graph looks like a multi-FASTA file of a specific genomic region shared by all the strains used to build the graph: link. This graph is later used in Pandora to output a VCF file with all the variants detected in the graph.
I must say that this pipeline looks quite complex, so I would try the tutorial first.