Question

Bulding a pangenome consensus from many individuals

2

Entering edit mode

9.3 years ago

susan.klein ▴ 80

Hi,

I need to make a pangenome as a reference to compare 15 closely related bacterial strains. I have 15 assemblies with 20-30 contigs each. I can easily compare all of them to a single reference using any number of programs but unless the reference contains all the mobile element variants and phage, information is lost. So my plan is to make a pseudo 'pangenome' (I know its not a pangenome in the traditional sense), which contains all the variations in roughly the correct coordinates. I have tried assembling all the contigs and this works to some extent but again excludes some elements. I have tried making an ordered pangenome of cds similar to the approach of PanOCT but this too sticks much of the variants onto the end of the genome and makes comparisons of the variable loci impossible. Perhaps this is not really possible but I thought that something must exist to align and make a consensus from multiple genomes using a reference completed genome (several of which I have too).

Thanks if anyone knows of an approach to do this.

genome alignment • 3.3k views

ADD COMMENT • link updated 3.1 years ago by Dr. Jason • 0 • written 9.3 years ago by susan.klein ▴ 80

0

Entering edit mode

Hi ! @susan I was looking around for essentially the same thing. Did you find a tool to do this?

ADD REPLY • link 3.1 years ago by Dr. Jason • 0

Ram · Answer 1 · 2016-01-09

0

Entering edit mode

9.3 years ago

dago ★ 2.8k

I must say that I did understand completely your problem.

However, if you have 15 assemblies (meaning 15 draft genomes) I would suggest you to annotate them and then run a program for pan/core genomes analysis. You could use PROKKA, for annotation it is quite fast, or RAST.

After that you can use a program for calculating the orthologues gene families in your genomes, and consequently the pan-genome. There are other discussion on this topic: A: Pan- and core- genome in bacteria

I personally use get_homologues

If you than want to align your genomes to a reference you can use something like MAUVE, but there are tons of tools for this

Hope this helps.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by dago ★ 2.8k

0

Entering edit mode

Hi,

the problem is not a classical pangenome one like you describe. I have already done that with no problems. I want to be able to make a pseudo-genome that contains all large indel variants. I hoped I could do this by aligning and obtaining the consensus with something like cactus but I now see that its just not a practical thing to do.

Thanks anyway.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by susan.klein ▴ 80

Ram · Answer 2 · 2016-01-29

Hey.

Not sure if I understand what you are trying to accomplish. Are you trying to build a reference genome containing all the "pan-genomic" variation, so you can use that as a baseline for describing genomic variation?

If so, you can achieve this by:

Running get_homologues on the dataset as suggested by dago. (I advise using both -M and -G to intersect later, and also consider using -t 2 or 3 to remove singletons, some of which may be artifacial)
Using the downstream tool "compare_clusters.pl" included in get_homologues to produce a pan-intersection syntax: compare_clusters.pl -d (dir to MCL),(dir to COG) -m -o (output). You can use the flag -n if you want nucleotides instead of amino acids.
Read in each individual .fna/.faa file from 2), align them, and produce a consensus from the alignment. This can be performed using an R-script with the packages "DECIPHER" and "BioStrings", which have functions for aligning nucleotides/amino acids and for producing a consensus sequence from alignments.

R-script could be a loop that looks something like this (probably better ways of doing it). The example is for .fna (using -n in compare_clusters.pl). Script should (could) work if you have the packages installed and set the working directory to the output made by compare_clusters.pl

library("DECIPHER")
library("BioStrings")

library("seqinr") #For write.fasta

files <- list.files(pattern = ".fna")

Consensus.Seq.Vector <- NULL
for(a in 1:length(files)) {
  Seq <- readDNAStringSet(files[a])
  Aligned.seq <- AlignSeqs(Seq)
  Consensus.Seq.Vector[a] <- paste(consensusString(Aligned.seq, ambiguityMap="?", threshold=0.7))
}
Consensus.Seq.String <- paste(Consensus.Seq.Vector, sep="", collapse="")
write.fasta(sequences = Consensus.Seq.String, file.out="ConsensusGenome.fasta", open="a")

Hope this helps.