I have 5 strains of Listeria.
I need to do allele calling and draw trees (a phylogenetic tree and a spanning tree).
For the calling, I used chewBBACA: I ran it on the contigs.fasta files I got from SPAdes, but the output doesn't look right.
What input does chewBBACA need? Do I need to annotate the assemblies first?
And could I generate a tree with PHYLOViZ from the output?
Your output looks good to me. You will find many allele calls that are simple integer values, others written as INF-X (which stands for inferred allele X), and further calls such as LNF, NIPH, NIPHEM, PLOT3, PLOT5, ALM, ASM and LOTSC. There is more information on these in the docs about allele calling.
By running the command chewBBACA.py ExtractCgMLST on your results_alleles.tsv with the parameter --t 0 (t: maximum exclusion threshold), you should obtain a transformed table in which all the missing data are turned into 0 and every INF-X is turned into the corresponding allele X. You can then use PHYLOViZ to produce a tree from that table.
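To make that transformation concrete, here is a minimal Python sketch of the recoding described above. It is not chewBBACA's own code: the names `recode_call` and `recode_profile` are made up for illustration, and treating every non-integer, non-INF call as missing is an assumption based on the description.

```python
import re

# Matches inferred alleles such as "INF-12" and captures the allele number.
INF_PATTERN = re.compile(r"^INF-(\d+)$")


def recode_call(call: str) -> int:
    """Recode one allele call: integers are kept, INF-X becomes X, anything else becomes 0."""
    call = call.strip()
    match = INF_PATTERN.match(call)
    if match:
        return int(match.group(1))
    if call.isdigit():
        return int(call)
    # Assumption: every other code (LNF, NIPH, NIPHEM, PLOT3, ...) is missing data.
    return 0


def recode_profile(calls):
    """Recode all calls of one strain (one row of the allele table)."""
    return [recode_call(c) for c in calls]


print(recode_profile(["3", "INF-7", "LNF", "NIPHEM", "12"]))  # prints [3, 7, 0, 0, 12]
```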
I wrote a little Python script, mlst2dist, which performs these call transformations and computes pairwise Hamming distances (modified with a correction for missing data) to produce dissimilarity matrices in PHYLIP and MEGA formats.
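For readers who want to see what such a distance can look like, below is a minimal sketch. It is not the actual mlst2dist code. It assumes profiles are lists of integers in which 0 means missing, and it uses one possible correction: loci missing in either profile are skipped, and the mismatch count is rescaled to the full number of loci. The function names are hypothetical.

```python
def corrected_hamming(a, b):
    """Hamming distance over loci called in both profiles, rescaled to all loci."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of loci")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == 0 or y == 0:
            continue  # assumption: 0 marks missing data, so the locus is skipped
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        return float("nan")  # no locus in common, distance is undefined
    return mismatches * len(a) / compared


def distance_matrix(profiles):
    """Build a square matrix from a dict mapping strain name to profile."""
    names = list(profiles)
    matrix = [[corrected_hamming(profiles[p], profiles[q]) for q in names] for p in names]
    return names, matrix


def to_phylip(names, matrix):
    """Format a square matrix PHYLIP-style, with names padded or cut to 10 characters."""
    lines = [str(len(names))]
    for name, row in zip(names, matrix):
        lines.append(name.ljust(10)[:10] + " " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines)


profiles = {"strainA": [1, 2, 3, 0], "strainB": [1, 5, 3, 4]}
names, matrix = distance_matrix(profiles)
print(to_phylip(names, matrix))
```

In this toy example the fourth locus is missing in strainA, so only three loci are compared; one of them differs, and the corrected distance is 1 × 4 / 3.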
Is it fine to use as input the contigs that I got from SPAdes (assembly-based allele calling)?
And if I also want to run chewBBACA on raw reads (assembly-free allele calling), how can I get FASTA files from the trimmed end 1 and end 2 FASTQ files?
(What I did: I split the paired-end FASTQ file with sratools and trimmed the reads, so I now have 2 trimmed FASTQ files, one for each end. I can't go on from there, because chewBBACA needs one FASTA file for each strain.)
I hope this is not too confusing. I am so confused myself that it is difficult for me to express things well.
(quoting the wiki docs) "In chewBBACA, schemas are composed of loci defined by CDSs and all the called alleles of a given locus are CDSs as defined by Prodigal".
So it is fine to use a set of assembled genomes as input; I can't see how Prodigal could identify complete CDSs on unassembled reads.
I'd also suggest taking a look at the docs about using a Prodigal training file for your dataset, and using it in the downstream steps of schema creation and allele calling. Because you are working with Listeria, I'd also point you to the existing L.monocytogenes.trn file in the prodigal_training_files directory of the repo. Hth
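If it helps, here is a small sketch for gathering the assemblies into a single folder with one FASTA file per strain. It assumes a hypothetical layout in which each strain has its own SPAdes output directory (for example spades_out/<strain>/contigs.fasta); the directory names and the `collect_assemblies` helper are illustrative only.

```python
import shutil
from pathlib import Path


def collect_assemblies(spades_root, genomes_dir):
    """Copy each <spades_root>/<strain>/contigs.fasta to <genomes_dir>/<strain>.fasta."""
    spades_root = Path(spades_root)
    genomes_dir = Path(genomes_dir)
    genomes_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for contigs in sorted(spades_root.glob("*/contigs.fasta")):
        strain = contigs.parent.name  # assumption: the directory name is the strain name
        target = genomes_dir / f"{strain}.fasta"
        shutil.copyfile(contigs, target)
        copied.append(target)
    return copied
```

Calling `collect_assemblies("spades_out", "genomes")` would then leave one file per strain in the genomes folder.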
Thank you for your help
Thanks a lot, I'll work on it!