I have a couple of isolates of E. coli (K12_Ho and HB101_Ho) that I would like to compare one another and with the reference. I sequenced my isolates with Illumina and processed the fastq files as follows using bbtools:
bbduk.sh in=Raw/K12/k12_1.fq.gz in2=./Raw/K12/k12_2.fq.gz out=Trimmed/k12_1_Trm.fq.gz out2=Trimmed/k12_2_Trm.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 -Xmx200m
clumpify.sh in=Trimmed/k12_1_Trm.fq.gz in2=Trimmed/k12_2_Trm.fq.gz out=Trimmed/k12_1_TrmDed.fq.gz out2=Trimmed/k12_2_TrmDed.fq.gz passes=6 dedupe
# map
bbmap.sh in=Trimmed/k12_1_TrmDed.fq.gz in2=Trimmed/k12_2_TrmDed.fq.gz out=Mapped/k12_TrmDedMap.sam.gz ref=Ref/K12_ref.fa
# consensus
consensus.sh in=Mapped/k12_TrmDedMap.sam.gz ref=Ref/K12_ref.fa out=Consensus/K12_Ho_Cns.fa
sed -i 's/>NZ_CP097881.1/>K-12_Ho/g' Consensus/K12_Ho_Cns.fa
I repeated for the other isolate, merged the fasta of the reference NZ_CP097881 with the consensuses to generate a multifasta file that I then aligned with progressiveMauve and visualized with genoPlotR:
It looks to me that the sequences are exactly the same.
How can I be sure that the consensuses are genuine and I am not simply re-aligning the NZ_CP097881 with another name?
Is the pipeline correct, or have I simply renamed NZ_CP097881 as "K12_Ho_Cns"?
Thank you
Have you checked the options for
consensus.sh
to make sure you are satisfied with the defaults?You could simply assemble the reads on their own and compare the results. You must have enough data since you are using
clumpify
. No reference used.No, I used it as black box. I imagine the main command of the script is
local CMD="java $EA $EOOM $z -Xss8m -cp $CP consensus.ConsensusMaker $@"
, which I left untouched anyway...I was referring to the processing parameter default values:
You can align the reads to each reference and call variants:
callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf
The consensus assembly should not have any variants.
Thanks, for one strain I got:
The problem here is how to process the data. I can see from the vcf file that there are some (relatively--maybe they are those observed in the figure...) large elections (I counted 4838 bp in one instance). The other strain has fewer major deletions:
So the questions are: are the consensus correct or I shild work on the sam files unprocessed?
Thanks