Question

How to assure that sequence consensus is genuine?

0

Entering edit mode

8 months ago

marongiu.luigi ▴ 730

I have a couple of isolates of E. coli (K12_Ho and HB101_Ho) that I would like to compare one another and with the reference. I sequenced my isolates with Illumina and processed the fastq files as follows using bbtools:

bbduk.sh in=Raw/K12/k12_1.fq.gz in2=./Raw/K12/k12_2.fq.gz out=Trimmed/k12_1_Trm.fq.gz out2=Trimmed/k12_2_Trm.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 -Xmx200m
clumpify.sh in=Trimmed/k12_1_Trm.fq.gz in2=Trimmed/k12_2_Trm.fq.gz out=Trimmed/k12_1_TrmDed.fq.gz out2=Trimmed/k12_2_TrmDed.fq.gz passes=6 dedupe
# map
bbmap.sh in=Trimmed/k12_1_TrmDed.fq.gz in2=Trimmed/k12_2_TrmDed.fq.gz out=Mapped/k12_TrmDedMap.sam.gz ref=Ref/K12_ref.fa
# consensus
consensus.sh in=Mapped/k12_TrmDedMap.sam.gz ref=Ref/K12_ref.fa out=Consensus/K12_Ho_Cns.fa
sed -i 's/>NZ_CP097881.1/>K-12_Ho/g' Consensus/K12_Ho_Cns.fa

I repeated for the other isolate, merged the fasta of the reference NZ_CP097881 with the consensuses to generate a multifasta file that I then aligned with progressiveMauve and visualized with genoPlotR:

enter image description here

It looks to me that the sequences are exactly the same.

How can I be sure that the consensuses are genuine and I am not simply re-aligning the NZ_CP097881 with another name?

Is the pipeline correct, or have I simply renamed NZ_CP097881 as "K12_Ho_Cns"?

Thank you

genome • 778 views

ADD COMMENT • link 8 months ago by marongiu.luigi ▴ 730

0

Entering edit mode

Have you checked the options for consensus.sh to make sure you are satisfied with the defaults?

You could simply assemble the reads on their own and compare the results. You must have enough data since you are using clumpify. No reference used.

ADD REPLY • link 8 months ago by GenoMax 148k

0

Entering edit mode

No, I used it as black box. I imagine the main command of the script is local CMD="java $EA $EOOM $z -Xss8m -cp $CP consensus.ConsensusMaker $@", which I left untouched anyway...

ADD REPLY • link 8 months ago by marongiu.luigi ▴ 730

0

Entering edit mode

I was referring to the processing parameter default values:

Processing parameters:
mindepth=2      Do not change to alleles present at depth below this.
mafsub=0.25     Do not incorporate substitutions below this allele fraction.
mafdel=0.50     Do not incorporate deletions below this allele fraction.
mafins=0.50     Do not incorporate insertions below this allele fraction.
mafn=0.40       Do not change Ns (noref) to calls below this allele fraction.
usemapq=f       Include mapq as a positive factor in edge weight.
nonly=f         Only change Ns to different bases.
noindels=f      Don't allow indels.
ceiling=        If set, alignments will be weighted by their inverse identity.
                For example, at ceiling=105, a read with 96% identity will get
                bonus weight of 105-96=9 while a read with 70% identity will
                get 105-70=35.  This favors low-identity reads.

ADD REPLY • link 8 months ago by GenoMax 148k

0

Entering edit mode

You can align the reads to each reference and call variants:

callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf

The consensus assembly should not have any variants.

ADD REPLY • link 8 months ago by Brian Bushnell 20k

0

Entering edit mode

Thanks, for one strain I got:

Type            Count   Rate    AD  Depth   AF  Score   Qual
Substitutions:  1072    98.0%   134.4   141.2   0.941   41.7    32.9
Deletions:      17  1.6%    110.3   125.3   0.880   31.9    33.5
Insertions:     5   0.5%    129.8   191.8   0.729   23.2    34.4
Variation Rate: 1/4213
Homozygous:     1094    100.0%

The problem here is how to process the data. I can see from the vcf file that there are some (relatively--maybe they are those observed in the figure...) large elections (I counted 4838 bp in one instance). The other strain has fewer major deletions:

Type            Count   Rate    AD  Depth   AF  Score   Qual
Substitutions:  75  90.4%   322.1   386.2   0.903   40.6    34.4
Deletions:      6   7.2%    392.5   410.7   0.892   47.0    35.2
Insertions:     2   2.4%    328.0   392.0   0.740   36.0    34.0
Variation Rate: 1/55532
Homozygous:     83  100.0%

So the questions are: are the consensus correct or I shild work on the sam files unprocessed?

how can I handle these data? VCF is good but how to handle the data? How do I know that the election reported are the same in both strains? I can't go manually one by one... How can I visualize all this?

Thanks

ADD REPLY • link 8 months ago by marongiu.luigi ▴ 730