Question

What is the correct procedure to generate a consensus bacterial sequence?

0

Entering edit mode

14 months ago

marongiu.luigi ▴ 730

Hello,

I would like to create a consensus sequence of a bacterium starting from the fastq files generated with Illumina. I proceeded as follows:

1. Remove poor quality sequences with trimmomatic (SLIDINGWINDOW:4:30)
2. Align with BWA MEM2 (piped with  `samtools view -Sb` and `samtools sort`)
3. Indexed the alignment with `samtools index`
4. Deduplicated the sequence with sambamba markdup
5. Indexed again the alignment with `samtools index`
6. Sorted  with `sambamba sort`
7. Finally created the consensus with `samtools consensus -o <outputfile>.fa -f fasta -a -l 80 <input_aligned-sorted-deduplicated-sortedAgain>.bam

However, a first bad omen is that the header of the consensus is that of the reference genome, which I did not use anywhere else in the process. The second omen is that the dotplot between the consensus and the reference shows 100% identity, whereas there should be some mutations in my sequences.

I think that, along the line, the pipe has simply pasted the reference sequence as the consensus.

Where did I get it wrong and how can I produce a consensus sequence from fastq files?

Thank you

consensus samtools illumina fastq • 2.3k views

ADD COMMENT • link updated 10 months ago by Brian Bushnell 20k • written 14 months ago by marongiu.luigi ▴ 730

1

Entering edit mode

A few comments.

Firstly, using samtools view -Sb | samtools sort is a meme that needs to die! Samtools doesn't need to be told the input is SAM - it autosenses this. Also sort doesn't require BAM, so it's wasted CPU cycles to convert SAM to BAM and then sort the BAM. Just do samtools sort on the SAM and it'll sort and convert to BAM (etc) itself for you.

Secondly, samtools consensus is just using the reference name listed in the BAM file. It's the same name as you gave to bwa. If you wanted something different, give it something different! I'm not sure I understand your question here. It can't just magic up a new name of its own.

Note you may also get better answers, at a cost of CPU, if you align to ref & sort, compute a sample specific consensus, realign to consensus & sort, compute new consensus once more. This will probably bring in more data as any read starting / ending in an insertion will be soft-clipped, and some larger insertions will be absent entirely. Ideally you'd be assembling denovo (and that gives you the consensus direct anyway), but depending on the technology that may not be computationally feasible for you.

I'm also not sure on steps 4, 5 and 6. Samtools markdup can operate without additional sort steps by applying part 1 (fixmate -m) between the aligner output and the first sort step, and part 2 (markdup) after position sorting. This cuts out a huge chunk of CPU and I/O.

ADD REPLY • link 13 months ago by jkbonfield ★ 1.3k

0

Entering edit mode

Is there any reason not to attempt the assembly after step 1? The assembly is a "consensus" of all short reads and can be directly compared to your reference in a dot plot, or by average nucleotide identity as a rough estimate.

ADD REPLY • link 14 months ago by Mensur Dlakic ★ 28k

0

Entering edit mode

What I would do after step 2 is look at and evaluate the resulting BAM alignment file with IGV.

Do you have variants? Are these variants in majority at these positions?

I would also perform a variant calling with bcftools and use the resulting VCF file to generate another type of consensus bcftools consensus where the variants are merged into the reference.

Both of these steps will help you understand what kind of variation is present and why the method may not work as expected.

PS: also make sure that you interpret dotplots correctly. Most dotplots are designed to indicate reorganization events and will not show single base differences.

ADD REPLY • link 14 months ago by Istvan Albert 101k

0

Entering edit mode

Thank you, so what would be the steps? I ran IGV and the read looks properly mapped. But what I shall do after? Even if I make the VCF, what I need is a final fasta file...

ADD REPLY • link 13 months ago by marongiu.luigi ▴ 730

0

Entering edit mode

what I suggested is that you look at variants, can you see clear cut variants in your IGV visualizations?

the results of the consensus calls is a FASTA file

ADD REPLY • link 13 months ago by Istvan Albert 101k

0

Entering edit mode

Thank you. I will try with bcftools consensus. Just to be sure, is the alignment command bwa-mem2 mem -t 8 ${index} ${paired_1}.fq ${paired_2}.fq | samtools view -Sb - | samtools sort -o ${AlnSrt.bam} correct? ${index} is the reference file generated with ${b}bwa-mem2 index -p {index} {index}.fa, and ${paired_x} are the original fastq files,

ADD REPLY • link 13 months ago by marongiu.luigi ▴ 730

0

Entering edit mode

if you already have a BAM that you can view with IGV as stated above then you have performed the alignment correctly so move onto the next step

ADD REPLY • link 13 months ago by Istvan Albert 101k

0

Entering edit mode

I ran bcftools and, visualizing the VCF file I can see that there are some SNP, which I assume means that the alignment was correct and did not simply replicate the reference: enter image description here

ADD REPLY • link 13 months ago by marongiu.luigi ▴ 730

0

Entering edit mode

it also has to be the major allele I believe, so check that too.

but just by looking at it, I think this should work with bcftools consensus, I am going to test samtools consensus myself since I have never run that.

ADD REPLY • link 13 months ago by Istvan Albert 101k

score 3 · Accepted Answer · 2023-10-16

3

Entering edit mode

13 months ago

Brian Bushnell 20k

Personally, I'd do it this way, which is more robust versus indels and allows precise control over which variants are used (via additional parameters for consensus.sh or callvariants.sh):

bbduk.sh in=reads.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5
clumpify.sh in=trimmed.fq.gz out=deduped.fq.gz passes=6 dedupe
#(any other preprocessing desired goes here)
bbmap.sh in=deduped.fq.gz out=mapped.sam.gz ref=ref.fa
consensus.sh in=mapped.sam.gz ref=ref.fa out=consensus.fa

You can alternatively do this:

callvariants.sh in=mapped.sam.gz ref=ref.fa out=vars.vcf
applyvariants.sh in=ref.fa vcf=vars.vcf out=consensus.fa

Both of these give you good control over what depth and allele frequencies are needed to trigger changes; the second one even lets you go through and delete lines from the VCF file if you want to. These are all in the BBTools package.

ADD COMMENT • link 13 months ago by Brian Bushnell 20k

0

Entering edit mode

Thank you, I ran the code (using in/in2 for the paired reads) and obtained the a reference sequence as wanted. If this is a recognized procedure, then I am done. Only thing: is it possible to name the consensus fasta with an option on consensus.sh? Now I get the same name of the reference, which is confusing.

ADD REPLY • link 10 months ago by marongiu.luigi ▴ 730

1

Entering edit mode

Not currently... I could add the option to append "_consensus" to it, I suppose. You can also run "rename.sh in=consensus.fa out=renamed.fa prefix=consensus" to prepend it.

ADD REPLY • link 10 months ago by Brian Bushnell 20k