Hello,
I would like to create a consensus sequence of a bacterium starting from the fastq files generated with Illumina. I proceeded as follows:
1. Remove poor quality sequences with trimmomatic (SLIDINGWINDOW:4:30)
2. Align with BWA MEM2 (piped with `samtools view -Sb` and `samtools sort`)
3. Indexed the alignment with `samtools index`
4. Deduplicated the sequence with sambamba markdup
5. Indexed again the alignment with `samtools index`
6. Sorted with `sambamba sort`
7. Finally created the consensus with `samtools consensus -o <outputfile>.fa -f fasta -a -l 80 <input_aligned-sorted-deduplicated-sortedAgain>.bam
However, a first bad omen is that the header of the consensus is that of the reference genome, which I did not use anywhere else in the process. The second omen is that the dotplot between the consensus and the reference shows 100% identity, whereas there should be some mutations in my sequences.
I think that, along the line, the pipe has simply pasted the reference sequence as the consensus.
Where did I get it wrong and how can I produce a consensus sequence from fastq files?
Thank you
A few comments.
Firstly, using
samtools view -Sb | samtools sort
is a meme that needs to die! Samtools doesn't need to be told the input is SAM - it autosenses this. Also sort doesn't require BAM, so it's wasted CPU cycles to convert SAM to BAM and then sort the BAM. Just dosamtools sort
on the SAM and it'll sort and convert to BAM (etc) itself for you.Secondly,
samtools consensus
is just using the reference name listed in the BAM file. It's the same name as you gave to bwa. If you wanted something different, give it something different! I'm not sure I understand your question here. It can't just magic up a new name of its own.Note you may also get better answers, at a cost of CPU, if you align to ref & sort, compute a sample specific consensus, realign to consensus & sort, compute new consensus once more. This will probably bring in more data as any read starting / ending in an insertion will be soft-clipped, and some larger insertions will be absent entirely. Ideally you'd be assembling denovo (and that gives you the consensus direct anyway), but depending on the technology that may not be computationally feasible for you.
I'm also not sure on steps 4, 5 and 6. Samtools markdup can operate without additional sort steps by applying part 1 (
fixmate -m
) between the aligner output and the first sort step, and part 2 (markdup
) after position sorting. This cuts out a huge chunk of CPU and I/O.Is there any reason not to attempt the assembly after step 1? The assembly is a "consensus" of all short reads and can be directly compared to your reference in a dot plot, or by average nucleotide identity as a rough estimate.
What I would do after step 2 is look at and evaluate the resulting BAM alignment file with IGV.
Do you have variants? Are these variants in majority at these positions?
I would also perform a variant calling with bcftools and use the resulting VCF file to generate another type of consensus
bcftools consensus
where the variants are merged into the reference.Both of these steps will help you understand what kind of variation is present and why the method may not work as expected.
PS: also make sure that you interpret dotplots correctly. Most dotplots are designed to indicate reorganization events and will not show single base differences.
Thank you, so what would be the steps? I ran IGV and the read looks properly mapped. But what I shall do after? Even if I make the VCF, what I need is a final fasta file...
what I suggested is that you look at variants, can you see clear cut variants in your IGV visualizations?
the results of the consensus calls is a FASTA file
Thank you. I will try with
bcftools consensus
. Just to be sure, is the alignment commandbwa-mem2 mem -t 8 ${index} ${paired_1}.fq ${paired_2}.fq | samtools view -Sb - | samtools sort -o ${AlnSrt.bam}
correct?${index}
is the reference file generated with${b}bwa-mem2 index -p {index} {index}.fa
, and${paired_x}
are the original fastq files,if you already have a BAM that you can view with IGV as stated above then you have performed the alignment correctly so move onto the next step
I ran bcftools and, visualizing the VCF file I can see that there are some SNP, which I assume means that the alignment was correct and did not simply replicate the reference:
it also has to be the major allele I believe, so check that too.
but just by looking at it, I think this should work with bcftools consensus, I am going to test samtools consensus myself since I have never run that.