Question

Generating consensus sequence from bam file

7

Entering edit mode

6.3 years ago

chparada ▴ 70

Hi,

I am trying to generate consensus sequence from a bam file obtained after mapping SRA reads to a reference genome.

I used the following commands:

bwa mem ref.fasta SRR_1.fastq SRR_2.fastq > bwa.sam
samtools view -b -F 4 bwa.sam > bwa_aligned.bam
samtools index bwa_aligned.bam

I am not sure how to generate the consensus sequence that I have in mind. In case I don't explain this well. I made a diagram:

===========================================================>ref.fasta
- -- ---- ----      ----- --- --- -      --    ------ ----- 
------ --- ---     -------- --- --       ---- -      --- -->SRR_reads_mapping



==============  +  ================  +   ==================> consensus_sequence.fasta

Please let me know if you have any advice on this.

Cheers!!!

genome samtools bwa fasta • 43k views

ADD COMMENT • link updated 8 months ago by jkbonfield ★ 1.3k • written 6.3 years ago by chparada ▴ 70

0

Entering edit mode

What do you mean by consensus sequence? The most frequent nucleotide in each position? The variants? There are plenty of tools that can give you the reading in each position, try bcftools mpileup for instance.

ADD REPLY • link 6.3 years ago by Asaf 10k

0

Entering edit mode

Thank you for your answers.

I meant obtaining the most frequent nucleotide in each position. From the mapped reads, I want to been able to obtain a consensus sequence in single fasta file. I am going to try bcftools mpileup.

ADD REPLY • link 6.3 years ago by chparada ▴ 70

0

Entering edit mode

see below my answer thanks

ADD REPLY • link 5.4 years ago by vsingh ▴ 100

score 11 · Answer 1 · 2022-09-01

11

Entering edit mode

2.8 years ago

aswinssoman ▴ 110

The lastest version of samtools_V1.16 allows you to generate a consensus from a SAM, BAM or CRAM file based on the contents of the alignment records. The consensus is written either as FASTA, FASTQ, or a pileup oriented format. This is selected using the -f FORMAT option.

Eg:

samtools consensus -f fasta in.bam -o cons.fa

ADD COMMENT • link 2.8 years ago by aswinssoman ▴ 110

0

Entering edit mode

After conversion i got multiple NNNNN in the final fasta file. can anyone tell be why this happens

ADD REPLY • link 9 months ago by skdv2522 • 0

0

Entering edit mode

The samtools manual indicates regarding the --min-depth argument:

The minimum depth required to make a call. Defaults to 1. Failing this depth check will produce consensus "N", or absent if it is an insertion. Note this check is performed after filtering by flags and mapping/base quality.

So it seems you don't have a deep enough coverage for these specific positions with NNNN.

ADD REPLY • link 8 months ago by garcesj ▴ 50

Kevin Blighe · Answer 2 · 2020-01-17

1) Map short reads against reference gene sequence

# Create bowtie2 database
bowtie2-build REFERENCE.fasta REF_DB

# bowtie2 mapping
bowtie2 -x REF_DB -U SAMPLE.fastq --no-unal -S SAMPLE.sam

# samtools:  sort .sam file and convert to .bam file
samtools view -bS SAMPLE.sam | samtools sort - -o SAMPLE.bam

2) Get consensus sequence from .bam file

# Get consensus fastq file
samtools mpileup -uf REFERENCE.fasta SAMPLE.bam | bcftools call -c | vcfutils.pl vcf2fq > SAMPLE_cns.fastq

  # vcfutils.pl is part of bcftools

# Convert .fastq to .fasta and set bases of quality lower than 20 to N
seqtk seq -aQ64 -q20 -n N SAMPLE_cns.fastq > SAMPLE_cns.fasta

score 5 · Answer 3 · 2019-03-06

5

Entering edit mode

6.3 years ago

nsmi8446 ▴ 170

You could call variants (using whatever variant calling software you like, GATK, freebayes etc.) from your .bam file and then use vcf-consensus (http://vcftools.sourceforge.net/perl_module.html#vcf-consensus) to build your consensus sequence. The code below should work:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

ADD COMMENT • link 6.3 years ago by nsmi8446 ▴ 170

1

Entering edit mode

I agree with nsmi8446. This is a nice concise way to solve your problem.

ADD REPLY • link 6.3 years ago by cfos4698 ★ 1.1k

score 3 · Answer 4 · 2019-05-07

3

Entering edit mode

6.1 years ago

waldeyr ▴ 30

samtools mpileup -uf my_reference.fna my_file.bam | bcftools view -cg - | vcfutils.pl vcf2fq > my_consensus.fq

ADD COMMENT • link 6.1 years ago by waldeyr ▴ 30

score 2 · Answer 5 · 2019-05-07

2

Entering edit mode

6.1 years ago

trausch ★ 1.9k

Alfred has a consensus mode that extracts all reads at a given alignment position and then runs a multiple sequence alignment computation with consensus generation. It's primarily for long reads but I think it also works for short reads.

alfred consensus -t ill -f bam -p chr4:500500 input.bam

ADD COMMENT • link 6.1 years ago by trausch ★ 1.9k

score 2 · Answer 6 · 2019-05-12

2

Entering edit mode

6.1 years ago

chen ★ 2.5k

try gencore: https://github.com/OpenGene/gencore

Generate consensus reads to reduce sequencing noises and remove duplications

ADD COMMENT • link 6.1 years ago by chen ★ 2.5k

score 2 · Answer 7 · 2019-05-13

2

Entering edit mode

6.1 years ago

guillaume.rbt ★ 1.0k

you can use GATK FastaAlternateReferenceMaker to generate the consensus sequence based on a SNP calling : https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

ADD COMMENT • link 6.1 years ago by guillaume.rbt ★ 1.0k

score 2 · Answer 8 · 2023-06-20

samtools mpileup -uf reference.fasta sorted_aligned_reads.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fastq

Here's a breakdown of the command:

samtools mpileup: Generates a pileup of aligned reads at each position in the reference genome.

-u: Output in uncompressed BAM format.

-f reference.fasta: Specifies the reference genome in FASTA format.

sorted_aligned_reads.bam: The sorted and indexed BAM file.

bcftools call: Calls the consensus genotype for each position based on the pileup.

-c: Output the consensus genotype.

vcfutils.pl vcf2fq: Converts the consensus genotype in VCF format to FASTQ format, representing the consensus sequence.

consensus.fastq: The output file containing the consensus sequence in FASTQ format.

Please make sure to replace reference.fasta with the filename of your reference genome and sorted_aligned_reads.bam with the appropriate name of your sorted and indexed BAM file.

After running this script, you should obtain the consensus sequence in the consensus.fastq file. If you prefer a FASTA format instead of FASTQ, you can use tools like seqtk or fastq_to_fasta to convert the FASTQ file to FASTA format if needed.

score 2 · Answer 9 · 2023-06-28

2

Entering edit mode

23 months ago

jkbonfield ★ 1.3k

Since this was written, samtools now has a samtools consensus command. See the man page for controlling the coordinate system and whether indels are there.

It's much faster than the bcftools route, but potentially not as accurate in some scenarios (and more accurate in others, particularly with long read data). Something to try anyway.

ADD COMMENT • link 23 months ago by jkbonfield ★ 1.3k

0

Entering edit mode

This is interesting, thank you. Do you have any further information on why it might be more accurate with long read data? And what roughly constitutes 'long' in this case? For example, we have reads derived from amplicon sequencing using ONT that are on average 1.2 kb in length, would you expect a possible improvement with samtools consensus compared to the other approach of SNP calling --> bcftools consensus?

ADD REPLY • link 23 months ago by cfos4698 ★ 1.1k

1

Entering edit mode

A late reply I know. The main difference is the error models associated with long-read technologies - the actual read length doesn't matter really here. Consensus generation assuming most errors are substitution ones isn't so realiable when many newer technologies have indel as the primary error instead.

ADD REPLY • link 20 months ago by jkbonfield ★ 1.3k

0

Entering edit mode

Can anyone tell me why there is NNNNNN= no of gaps present in the final fasta file while converting bam into fasta with samtools consensus command. Thank you

ADD REPLY • link 9 months ago by skdv2522 • 0

0

Entering edit mode

N is produced when the caller isn't confident enough on the base call. This could be a number of reasons, including

No coverage (below min-depth, but it defaults to 1)
Insufficient percentage of bases in agreement (-c option, when using the simple consensus mode). Similarly highest to 2nd highest ratio (-H)
Low quality cutoff for the default bayesian mode. Try -C 0
Automatically adjusted qualities reducing the confidence of the consensus caller, for example due to low mapping quality of alignments or a region with an abnormally high number of discrepancies in alignments (NM tag). This is controlled a myriad of options, such as --no-use-MQ, --no-adj-MQ, etc. See the usage statement.
Incorrect configuration options; try starting with one of the presets such as -x hifi.

Basically, look at the man page and have a play with the options to see what works for you.

ADD REPLY • link 8 months ago by jkbonfield ★ 1.3k