Tutorial:Generating consensus sequence from bam file
3
53
Entering edit mode
5.7 years ago

One of the recurring questions on biostars is "How can I create a consensus sequence from my bam file?" A variation of these question is "How to get fasta out of bam file?".

The rational behind this question is usually, to get the sequence for a given region of interest of my sample, so that one can do a multiple alignment between different samples/species.

The answer to this question is a simple two-step workflow:

  1. Call variants
  2. Include the variants into the reference sequence

Do the variant calling step with your favorite program/workflow, e.g. with bcftools:

$ bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa -Oz -o output.vcf.gz

In the end, one needs a valid, sorted vcf file which is compressed with bgzip and indexed with tabix.

I also recommend to normalize your vcf file (In the above command this is done by bcftools norm).

If your vcf isn't already compressed you can do this with:

$ bgzip -c output.vcf > output.vcf.gz

And indexing is done by:

$ tabix output.vcf.gz

Now we are ready to create our consensus sequence. The tool of choice is bcftools consensus.

If one like to create a complete new reference genome, based on the called variants, this is done by:

$ bcftools consensus -f ref.fa output.vcf.gz > out.fa

If you are interested in only a given region:

$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus output.vcf.gz -o out.fa

You can also create consensus sequences for multiple regions, if you provide a bed file to samtools faidx:

$ samtools faidx -r regions.bed ref.fa  | bcftools consensus output.vcf.gz -o out.fa

Have fun :)

fin swimmer

Edit by GenoMax: Added a missing - in first bcftools norm command.

fasta consensus bam • 17k views
ADD COMMENT
1
Entering edit mode

Thanks for this! I have a question about tweaking the commands to have the consensus sequence report a "wobble" base in cases of heterozygosity, for example an R would be reported where a SNP is heterozygous for G/A. Do you think this is possible to implement in your pipeline? I'm writing in the context of healthy somatic human genomes (diploids).

ADD REPLY
0
Entering edit mode

According to the manual this should be possible with the -I argument:

-I, --iupac-codes
   output variants in the form of IUPAC ambiguity codes
ADD REPLY
1
Entering edit mode

This seems like a very long-winded solution and doesn't apply if you don't have a reference to hand.

However that said, if you don't then your aligned BAM was probably aligned against a de-novo assembly and somewhere there should be a consensus lurking around.

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file. I should probably write such a tool and add it to samtools as it's a recurring question and can be done MUCH faster than needing to use bcftools.

ADD REPLY
1
Entering edit mode

Hello jkbonfield ,

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file.

  • How can you produce a mpileup if you don't have a reference?
  • Trivial counting might work if you have perfect reads and simple regions. But finding out what is the real sequence for your sample is much more that just counting and hence you are doing variant calling.

I should probably write such a tool and add it to samtools.

I would always recommend if there is a widely used tool to use it instead of writing your own. It is very likely that one introduce much more errors by writing its own tool. If I'm right, you are one of the active contributors to samtools? So this might be an exception.

fin swimmer

ADD REPLY
1
Entering edit mode

"samtools mpileup foo.bam" will produce the columns showing what bases are aligned at each position. The reference column will be N, but that's irrelevant if you want the consensus (unless you're planning on using imputation to use reference in zero coverage regions, which can be useful in some situations).

As for not writing my own software - well that's what I do for a living (and yes much of it ends up in htslib and samtools). :-) There may well be tools out there already to generate consensus; bound to be infact, including my own hacky awk and perl one-liners, but having everything in the same package is nice for usability and discoverability.

ADD REPLY
1
Entering edit mode

Thank you for this! I am trying to get the mtGenome sequence for my samples to do a MSA, but maybe I'm not understanding what the output for this should look like. When I try to get the consensus for my chrM (mtGenome), the fasta file only contains 1 sequence. I thought it was supposed to get the mtGenome sequences of all my samples in a single fasta file, right? I don't know what I'm doing wrong :(

ADD REPLY
2
Entering edit mode

As the title suggests you get a consensus i.e. one sequence. When compared to the reference you started with this consensus will have SNP's where there are differences between your data and the reference.

You will need to generate a consensus independently for each sample if you have more than one sample.

ADD REPLY
0
Entering edit mode

Oh, since I never saw the -sample flag with the instructions I've seen around, I assumed that it was able to deal with a multi sample VCF without it and it would do for all the samples! Thanks for clarifying that.

ADD REPLY
2
Entering edit mode

It may be possible to do so but the example shown in tutorial is for a single sample VCF.

ADD REPLY
1
Entering edit mode

Hi @zjamal.msbi19rcms, thanks. I would like to double check what the Oz does in the bcftools norm -f ref.fa Oz -o output.vcf.gz part. It's saying "no such file Oz", so isn't it meant to be a flag? as in -Oz?

ADD REPLY
2
Entering edit mode

Use a - before Oz. Looks like it s missing in the command above. We will confirm and update. Update made.

ADD REPLY
0
Entering edit mode

HI GenoMax oddly now when using the -Oz I am getting the error:

[mpileup] 1 samples in 1 input files
[E::fai_build3] Failed to open the FASTA file -Oz
Failed to load the fai index: -Oz

Why does it think -Oz is fasta file?

EDIT: Ah, I think after the -f there needs to be the ref.fa, since the -f flag requires a FASTA file after it

ADD REPLY
1
Entering edit mode

Have you indexed your fasta reference file?

ADD REPLY
0
Entering edit mode

Hey GenoMax , it worked after I just put the ref.fa after the -Oz flag

ADD REPLY
0
Entering edit mode

First of all thank you very much!! I'm having trouble with indexing my fasta file and I'm sure I have done something wrong because when I run mpileup I get this message "[E::fai_fetch] fai_fetch failed : unexpected end of file" or "[fai_fetch_seq] The sequence "CHRII" " not found. I do see that a .fai file is created when I run the command. From what I see, the reference genome has to be in fasta format, right? Does anyone know what could be happening? Thanks for any advice!!

ADD REPLY
0
Entering edit mode

Are you referring to the consensus fasta file that was produced or original reference? unexpected end of file would suggest that you may have a corrupt/truncated file.

ADD REPLY
0
Entering edit mode

The original reference

ADD REPLY
0
Entering edit mode

You can create an index by doing samtools faidx original.fa. If that generates an error then your fasta file may have some formatting issues.

ADD REPLY
0
Entering edit mode

Yes, I think my fasta has a problem (maybe not the same line size?) because I have tried with a different reference sequence and everything works. Thank you very much!!

ADD REPLY
0
Entering edit mode

Hi, I've managed to solve this problem and everything works right now. However, no variants apply when I run mpileup and consensus when I know that the strain I use has them. I think I'm missing something about how all of this works.

I explain it: when I align my sequences from NGS using bowtie (and using, let's say "reference genome 1"), I get a bam file in which I call variants using mpileup (and using reference genome 1). When running consensus, I also use reference genome 1. When doing this, no variants apply. However, if I "mix" reference genomes (i.e. using "reference genome 2" (from a totally different strain) when aligning and "reference genome 1" by variant calling, more than 2000 variants apply).

I see no point in "mixing" genomes so that variants apply, but I also cannot understand why option 1 doesn't work. Is there something I'm missing? Thank you very much!!

ADD REPLY
0
Entering edit mode

Hi finsswimmer ,

Thank you for this workflow. However, I am still unclear for the steps I should follow and has been unable to find an answer.

Question: I would like to generate a sequence for a gene of interest (means consensus of reads) from 10 bam files, which will give me 10 fasta files or all in 1 file with header: gene_bam1, gene_bam2......etc.

Tried:

  1. Installation of BamBam to use Bam2consensus, but did not work out as it had not been updated since 2016.
  2. Still most people use samtools mpileup but while using it, this comes: Note that using "samtools mpileup" to generate BCF or VCF files is now deprecated. To output these formats, please use "bcftools mpileup" instead.
  3. I am unsure on commands for bcftools as this http://samtools.github.io/bcftools/howtos/consensus-sequence.html has not been helpful in case of extracting gene info.

Overall, would you suggest the workflow for my question considering latest info/parameters.

Thanks.

ADD REPLY
0
Entering edit mode

The method described above will give you 10 fasta files since the method is for one sample.

What are your BAM files from? Entire genome (preferable) or just the gene (not recommended, if sequencing was for whole genome). While you could subset your BAM'ss to get gene region you are interested in it may be simpler to just do the consensus for the genome and then fish out the region you need from consensus.

ADD REPLY
0
Entering edit mode

@GenoMax BAM files are from RNASeq. I would like to extract the sequence of few genes, translate to proteins and compare among multiple samples.

ADD REPLY
0
Entering edit mode

I tried bcftools consensus and IGV "copy consensus" option. It seems IGV gives the required output while bcftools give consensus while considering both sample and reference, so almost same for all samples for a gene.

ADD REPLY
3
Entering edit mode
8 months ago

You can also generate consensus sequences directly from sam/bam files without calling variants first, and it is more accurate, particularly with homopolymer indels and technologies like Nanopore that incur false variable-length ones. Using BBTools:

consensus.sh in=mapped.sam ref=ref.fa out=consensus.fa

This approach builds paths through the reference according to the depth of aligned reads showing variants, and outputs the most common path. So, for example, if at some position there was both a SNP and a deletion, it would choose whichever event was most common... but more importantly, if you have 100 reads where 10% of them indicate you have 7 A's, 20% indicate 8 A's, 20% indicate 9 A's, 20% indicate 10 A's, 25% of them indicate 11 A's, and 5% indicate 12 A's, a variant-caller may not even call a variant in the first place. But if it does, it might call the most common (which is 11 A's) or if you set the threshold low enough, it would call 6 different overlapping variants.

BBTools uses the most common path through an indel, which in this case would be 9 A's since that's the 50th percentile of paths through the region. That doesn't mean it is correct, but I think it's a much better estimate than simply applying called variants.

ADD COMMENT
2
Entering edit mode
8 months ago
jkbonfield ★ 1.3k

For completeness, I'll also add that samtools consensus now exists as a sub-command (http://www.htslib.org/doc/samtools-consensus.html) which operates directly on the BAM file without a bcftools step, which makes it considerably faster. This is a true consensus, originally using an algorithm from the Staden Package (gap4 and gap5). It has options to report heterozygous bases too if desired, and options for controlling whether indels should be part of the consensus or not.

The reason for "or not" is to provide a better reference to realign against to improve alignments, but without changing coordinate systems. It can also be used as a better reference for reference-based compression in e.g. CRAM, although htslib can do that on-the-fly too with the embed_ref=2 option.

ADD COMMENT
0
Entering edit mode

Hello, do you know whether samtools consensus can output two consensus sequences for diploid, rather than one consensus with heterozygous bases? Thanks!

ADD REPLY
1
Entering edit mode
8 months ago
Michael 55k

Despite or rather because of all the upvotes I feel compelled to state very serious caveats with this work-flow. It is so problematic that I think it shouldn't be used at all IMO. I am uncertain how many people have fallen for using it without being critical about what it does.

Highly noisy variant data

The workflow described is now outdated and includes no variant filtration. Proper variant calling includes filtering steps. Without it the result is going to be very noisy. Also, there are more accurate tools for variant calling today.

The "consensus" is not a consensus

When reading the manual for bcftools consensus, it will become clear that the outcome is not a consensus (aka. agreement) but quite the opposite: An ALT-genome is a genome with the ALTernative allele inserted at any variable site irrespective of quality coverage or allele frequency. Imagine a decision based on a majority vote (consensus) versus a system where a single vote determines decision-making if it is just deviating enough from the mainstream (ALT-genome). Both may be useful, but you can imagine that the outcomes will be very different.

Conclusion

The combination of lack of sub-par variant calling, lack of filtration, and introduction of ALT-alleles into the reference produces very noisy data. Calling the function "consensus" is a misnomer that could be considered a bug in bcftools. The results produced are not comparable to a consensus-based genome assembly

ADD COMMENT
0
Entering edit mode

Thank you; So what do you recommend if i have to draw a phylogenetic tree of some bacterial strain? how do you get the multiple alignment input from the vcf file if you don't execute 'consensus;?

ADD REPLY
0
Entering edit mode

You have several options:

  • generate de novo genome assemblies and align them
  • Use samtools consensus (this should be more robust)
  • Use a robust variant calling approach and use bcftools "consensus" with specifying which "haplotype" (ALT or REF) to take
  • Filter and extract only the variants, possibly also applying LD pruning (user ascertainment correction if only using invariant positions) and removing variants under positive selection.

Possibly try different options. As long as you don't mix different approaches thinking they are the same you should be fine.

ADD REPLY
0
Entering edit mode

Thank you Michael The only problem is that I ended with a very large multi fasta file (after running bcftools consensus on each sample and concate them in a same file) when I need to align them It took one hour to make alignement with mafft (and Mega crash). But I don't think there is an another way to do that

ADD REPLY
0
Entering edit mode

Some reference-based methods don't need to be aligned because they are already. That should include samtools consensus and bcftools consensus methods, at least if they only include SNPs. If you have de-novo assemblies instead, you could try NASP to make a SNP matrix. If mafft works in you case, that is fine of course.

ADD REPLY
0
Entering edit mode

Thank you very much

ADD REPLY

Login before adding your answer.

Traffic: 2362 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6