Question

Get the consensus sequence ... (I know, should be easy)

0

Entering edit mode

10.0 years ago

Marvin ▴ 220

Hello,

I have been trying to generate a consensus sequence with samtools/bcftools for a couple of days now and I'm stuck.

First my test data:

Marvin$ cat reference.fa
>myseq
TTTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
TCATCTTTTCCACAAATTGTGGAAAACTTATGTCCACATTGTGGACTCATCTTTTTTCACCTGTGGAAAA
CTTTCTCAATTTATGGTAAAATTTTCTTATTACAATCTTGATAGGAGTACACTATGACAGAAAATGAACA
AATTTTTTGGAATAGGGTTTTAGAACTTGCACAAAGTCAATTAAAGCAAGCGACATTTGATTTTTTCGTT
TCAGATGCTAAATTATTGAAAGTTGAAGGAAATATTGCGACTATCCTTCTTGATGATATGAAAAAATTAT

Marvin$ cat reads.fastq
@read1
TTTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@read2
AAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTATCATCTTTTCCACAAATTGTGGAAAACTTA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@read3
ACAGAATTTATCATCTTTTCCACAAATTGTGGAAAACTTATGTCCACATTGTGGACTCATCTTTTTTCAC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

I have copied parts of the reference to generate these reads. They span a contig across region 0-130 in the reference.

Goal: Reconstruct the consensus sequence of that region (length 130).

My solution/try:

bwa index reference.fa #index the reference
bwa mem reference.fa reads.fastq > aln.sam #map the reads
samtools view -bh aln.sam > aln.bam #convert sam to bam
samtools sort aln.bam aln_sorted #sort the bam, output = aln_sorted.bam
samtools index aln_sorted.bam #don't remember which downstream program needed this
bedtools merge -i aln_sorted.bam #shows me the contigs in the mapping. only 1 present at myseq:0-130

Then I create the vcf file:

samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -v - > my_vcf_file.bgz

Now I index that file with tabix, because earlier I've had an error message which told me that the index was missing:

tabix -p vcf my_vcf_file.bgz

And now I try to finally get the consensus:

bcftools consensus -i my_vcf_file.bgz #option -i to get iupac codes at ambiguous sites

But it doesn't work, it throws the Usage page at me. And it says:

Usage:   bcftools consensus [OPTIONS] <file.vcf>

Actually all it should need is the input file (which I have provided), because OPTIONS are optional right? However if I try it with the parameter -f, I get:

Marvin$ bcftools consensus -i -f reference.fa my_vcf_file.bgz
>myseq
The fasta sequence does not match the REF allele at myseq:1:
   .vcf: [N]
   .vcf: [N] <- (ALT)
   .fa:  [T]TTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA

I don't understand this error and I am not sure if I am moving into the right direction, I find no way out of this maze, please help

bcftools samtools tabix • 5.8k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.0 years ago by Marvin ▴ 220

0

Entering edit mode

You'll want to use bcftools call on the output of mpileup (assuming you're using a recent version of samtools). The -f option is only optional if you want useless output (the documentation should make this clearer).

ADD REPLY • link 10.0 years ago by Devon Ryan 105k

0

Entering edit mode

I managed to get a consensus by now via:

samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -vf reference.fa - | bcftools call -m -Oz -o calls.vcf.gz -
tabix calls.vcf.gz
bcftools consensus -f reference.fa -i calls.vcf.gz

However the output is not what I need. I have a site in the alignment (it's not in the sequences of my original post because I have modified the test reads by now) where I have the following situation:

The reference says T, 2 reads say C and one read says G.

In the consensus I get the IUPAC code Y, which means 'T or C'.

Side question: Why doesn't he say B (IUPAC for 'C or G or T')?

Main question: So he says it's either the reference or the base with the highest support. BUT: Since I am mapping aDNA reads to a modern reference, I trust the signal in the reads more than the reference. I do not want the reference base to be considered when building the consensus. How do I achieve this?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.0 years ago by Marvin ▴ 220

Ram · Answer 1 · 2015-07-07

0

Entering edit mode

10.0 years ago

Lesley Sitter ▴ 610

I'm not familiar with this method of using bcftools to create a consensus sequence. What I did when I needed a consensus was first building a blast database from fasta A, then using blastn to blast fasta B to it. Take output format 6 so that you get a nice tabled format.

Then using some sort of filtering method (I used R, but excel could also work) filter out alignments that have no gaps, an x-amount of mismatches (if you want them to be identical you only take alignments with score 0 in that column) and a very high ID match.

You should then end up with a list that contains contigs names, region that aligns and a several scoring columns like bit-score, P value, etc. You can then easily write a python script that looks for a specific header in one of the fastas and extract the location from the sequence based on the filtered blastn output.

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.0 years ago by Lesley Sitter ▴ 610

0

Entering edit mode

That's going to take forever for NGS-sized datasets.

ADD REPLY • link 10.0 years ago by Devon Ryan 105k

0

Entering edit mode

I did it on demultiplexed reads (so about 2 million reads) against an assembly of 280mb with 70000 contigs and it only takes a couple of hours on 8 threads... and seeing that he has working on this for days, a couple of hours does not seem so bad :P

ADD REPLY • link 10.0 years ago by Lesley Sitter ▴ 610

0

Entering edit mode

A small dataset like yours would take maybe 10 or 15 minutes with bwa/samtools if one uses the commands properly.

ADD REPLY • link 10.0 years ago by Devon Ryan 105k