Get the consensus sequence ... (I know, should be easy)
1
0
Entering edit mode
9.5 years ago
Marvin ▴ 220

Hello,

I have been trying to generate a consensus sequence with samtools/bcftools for a couple of days now and I'm stuck.

First my test data:

Marvin$ cat reference.fa
>myseq
TTTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
TCATCTTTTCCACAAATTGTGGAAAACTTATGTCCACATTGTGGACTCATCTTTTTTCACCTGTGGAAAA
CTTTCTCAATTTATGGTAAAATTTTCTTATTACAATCTTGATAGGAGTACACTATGACAGAAAATGAACA
AATTTTTTGGAATAGGGTTTTAGAACTTGCACAAAGTCAATTAAAGCAAGCGACATTTGATTTTTTCGTT
TCAGATGCTAAATTATTGAAAGTTGAAGGAAATATTGCGACTATCCTTCTTGATGATATGAAAAAATTAT

Marvin$ cat reads.fastq
@read1
TTTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@read2
AAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTATCATCTTTTCCACAAATTGTGGAAAACTTA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@read3
ACAGAATTTATCATCTTTTCCACAAATTGTGGAAAACTTATGTCCACATTGTGGACTCATCTTTTTTCAC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

I have copied parts of the reference to generate these reads. They span a contig across region 0-130 in the reference.

Goal: Reconstruct the consensus sequence of that region (length 130).

My solution/try:

bwa index reference.fa #index the reference
bwa mem reference.fa reads.fastq > aln.sam #map the reads
samtools view -bh aln.sam > aln.bam #convert sam to bam
samtools sort aln.bam aln_sorted #sort the bam, output = aln_sorted.bam
samtools index aln_sorted.bam #don't remember which downstream program needed this
bedtools merge -i aln_sorted.bam #shows me the contigs in the mapping. only 1 present at myseq:0-130

Then I create the vcf file:

samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -v - > my_vcf_file.bgz

Now I index that file with tabix, because earlier I've had an error message which told me that the index was missing:

tabix -p vcf my_vcf_file.bgz

And now I try to finally get the consensus:

bcftools consensus -i my_vcf_file.bgz #option -i to get iupac codes at ambiguous sites

But it doesn't work, it throws the Usage page at me. And it says:

Usage:   bcftools consensus [OPTIONS] <file.vcf>

Actually all it should need is the input file (which I have provided), because OPTIONS are optional right? However if I try it with the parameter -f, I get:

Marvin$ bcftools consensus -i -f reference.fa my_vcf_file.bgz
>myseq
The fasta sequence does not match the REF allele at myseq:1:
   .vcf: [N]
   .vcf: [N] <- (ALT)
   .fa:  [T]TTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA

I don't understand this error and I am not sure if I am moving into the right direction, I find no way out of this maze, please help

bcftools samtools tabix • 5.3k views
ADD COMMENT
0
Entering edit mode

You'll want to use bcftools call on the output of mpileup (assuming you're using a recent version of samtools). The -f option is only optional if you want useless output (the documentation should make this clearer).

ADD REPLY
0
Entering edit mode

I managed to get a consensus by now via:

samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -vf reference.fa - | bcftools call -m -Oz -o calls.vcf.gz -
tabix calls.vcf.gz
bcftools consensus -f reference.fa -i calls.vcf.gz

However the output is not what I need. I have a site in the alignment (it's not in the sequences of my original post because I have modified the test reads by now) where I have the following situation:

The reference says T, 2 reads say C and one read says G.

In the consensus I get the IUPAC code Y, which means 'T or C'.

Side question: Why doesn't he say B (IUPAC for 'C or G or T')?

Main question: So he says it's either the reference or the base with the highest support. BUT: Since I am mapping aDNA reads to a modern reference, I trust the signal in the reads more than the reference. I do not want the reference base to be considered when building the consensus. How do I achieve this?

ADD REPLY
0
Entering edit mode
9.5 years ago
Lesley Sitter ▴ 610

I'm not familiar with this method of using bcftools to create a consensus sequence. What I did when I needed a consensus was first building a blast database from fasta A, then using blastn to blast fasta B to it. Take output format 6 so that you get a nice tabled format.

Then using some sort of filtering method (I used R, but excel could also work) filter out alignments that have no gaps, an x-amount of mismatches (if you want them to be identical you only take alignments with score 0 in that column) and a very high ID match.

You should then end up with a list that contains contigs names, region that aligns and a several scoring columns like bit-score, P value, etc. You can then easily write a python script that looks for a specific header in one of the fastas and extract the location from the sequence based on the filtered blastn output.

ADD COMMENT
0
Entering edit mode

That's going to take forever for NGS-sized datasets.

ADD REPLY
0
Entering edit mode

I did it on demultiplexed reads (so about 2 million reads) against an assembly of 280mb with 70000 contigs and it only takes a couple of hours on 8 threads... and seeing that he has working on this for days, a couple of hours does not seem so bad :P

ADD REPLY
0
Entering edit mode

A small dataset like yours would take maybe 10 or 15 minutes with bwa/samtools if one uses the commands properly.

ADD REPLY

Login before adding your answer.

Traffic: 1944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6