Hello,
I have been trying to generate a consensus sequence with samtools/bcftools for a couple of days now and I'm stuck.
First my test data:
Marvin$ cat reference.fa
>myseq
TTTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
TCATCTTTTCCACAAATTGTGGAAAACTTATGTCCACATTGTGGACTCATCTTTTTTCACCTGTGGAAAA
CTTTCTCAATTTATGGTAAAATTTTCTTATTACAATCTTGATAGGAGTACACTATGACAGAAAATGAACA
AATTTTTTGGAATAGGGTTTTAGAACTTGCACAAAGTCAATTAAAGCAAGCGACATTTGATTTTTTCGTT
TCAGATGCTAAATTATTGAAAGTTGAAGGAAATATTGCGACTATCCTTCTTGATGATATGAAAAAATTAT
Marvin$ cat reads.fastq
@read1
TTTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@read2
AAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTATCATCTTTTCCACAAATTGTGGAAAACTTA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@read3
ACAGAATTTATCATCTTTTCCACAAATTGTGGAAAACTTATGTCCACATTGTGGACTCATCTTTTTTCAC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
I have copied parts of the reference to generate these reads. They span a contig across region 0-130 in the reference.
Goal: Reconstruct the consensus sequence of that region (length 130).
My solution/try:
bwa index reference.fa #index the reference
bwa mem reference.fa reads.fastq > aln.sam #map the reads
samtools view -bh aln.sam > aln.bam #convert sam to bam
samtools sort aln.bam aln_sorted #sort the bam, output = aln_sorted.bam
samtools index aln_sorted.bam #don't remember which downstream program needed this
bedtools merge -i aln_sorted.bam #shows me the contigs in the mapping. only 1 present at myseq:0-130
Then I create the vcf file:
samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -v - > my_vcf_file.bgz
Now I index that file with tabix, because earlier I've had an error message which told me that the index was missing:
tabix -p vcf my_vcf_file.bgz
And now I try to finally get the consensus:
bcftools consensus -i my_vcf_file.bgz #option -i to get iupac codes at ambiguous sites
But it doesn't work, it throws the Usage page at me. And it says:
Usage: bcftools consensus [OPTIONS] <file.vcf>
Actually all it should need is the input file (which I have provided), because OPTIONS are optional right?
However if I try it with the parameter -f
, I get:
Marvin$ bcftools consensus -i -f reference.fa my_vcf_file.bgz
>myseq
The fasta sequence does not match the REF allele at myseq:1:
.vcf: [N]
.vcf: [N] <- (ALT)
.fa: [T]TTTGTTTTTGTTTTTAACCACTTTTCCACAAGAAAAGATGCTATCGAATCTCTTGATTAACAGAATTTA
I don't understand this error and I am not sure if I am moving into the right direction, I find no way out of this maze, please help
You'll want to use
bcftools call
on the output of mpileup (assuming you're using a recent version of samtools). The -f option is only optional if you want useless output (the documentation should make this clearer).I managed to get a consensus by now via:
However the output is not what I need. I have a site in the alignment (it's not in the sequences of my original post because I have modified the test reads by now) where I have the following situation:
The reference says T, 2 reads say C and one read says G.
In the consensus I get the IUPAC code Y, which means 'T or C'.
Side question: Why doesn't he say B (IUPAC for 'C or G or T')?
Main question: So he says it's either the reference or the base with the highest support. BUT: Since I am mapping aDNA reads to a modern reference, I trust the signal in the reads more than the reference. I do not want the reference base to be considered when building the consensus. How do I achieve this?