Entering edit mode
6.0 years ago
DNAngel
▴
250
Hi all,
Alright so I'll edit my question here since my other post got closed. After generating consensus sequences with bcftools and vcfutils I am unsure how to get sequences without IUPAC ambiguity. My code that works in generating sequences:
samtools mpileup -Q 20 -q 20 -d 5000 -uf ref.fa sorted.bam | \
bcftools call -c | \
vcfutils.pl vcf2fq -d 10 -Q 20 -l 1 > output_consensus.fq
seqtk seq -a output_consensus.fq > output_consensus.fa
The alignment is filled with IUPAC codes but I need it to output the most called base per position or at least choose one of the bases if it happened that there was a 50/50 between bases. Using samtools and bcftools v 1.8.
Thanks!
Mapping reads to a reference is also called alignment. Therefore, mapped reads in a SAM file are aligned reads. A BAM file is a binary version of the SAM format. Typically, to save time and disk space, one does:
instead of
bwa mem (...) > aligned.sam
. To convert your SAM to BAM, dosamtools view -o aligned.bam aligned.sam
.Another thing:
While true that this saves time, it is actually a pretty bad idea to align against a subset of the genome. The reason is that this may lead to false-positive alignments. The aligner always try to match every read to the reference and outputs the best hit. Given some of your reads are actually off-targets from the capturing (and this always happens during library prep), these might falsely be aligned to the exome rather than their true origin somewhere else in the genome. Therefore, it is good practice to align against the entire reference genome to avoid false-positives.
I understand that it would be better to do the whole alignment but this is part of a pipeline test (and my committee wants me to show both methods) so I am doing that. I will be doing the full alignment after once I even get enough space on a server. Right now on my end I need to align to just a few genes. I've tried many ways so far with samtools and bcftools, even trying to generate vcf files, but nothing has worked. Right now I just want to make a consensus sequence but I am at a loss at how to write my bcftools command.
So far the only thing that seemed to give the LEAST errors was:
But it says that it cannot recognize vcf.gz as a file which I don't understand why!
In case the content causes confusion in the future, here is the original content as it existed on 1/3/2019, 2:30 PM Eastern Time, before it was edited away:
Why you shouldn't do this ATpoint already explained. What I like to add is, that you don't save this much time.
bwa
(and other mapper/aligner) using a data structure of the reference that makes it possible, that the time needed to find the correct position is independent of the reference size. Instead it depends only on the length of the read you want to align. This data structure have to get loaded into the memory and the beginning and that's the only point where a larger reference needs more time. In comparison to the whole runtime of the alignment this is very small.Concerning your question, this is what I would do:
fin swimmer
I think first you can create consensus sequence and then align your reads using bwa.
That doesn't really make sense, can you elaborate on what you mean?
I've ran into a new issue, now I cannot create a consensus sequence at all and I don't understand why.
I have my indexed (from bwa) reference files and my sorted alignment bam files for 30 species in the same directory. I am trying to make consensus sequences of them all. I am even trying to do this now with vcf (even though I am not working on variant calling, I just want a consensus file).
My command:
samtools mpileup -uf ref_geneA.fa sp1.aln.sorted.bam | bcftools call -c - | vcfutils.pl vcf1fq > sp1_consensus.fq
The error message:
What is the uninitalized value $1? Isn't vcfutils.pl just converting the vcf into a fastq file which I can later convert to fasta?
Make sure you have the most recent version of
bcftools
(v1.9).bcftools mpileup -Ou -f ref.fa aln.bam
collects metrics about each covered reference base. This information are used to find suspicious sitesbcftools call -Ou -mv
calls variants on suspicious sitesbcftools norm -f ref.fa
normalize variantsbcftools consensus -f ref.fa -o consensus.fa
creates aconsensus.fa
based on the variants foundfin swimmer
Hi, I know you mentioned to me earlier in another site that bcftools consensus will fill in gaps with bases from the reference...but I know from other scripts where colleagues have worked with paired-end data that this is not true. They were able to get the consensus sequence of their species1.bam files only, with gaps to show no coverage at a base from the reference. In my case (single-end data), my consensus sequence is merging the contigs from species1.bam (from the calls.vcf.gz file) and any gaps (which I know must exist and confirmed during tview) are filled with bases from the reference. There has to be a way to just merge contigs from the mpileup file of species1.bam without incorporating bases from the reference fasta file.
Edit: I am using bcftools 1.9 and samtools 1.9
Hello DNAngel,
are you the same user as here?
I is possible to create such a reference sequence. But not with
bcftools
alone. What you have to do is:zcat Sample.per-base.bed.gz|awk '$4==0 {print}' > zero-coverage.bed
bedtools maskfasta -fi input.fa -bed zero-coverage.bed -fo output.fa
You can define any character you like for masking using the-mc
parameter. The default isN
output.fa
as the reference is thebcftools consensus
stepfin swimmer
Yes that is me! Thank you I will try this as well. Also, I think I may have gotten it with bcftools using call as you suggested but using -d 10 and -Q 20 settings to ensure high quality base calls are kept. And that (to me right now at least) makes the consensus only contain the high quality base calls from ALT. I will have to slowly go through both commands to ensure I am getting only my species bases.
Thank you fin swimmer for all the help :):):) saving my life here lol.
AFAIK, you can't use a modified reference genome with bcftools consensus. You'll get an error like "The fasta sequence does not match the REF allele at..." and the consensus output will die at that point (where the 'N' is in the reference). I
You CAN, however, generate the pileup with the masked reference (per sample), and then you're good on generating a consensus sequence using the same masked reference.
Hello DNAngel ,
please do not edit your posts in that way, that no one can see to which question already existing answers/comments belong.
vcfutils.pl vcf2fq
is not the way to go, because it:Please explain what's the problem with bcftools consensus and this additional description.
Thanks!
fin swimmer
Hi fin swimmer,
Sorry to reply to an old post but this so adequately fits my issue - I need to create a consensus sequence without IUPAC codes. But I can't use bcftools consensus because it will not generate a consensus sequence from a FASTA reference containing multiple sequences.
vcfutils.pl vcf2fq can deal with multi-sequence references but it uses IUPAC codes.
Is there any way to generate a consensus from a multi-sequence reference without IUPAC codes?
Thank you so much for your help!
Hello CC,
can you add more information what you mean by this and how it looks like? A fasta file can contain multiple sequences, as long as they have different headers and
bcftools
should not have any problems with it.fin swimmer
Sure, here was my bcftools consensus approach:
However, when I do this using a multi-sequence reference fasta, only the first sequence comes out in the consensus.
I fixed this issue by using the vcfutils.pl vcf2fq approach, but then I have the IUPAC code issue...