Entering edit mode
6.3 years ago
ieie
▴
10
Dear all, I am trying to get the consensus sequence of a genome using:
~/samtools-1.3/samtools mpileup -Ou -f REF alnsorted.bam | bcftools call -mv -Oz -o 1aln_paired.vcf.gz
tabix aln_paired.vcf.gz
cat REF | bcftools consensus aln_paired.vcf.gz > consensus_paired.fa
I actually get a consensus sequence but when I perform the last line I get several times this message:
the site x overlaps with another variant, skipping...
I also get shorter consensus sequences, my reference is 159.295 bp long and for each constructed genome I get around 158.700 158.900 bp.
Does anybody know if these results are wrong or it can happen that I get a warning message and the consensus sequences are shorter than the reference? thanks a lot!
Just a guess: Are there lot's of deletions in your vcf?
If I understand the format correctly probably yes, as I have a lot of positions where there is this
Not sure if still on time, but you can try to workaround it by retaining only biallelic sites. Also, you can run a script to remove lines with more than one character in your sequence. This will remove indels. You can do it manually with
awk 'length($REF_column))==1' {print $0}' input.file | awk 'length($ALT_column))==1' {print $0}'
(Substitute the REF and ALT columns for your corresponding column numbers)