bcftools consensus fasta index not equivalent to bowtie index built using the same fasta
1
0
Entering edit mode
2.2 years ago

Hi everyone,

Hope everyone is doing well. I am writing this post because I have been stuck on this problem for the past week and I do not see it getting better. I am working on making a new genome assembly by making a consensus reference sequence using variance called within the same genome. Here is my best attempt to communicate the conundrum:

I start with a VCF.

Step I:

Using the hg38 standard reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/analysisSet/) and the variants from the VCF, I use the following command to build a consensus fasta.

HAPLO = 1 ( to get the first allele for all coordinates )

STDREF = GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz 

CHAIN = output.chain ( There is an output chain file produced which will help in lifting over from hg38 to new assembly )

CONCENSUS = output.concensus.fa ( This is the output concencus sequence )

 bcftools consensus --chain ${CHAIN} --fasta-ref ${STDREF} --haplotype ${HAPLO} -o ${CONCENSUS} ${VCF} 2&> ${LOG}

Step II:

Using the fasta file generated, I build a bowtie index

BOWTIE_OUTDIR = output.bowtie.dir/bowtie.prefix ( bowtie requires a path and a prefix for calculating bowtie index )

bowtie2-build --threads 10  ${CONCENSUS} ${BOWTIE_OUTDIR}

In the subsequent steps, I realise something is off with mapping to this bowtie index. So, I rebuilt the fatsa using the bowtie index for some sanity checks.

bowtie2-inspect ${BOWTIE_OUTDIR}

I then checked if the coordinates and sequences in the two fasta files, consensus fasta and bowtie fasta, are the same and to my surprise, they weren't, i.e.

**Consensus Fasta**
>chr22:43759028-43759179
ATCTGGCTTGAGATTCTCAGCCATTATCTTTCCAAATATGACTTCTGCCCTGTTTTTTCT
CTTATACTGCTATGGGTTGAAGGTGTTCCTCCAAAATTTAGGTGTTGCCACTGTGAGAGG
ATTAAGAGGTGGTTAGGCCTGAAGGTTCCGCC

**Bowtie Fasta**
>chr22:43759028-43759179
CTGCTATGGGTTGAAGGTGTTCCTCCAAAATTTAGGTGTTGCCACTGTGAGAGGATTAAG
AGGTGGTTAGGCCTGAAGGTTCCGCCCTCTTGAATAGGAGTAGGTGCATTTTTTTTTTTT
TTTTCCTGAGATGGGAGTCTCACTCTGTTGCC

My question is, how is it possible that the same fasta when built into a bowtie index would be different? Essentially the mystery, if you may, comes down to :

FASTA 1 ---> BOWTIE BUILD ---> BOWTIE INDEX ---> BOWTIE INSPECT ---> FASTA 2

but

FASTA 1 != FASTA 2

Thank you for your time. I would appreciate it deeply if anyone can point me towards relevant articles or if anyone could give me some leads on this problem.

bcftools fasta bowtie • 1.6k views
ADD COMMENT
1
Entering edit mode

Building a consensus sequence is not a well-defined procedure but an approximation. There could be multiple consensus sequences that are equally valid.

There is no mathematical representation or definition for what "consensus building" means. Every tool is free to choose what they think is the right approach.

ADD REPLY
1
Entering edit mode

Specific issue seems to be that a fasta sequence (it is the consensus generated by bcftools) that was used to build a bowtie index can't be faithfully recreated using bowtie2-inspect command.

ADD REPLY
0
Entering edit mode

I realise something is off with mapping to this bowtie index

Can you tell us what was off?

ADD REPLY
0
Entering edit mode

GenoMax, Thank you for reacting and very sorry for the delay in getting back to you. Following up on your question, the same coordinates output different sequences for the consensus fasta and bowtie fasta. The bowtie fasta was made using the bowtie index for the consensus fasta, so it was peculiar why they would not be equivalent.

ADD REPLY
0
Entering edit mode
20 months ago

Thanks for your reply Istvan Albert, and sorry about the delay. Initially, that was also my thought as well but looking at a few anecdotes made it seem like it was systematic. As GenoMax pointed out, the problem indeed was the translation of fasta between bcftools and bowtie2.


The VCF file used for making the consensus sequence had entries like:

chr1 1481959 C *, T (Example)

Assuming reference genome position chr1:1481958-1481961 is GCAG, this resulted in the following consensus sequence corresponding to the same position (only using the first allele):

Consensus sequence using bcftools: G*AG

Bowtie, on the other hand, removes these * and adjusts the coordinates accordingly. This is the reason why I was seeing systematic shifts around overlapping deletions.

The Variant Call Format (VCF) Version 4.2 Specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf) uses * to indicate that the allele is missing due to an upstream deletion.

The solution in my case was to remove all * from the consensus fasta.

ADD COMMENT
0
Entering edit mode

thanks for following up. I will move your post to an answer as it is instructive and answers the original question

in a nutshell, the answer was that the consensus FASTA generated by bcftools consensus uses an extended alphabet (an extended FASTA) that contains characters that will be ignored by bowtie

ADD REPLY

Login before adding your answer.

Traffic: 1247 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6