Hi everyone,
Hope everyone is doing well. I am writing this post because I have been stuck on this problem for the past week and I do not see it getting better. I am working on making a new genome assembly by making a consensus reference sequence using variance called within the same genome. Here is my best attempt to communicate the conundrum:
I start with a VCF.
Step I:
Using the hg38 standard reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/analysisSet/) and the variants from the VCF, I use the following command to build a consensus fasta.
HAPLO = 1 ( to get the first allele for all coordinates )
STDREF = GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
CHAIN = output.chain ( There is an output chain file produced which will help in lifting over from hg38 to new assembly )
CONCENSUS = output.concensus.fa ( This is the output concencus sequence )
bcftools consensus --chain ${CHAIN} --fasta-ref ${STDREF} --haplotype ${HAPLO} -o ${CONCENSUS} ${VCF} 2&> ${LOG}
Step II:
Using the fasta file generated, I build a bowtie index
BOWTIE_OUTDIR = output.bowtie.dir/bowtie.prefix ( bowtie requires a path and a prefix for calculating bowtie index )
bowtie2-build --threads 10 ${CONCENSUS} ${BOWTIE_OUTDIR}
In the subsequent steps, I realise something is off with mapping to this bowtie index. So, I rebuilt the fatsa using the bowtie index for some sanity checks.
bowtie2-inspect ${BOWTIE_OUTDIR}
I then checked if the coordinates and sequences in the two fasta files, consensus fasta and bowtie fasta, are the same and to my surprise, they weren't, i.e.
**Consensus Fasta**
>chr22:43759028-43759179
ATCTGGCTTGAGATTCTCAGCCATTATCTTTCCAAATATGACTTCTGCCCTGTTTTTTCT
CTTATACTGCTATGGGTTGAAGGTGTTCCTCCAAAATTTAGGTGTTGCCACTGTGAGAGG
ATTAAGAGGTGGTTAGGCCTGAAGGTTCCGCC
**Bowtie Fasta**
>chr22:43759028-43759179
CTGCTATGGGTTGAAGGTGTTCCTCCAAAATTTAGGTGTTGCCACTGTGAGAGGATTAAG
AGGTGGTTAGGCCTGAAGGTTCCGCCCTCTTGAATAGGAGTAGGTGCATTTTTTTTTTTT
TTTTCCTGAGATGGGAGTCTCACTCTGTTGCC
My question is, how is it possible that the same fasta when built into a bowtie index would be different? Essentially the mystery, if you may, comes down to :
FASTA 1 ---> BOWTIE BUILD ---> BOWTIE INDEX ---> BOWTIE INSPECT ---> FASTA 2
but
FASTA 1 != FASTA 2
Thank you for your time. I would appreciate it deeply if anyone can point me towards relevant articles or if anyone could give me some leads on this problem.
Building a consensus sequence is not a well-defined procedure but an approximation. There could be multiple consensus sequences that are equally valid.
There is no mathematical representation or definition for what "consensus building" means. Every tool is free to choose what they think is the right approach.
Specific issue seems to be that a fasta sequence (it is the
consensus
generated bybcftools
) that was used to build a bowtie index can't be faithfully recreated usingbowtie2-inspect
command.Can you tell us what was off?
GenoMax, Thank you for reacting and very sorry for the delay in getting back to you. Following up on your question, the same coordinates output different sequences for the consensus fasta and bowtie fasta. The bowtie fasta was made using the bowtie index for the consensus fasta, so it was peculiar why they would not be equivalent.