Hi everyone!
I'm new in bioinformatics and also in informatics so I'm struggling a bit trying to learn on my own.
At the moment I'm having some difficulties trying to generate a fasta file from a bam file of a complete human genome.
I've red on the internet that a way to obtain what I'm trying to get is to follow this command:
call variants
bcftools mpileup -Ou -f reference.fa alignments.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
normalize indels
bcftools norm -f reference.fa calls.vcf.gz -Ob -o calls.norm.bcf
filter adjacent indels within 5bp
bcftools filter --IndelGap 5 calls.norm.bcf -Ob -o calls.norm.flt-indels.bcf
apply variants to create consensus sequence
cat reference.fa | bcftools consensus -m lowcoveragesites.bed calls.vcf.gz > consensus.fa
1) Let me start by saying that I find a bit weird that, according to the command above, I have to generate normalized and filtered files (call.norm.bcf, call.norm.filtr-indels.bcf) but never use them to create the consensus. Is there something wrong that the command?
But that being said, I'm following step by step the command.
2) I'm having some issues with the overlapping variants.
I constantly receive the error message:
The site 3:60771641 overlaps with another variant, skipping...
for different position in my genome. Is there a way to solution it?
3) Moreover, I was wondering if the bcftools consensus was considering the Heterozygous alleles inserting at the specific position the letters for snp, according to the IUPAC, so I started searching for R,W,Y,... in my consensus genome and I only found 3 heterozygous sites with make me think something went wrong.
Can you help me with my issue? Do you know how I can easily get a consensus fasta from my bam?
Thank you!
Can you please help me with my issue?
Can you also please have some patience.
Everyone here is doing this on voluntarily basis (as in: we sacrifice our free-time to pro-bono help out others)
Of course, sorry for the misunderstanding, I was just trying to edit my post but I ended commenting myself
you are asking too many nebulous questions, tools are not designed to cover every corner case,
1) what is the question there? That you find something weird?
2) if you have an overlapping variant, you get that error, the tool cannot resolve that variant, perhaps you could filter out the overlapping variant from the file
3) again, what is the question? it is not clear
thank you for your reply, I'll try to be more clear.
My first question is " is the command I found correct? or should I use the normalized-filtered file when I have to apply the variants ?
So that my last command won't be this:
but this:
Than I'm wondering what can be the reason for so many overlapping variants? Is there something wrong with the command that can make you think that the problem is with my pipeline or is the pipeline correct and probably there is something wrong with the data? Or maybe everything is ok , it is normal to have lots of overlapping variant!?
And finally, do you recon it is normal that I obtain a consensus file for a whole human genome with only 3 heterozygous positions?
Sorry for so many questions, but I'm not an expert, I'm just a student trying to learn on my own and sometimes I face different doubts and I don't have anybody to ask. So really thank you for your help end time.