Problems with consensus fasta
0
0
Entering edit mode
3.0 years ago

Hi everyone!

I'm new in bioinformatics and also in informatics so I'm struggling a bit trying to learn on my own.

At the moment I'm having some difficulties trying to generate a fasta file from a bam file of a complete human genome.

I've red on the internet that a way to obtain what I'm trying to get is to follow this command:

call variants

bcftools mpileup -Ou -f reference.fa alignments.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz

normalize indels

bcftools norm -f reference.fa calls.vcf.gz -Ob -o calls.norm.bcf

filter adjacent indels within 5bp

bcftools filter --IndelGap 5 calls.norm.bcf -Ob -o calls.norm.flt-indels.bcf

apply variants to create consensus sequence

cat reference.fa | bcftools consensus -m lowcoveragesites.bed calls.vcf.gz > consensus.fa

1) Let me start by saying that I find a bit weird that, according to the command above, I have to generate normalized and filtered files (call.norm.bcf, call.norm.filtr-indels.bcf) but never use them to create the consensus. Is there something wrong that the command?

But that being said, I'm following step by step the command.

2) I'm having some issues with the overlapping variants.

I constantly receive the error message:

The site 3:60771641 overlaps with another variant, skipping...

for different position in my genome. Is there a way to solution it?

3) Moreover, I was wondering if the bcftools consensus was considering the Heterozygous alleles inserting at the specific position the letters for snp, according to the IUPAC, so I started searching for R,W,Y,... in my consensus genome and I only found 3 heterozygous sites with make me think something went wrong.

Can you help me with my issue? Do you know how I can easily get a consensus fasta from my bam?

Thank you!

bed bcftools samtools • 1.5k views
ADD COMMENT
0
Entering edit mode

Can you please help me with my issue?

ADD REPLY
0
Entering edit mode

Can you also please have some patience.

Everyone here is doing this on voluntarily basis (as in: we sacrifice our free-time to pro-bono help out others)

ADD REPLY
0
Entering edit mode

Of course, sorry for the misunderstanding, I was just trying to edit my post but I ended commenting myself

ADD REPLY
0
Entering edit mode

you are asking too many nebulous questions, tools are not designed to cover every corner case,

1) what is the question there? That you find something weird?

2) if you have an overlapping variant, you get that error, the tool cannot resolve that variant, perhaps you could filter out the overlapping variant from the file

3) again, what is the question? it is not clear

ADD REPLY
0
Entering edit mode

thank you for your reply, I'll try to be more clear.

My first question is " is the command I found correct? or should I use the normalized-filtered file when I have to apply the variants ?

So that my last command won't be this:

cat reference.fa | bcftools consensus -m lowcoveragesites.bed calls.vcf.gz > consensus.fa

but this:

cat reference.fa | bcftools consensus -m lowcoveragesites.bed calls_norm_filt-indels.vcf.gz > consensus.fa

Than I'm wondering what can be the reason for so many overlapping variants? Is there something wrong with the command that can make you think that the problem is with my pipeline or is the pipeline correct and probably there is something wrong with the data? Or maybe everything is ok , it is normal to have lots of overlapping variant!?

And finally, do you recon it is normal that I obtain a consensus file for a whole human genome with only 3 heterozygous positions?

Sorry for so many questions, but I'm not an expert, I'm just a student trying to learn on my own and sometimes I face different doubts and I don't have anybody to ask. So really thank you for your help end time.

ADD REPLY

Login before adding your answer.

Traffic: 2291 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6