Does anyone else use Freebayes for variant calling?
1
1
Entering edit mode
7.2 years ago

I have a bam file that I want to do some variant calling on. We had a company generate a VCF for use using this particular bam file but I'm not entirely sure it was done correctly. I want to see if I can "double check" their work using freebayes. I have freebayes working on my computer but I was wondering if anyone has a script that they use to generate their VCFs? I've been playing with the commands but I still have been getting random calls where they shouldn't be. It's partly because there is low coverage in some regions but I also think it may be too sensitive when it comes to calling variants.

I guess what I'm asking is if anyone has a script that I can mirror to try and generate a correctly variant called VCF?

Thanks!!

Assembly genome snp sequencing gene • 4.9k views
ADD COMMENT
0
Entering edit mode

did you check in the VCF header if the freebayes cmd-line is present ?

ADD REPLY
0
Entering edit mode

I'm not entirely sure how to do that?

ADD REPLY
0
Entering edit mode

Hello,

there is nothing special about using freebayes.

freebayes -f ref.fa aln.bam >var.vcf

This should be fine. Setting the defaults to other values should be avoided as long as you don't know what you are doing.

Removing variants that could be artefacts is a step after the variant calling. For this you could use for example bcftools.

fin swimmer

ADD REPLY
0
Entering edit mode

When I do that, pretty much just about every spot is marked as a variant. When I do something like --min-coverage 5, a lot of the random variants go away. It doesn't seem to be looking at the reads in the bam file to see if there is a variant?

ADD REPLY
0
Entering edit mode

Are you sure you used the same reference sequenc for variant calling as the one you used for aligment?

Please post the output of samtools view -H aln.bam and the first lines of the results in your vcf file including the header.

fin swimmer

ADD REPLY
0
Entering edit mode

I was actually using the wrong reference but they two weren't that different. I was making test VCFs that were about 1000 bp in size so it really wouldn't have made a difference. I changed the reference to the correct one and still have the same issue. I would say a good 200-300 of the 1000 bp are misscalled.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20171004
##source=freeBayes v1.1.0-46-g8d2b3a0
##reference=felCat8.fa
ADD REPLY
0
Entering edit mode

Sorry if I wasn't clear enough. We need the whole header and some example of variants.

fin swimmer

ADD REPLY
0
Entering edit mode
$ ./bcftools view -h persian2.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20171004
##source=freeBayes v1.1.0-46-g8d2b3a0
##reference=felCat8.fa
##contig=<ID=chrA1,length=239302903>
##contig=<ID=chrA1_JH408313_random,length=44248>
##contig=<ID=chrA1_JH408314_random,length=44008>
##contig=<ID=chrA1_JH408315_random,length=7119>
##contig=<ID=chrA1_JH408316_random,length=31597>
##contig=<ID=chrA1_JH408317_random,length=18869>
##contig=<ID=chrA1_JH408318_random,length=22488>
##contig=<ID=chrA1_JH408319_random,length=36565>
##contig=<ID=chrA1_JH408320_random,length=406983>
##contig=<ID=chrA1_JH408321_random,length=28486>
......
##phasing=none
##commandline="./freebayes --fasta-reference felCat8.fa --region chrA1:1-1000 --use-duplicate-reads C02036.bam --vcf persian2.vcf"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
......

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
......
ADD REPLY
0
Entering edit mode

Hello,

there is still some example in your vcf file missing. Without it is quiet difficult to help. And what does this ...... mean? What have you truncated there?

Did you already take a look in igv on those position?

As you can see in your post there is this line

##commandline="./freebayes --fasta-reference felCat8.fa --region chrA1:1-1000 --use-duplicate-reads C02036.bam --vcf persian2.vcf"

In the vcf file of your computer company should be something similar. There you can see what programm they use and with what parameters. This is what Pierre ask for.

Why do you use --use-duplicate-reads ?

fin swimmer

ADD REPLY
0
Entering edit mode

I have the same problem 4 years later. The first time I run it, it apparently works fine, but then it stops working for the same data forever, even if I delete all generated files (including indexes).

ADD REPLY
0
Entering edit mode
7.2 years ago

I use Freebayes a lot for variants. It is good, especially the multisample parallel version.

It should not "generate variants for almost every position". It might if the data are very highly variable (not Illumina? ) or quality bad or the reference seq might be bad. Also, the genotype under study might be a long way from the reference genotype.

Keep in mind that you should read the Freebayes info on github, and in particular remove low quality genotype calls ( as per the docs, I think score <100). This removes a lot of noise. Thereafter look at vcflib or vt for filtering.

Best wishes, Colin

ADD COMMENT

Login before adding your answer.

Traffic: 2152 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6