Dear All,
I have a problem with bcftools roh and hope you can guide me to what may be causing this:
I run Example-1 of the Github: http://samtools.github.io/bcftools/howtos/roh-calling.html and see the output consists of a state (0:HW, 1:AZ) , and quality for each position that was in the original vcf file (test.vcf.gz).
I ham a bam file and like to get the exact same-format output for my list of positions. Here is how I generate the original .vcf (just in case that the way I am generating this is causing the issue that I explain later).
samtools mpileup -uf $FASTA ./myfile.bam -l ./dbSNP_chr9.bed | bcftools call -m -O z > roh.test.vcf.gz
This will provide a .vcf for 309 positions that I queried, and the header/rest of the vcf looks fine:
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
...
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.3.1+htslib-1.3.1
##bcftools_callCommand=call -m -O z
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mysample
9 421981 . G . 0 . DP=470;SGB=-0.379885;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;AN=2;DP4=208,196,0,1;MQ=60 GT 0/0
9 422026 . T . 0 . DP=405;MQSB=1;MQ0F=0;AN=2;DP4=148,216,0,0;MQ=60 GT 0/0
9 3206113 . A . 0 . DP=338;MQSB=1;MQ0F=0;AN=2;DP4=219,79,0,0;MQ=60 GT 0/0
...
But when I try the command in the example, I only receive two lines (positions) in the output file as also noted in the stdout!
bcftools roh --AF-dflt 0.4 ./roh.test.vcf.gz
# This file was produced by: bcftools roh(1.3.1+htslib-1.3.1)
# The command line was: bcftools roh --AF-dflt 0.4 ./roh.test.vcf.gz
#
# [1]Chromosome [2]Position [3]State (0:HW, 1:AZ) [4]Quality
9 5009541 1 3.0
9 5125250 1 3.0
Number of lines: total/processed: 309/2
My .vcf has PL but I also tried skipping the PL scores with -G30 (as is in the example), just in case it had something to do with this but the answer was exactly the same.
I greatly appreciate if you could point me to what may be causing this.
Thank You,
nfarnoud
Yes, it's puzzling. No error. But as noted in the stdout (last line of the block above) it says 2 (out of 309) lines are processed with no further explanation:
Number of lines: total/processed: 309/2
The genotypes that you have posted are each 0/0, i.e., homozygous reference. How are the remaining genotypes in your input file? Why are there only 309 variants? Are they scattered genome-wide? Note that, for ROH to do its work, the variants have to be a certain distance apart. For example, one cannot detect ROH with accuracy by processing 3 variants that span a region of 1 megabase.
The distance between variants is something I did not take into consideration! The reason that there are 309 variants is that I wanted to query LOH only on a specific chromosome (chr 9) and also I am working with a targeted assay. So what I did was finding all dbSNP snps that fall within my assay's .bed file and queried them using the 1st command (samtools + bcftools), and then used the resultant .vcf file as the input to roh command. About genotypes: most of the SNPs look homozygous in chr9 if we look only at GT field, however the percentage contribution of REF/ALT alleles for homozygous-called SNPs is still quite different (DP4) and that is why I thought the roh command may model those per-allele-frequencies and generate a different call. I may have not used the right approach to answer my question.
Yes, basically, you require a high density of genotypes in order to accurately make a ROH call. I have used exome-seq data in the past for ROH analysis. You may have to play around with the parameters.
Another option is to use PLINK's ROH implementation.
@kevin Thanks a lot. I came across that in searches, will give it try. Thanks again for your help and input!