I have a lot of bam files for which i need to call the genotype for, to create bcf files for and then run Plink. I am using two methods. In this is example i have 5 individuals, each bam file has its own RG tag, so they are distuingishable.
Method 1) I am for each individual sample creating the bcf file -> then merging these bcf files -> running plink on merged bcf
step 1) bcftools mpileup -I -f hg19.fa Individual_1.bam | bcftools call -c -Ob -o output_1.bcf (this i repeat for all 5 individuals)
step 2) bcftools merge -l bcfset.txt -Ob -o merge.bcf (with bcfset.txt containing the names of the individual bcf files)
step 3) bcftools view -Ob -m2 -M2 -v snps merge.bcf --regions chr21 -o filter.bcf (filter to get sites on chromosome 21)
With the final output being for a single line
chr21 9411618 . C T 57.9723 . VDB=0.0221621;SGB=-0.511536;MQ0F=0;AF1=1;AC1=2;MQ=36;FQ=-35.9869;DP=4;DP4=0,0,3,0 GT:PL ./.:. ./.:. ./.:. 1/1:90,9,0 ./.:.
step 4) Running plink -> which works!
Method 2) I am creating the pileup file by passing in a list with the bam files
step 1) bcftools mpileup -b bamfile.txt -I -f hg19.fa -Ob -o pileupfile.bcf
step 2) I have tested two methods to call
2A) bcftools call --threads 40 -m pileupfile.bcf -v -Ob -o callfile1.bcf (which i dont do filter as method 1 step 3)
chr21 9411618 . C T 65 . DP=4;VDB=0.0221621;SGB=0.823243;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=36 GT:PL ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 1/1:90,9,0 ./.:0,0,0
HERE PLINK FAILS
2B) bcftools call -c -Ob pileupfile.bcf -o callfile2.bcf (similar call and filter as in method 1 using the -c option in the call and similar filtering as step 3)
chr21 9411618 . C T 57.9785 . DP=4;VDB=0.0221621;SGB=0.823243;MQ0F=0;AF1=1;AC1=10;DP4=0,0,3,0;MQ=36;FQ=-26.8149 GT:PL 0/1:0,0,0 0/1:0,0,0 0/1:0,0,0 1/1:90,9,0 0/1:0,0,0
HERE PLINK ALSO FAILS
When running plink the issues are with the missing genotypes, so since it works in method 1 where the information is ./.:. i think the issue is for method 2A and 2B that the missing genotypes are not represented either with ./.:0,0,0 or 0/1:0,0,0.
QA) Can anyone help me with either replacing the '0,0,0' to just a simple '.' without destroying the bcf format or in some way suggest how i can get the same output format for the 2A or 2B method as i get in method 1. I have looked into bcftools filter and annotate but i cant really figure out how to solve this specific issue.
Please post at least one full .log file showing HOW plink is failing.
PLINK v1.90b4.4 64-bit (21 May 2017) Options in effect: --allow-extra-chr --bcf head5callmv_21.bcf --const-fid --geno 0 --keep-allele-order --make-bed --out head5mv21 --recode --set-missing-var-ids @:#$1,$2 --vcf-idspace-to _
Hostname:
Working directory: Start time: Wed Mar 17 17:53:27 2021
Random number seed: 1616000007 772425 MB RAM detected; reserving 386212 MB for main workspace. --bcf: head5mv21-temporary.bed + head5mv21-temporary.bim + head5mv21-temporary.fam written. Warning: 224780 variant records had no GT field. 224780 variants loaded from .bim file. 224780 missing IDs set. 5 people (0 males, 0 females, 5 ambiguous) loaded from .fam. Ambiguous sex IDs written to head5mv21.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 5 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0. Error: All variants excluded due to missing genotype data (--geno).
End time: Wed Mar 17 17:53:27 2021
There was a --bcf bugfix in February 2020; you should update your plink 1.9 build. (You can also use plink 2.0's --bcf, which should be more efficient.)