bcftools merge is resulting in a lot of missing data, how do I fix this?
3.6 years ago
devenvyas ▴ 760

I had BCF files from imputation with GLIMPSE (https://odelaneau.github.io/GLIMPSE/). I converted the individual imputation files (22 per individual) into bgzip-compressed VCF files. The vcf.gz files have complete data, as is expected for imputation.

I am trying to merge them so all individuals are in the same files (so going from n × 22 files down to 22 files). When I do this, a lot of data just go missing, and there is no longer complete data.

I am not sure what is going on. Each individual was imputed for the exact same sites, so I am very confused. Does anyone know how to fix this problem?

vcf bcf • 3.2k views
3.6 years ago

I reckon your files contain different variant sites. Say individual A has SNPs at positions 1, 2, 3; after imputation it will still have SNPs at positions 1, 2, 3. Individual B has SNPs at positions 4, 5, 6; after imputation it's still 4, 5, 6. Once you merge them into one file, individual A will have missing genotypes at positions 4, 5, 6, and individual B will have missing genotypes at positions 1, 2, 3. Compare the positions in your merged output files with those in your input files to see whether that's the case. If that's what's happening with your data, there are two ways to fix this:

1) Rerun the SNP calling so that it reports all sites, invariant as well as variant. In GATK that's -all-sites or -allSites, depending on the version; in bcftools call it means removing the -v flag (most tutorials have lines like bcftools call -mv -Ob -o calls.bcf, where -v means 'only report variant sites').

2) If you're sure that the genotypes at these sites are really reference rather than unknown, you can rerun bcftools merge with the -0 (--missing-to-ref) flag, which sets genotypes at sites absent from a file to reference (0/0). Being sure may be impossible, though: an absent site could be 0/0 (reference), but it could also be ./. (properly missing), for example because of a deletion or low coverage.
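To make the merge behaviour described above concrete, here is a toy Python model. It is not bcftools' actual implementation: it simply takes the union of positions across samples and fills whatever a sample lacks with ./. (or with 0/0 when the made-up missing_to_ref argument is set, which mimics the effect of -0). The function name, its argument, and the genotypes are all invented for illustration.

```python
# Toy model of a union-style merge; NOT how bcftools is implemented.
# merge_samples and missing_to_ref are hypothetical names for this sketch.

def merge_samples(samples, missing_to_ref=False):
    """samples maps sample name -> {position: genotype string}.

    Returns {position: {sample name: genotype}} over the union of positions.
    """
    fill = "0/0" if missing_to_ref else "./."
    all_positions = sorted(set().union(*(gts.keys() for gts in samples.values())))
    return {
        pos: {name: gts.get(pos, fill) for name, gts in samples.items()}
        for pos in all_positions
    }


# Individual A only has records at 1, 2, 3; individual B only at 4, 5, 6.
samples = {
    "A": {1: "0/1", 2: "1/1", 3: "0/1"},
    "B": {4: "0/1", 5: "0/1", 6: "1/1"},
}

merged = merge_samples(samples)
missing_per_sample = {
    name: sum(1 for row in merged.values() if row[name] == "./.")
    for name in samples
}
print(missing_per_sample)  # {'A': 3, 'B': 3}

# With the -0-like behaviour the gaps become 0/0 instead of ./.
merged_ref = merge_samples(samples, missing_to_ref=True)
print(merged_ref[4]["A"])  # 0/0
```

The point of the sketch is that the merged file always covers the union of sites, so any site that is absent from even one input shows up as missing for that sample.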

Edit: oh sorry, I just saw the 'imputed for the exact same sites'. Are you sure the input files all have the same positions?
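One way to check whether the inputs really do share the same sites is to collect the site keys from each file and compare the sets. The sketch below uses only the Python standard library and assumes text VCF or bgzip/gzip-compressed VCF input (it does not read BCF). The function names read_sites and compare_sites are made up for this example, and treating CHROM, POS, REF and ALT together as the site key is an assumption about what counts as "the same site".

```python
import gzip


def read_sites(path):
    """Return the set of (CHROM, POS, REF, ALT) keys in a VCF or VCF.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    sites = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            chrom, pos, _id, ref, alt = fields[:5]
            sites.add((chrom, int(pos), ref, alt))
    return sites


def compare_sites(paths):
    """Return (sites shared by all files, {path: sites not shared by all files})."""
    per_file = {path: read_sites(path) for path in paths}
    if not per_file:
        return set(), {}
    shared = set.intersection(*per_file.values())
    not_shared = {path: sites - shared for path, sites in per_file.items()}
    return shared, not_shared
```

Calling compare_sites(["indA.chr22.vcf.gz", "indB.chr22.vcf.gz"]) (hypothetical file names) returns the shared sites plus, for each file, the sites that at least one other file lacks. If every per-file set comes back empty, the inputs cover identical sites under this key and the missing data must come from somewhere else; if REF or ALT differ at the same position, those records count as different sites here.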

14 months ago
r.shamsi • 0

I had 40 *.vcf files from FreeBayes in Galaxy.

The files were filtered separately. I merged them into one *.vcf file, which contains 6000 SNPs.

When I tried to run a GWAS, a lot of data went missing, and the data are no longer complete.

I attempted to follow your instructions in the Unix Bash terminal:

1) rerun the SNP-calling including all invariable and variable sites. In GATK that's -all-sites or -allSites, in bcftools call that means removing the -v flag (most tutorials have lines like bcftools call -mv -Ob -o calls.bcf, where -v means 'only report variable sites')

2) if you're sure that these sites aren't missing (may be impossible? they could be 0/0 - reference, they could be ./. - proper missing, maybe deleted, maybe low coverage) you can rerun bcftools merge using the -0 flag. In this case missing alleles are set to reference (0/0).

However, it didn't work. Do you have any advice, please?
