6.0 years ago
bioinfo89
Hi All,
I am working on 1000 Genomes (1000g) data. I have 25 tab-delimited text files, one per population. Each file holds jointly genotyped data, so it contains the genotypes of all the samples (~60-120) in that population, laid out as in a VCF.
Format of the File:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19625 NA19703 NA19711 NA19818 NA19835 NA19904 NA19917 NA19922 NA19984 NA20127 NA20278 NA20287 NA20294 NA20299 NA20318 NA20322 NA20336 NA20341 NA20346 NA20356 NA20361 NA19700 NA19704 NA19712 NA19819 NA19900 NA19908 NA19914 NA19920 NA19923 NA19985 NA20281 NA20289 NA20296 NA20314 NA20320 NA20332 NA20339 NA20342 NA20351 NA20357 NA20362 NA19701 NA19707 NA19713 NA19834 NA19901 NA19909 NA19916 NA19921 NA19982 NA20126 NA20276 NA20282 NA20291 NA20298 NA20317 NA20321 NA20334 NA20340 NA20344 NA20355 NA20359 NA20412
chr 1234 . TT T . . VRT=2 GT 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./.
What I want is to create a single VCF file from all 25 population VCFs. It should list every unique site found in any population, including the sites shared between populations, with genotype columns for all the samples from all populations.
Format I want:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19625 NA19703 NA19711 NA19818 NA19835 HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
chr 1234 . TT T . . VRT=2 GT 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0
Is there a way to do this?
Thank you!
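For illustration only, the merge described above could be sketched in plain Python. This is a rough sketch under several assumptions that are not confirmed by the post: the files are truly tab-delimited, each has a single `#CHROM` header line, FORMAT is `GT` only, sample names do not repeat across populations, and a site is identified by CHROM, POS, REF and ALT. The function names and the toy records below are made up for the example; samples that lack a site get `./.`.

```python
import io

MISSING = "./."  # genotype written for samples whose file lacks the site
FIXED_HEADER = ["#CHROM", "POS", "ID", "REF", "ALT",
                "QUAL", "FILTER", "INFO", "FORMAT"]


def read_vcf_like(handle):
    """Return (sample_names, {site_key: (first 9 fields, {sample: genotype})})."""
    samples, sites = [], {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        # Assumption: a site is identified by CHROM, POS, REF and ALT.
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        sites[key] = (fields[:9], dict(zip(samples, fields[9:])))
    return samples, sites


def merge_vcf_like(handles):
    """Merge several per-population tables into one list of output lines."""
    all_samples, merged = [], {}
    for handle in handles:
        samples, sites = read_vcf_like(handle)
        for name in samples:
            if name not in all_samples:
                all_samples.append(name)
        for key, (fixed, genotypes) in sites.items():
            # Keep the fixed columns from the first file that has the site.
            entry = merged.setdefault(key, (fixed, {}))
            entry[1].update(genotypes)
    lines = ["\t".join(FIXED_HEADER + all_samples)]
    # Sorted by chromosome name (as text), then by numeric position.
    for key in sorted(merged):
        fixed, genotypes = merged[key]
        row = fixed + [genotypes.get(name, MISSING) for name in all_samples]
        lines.append("\t".join(row))
    return lines


# Toy input with made-up records, two populations of two samples each.
pop_a = io.StringIO(
    "\t".join(FIXED_HEADER + ["NA19625", "NA19703"]) + "\n"
    "chr1\t100\t.\tTT\tT\t.\t.\tVRT=2\tGT\t0/0\t0/1\n"
    "chr1\t200\t.\tA\tG\t.\t.\tVRT=1\tGT\t0/1\t./.\n"
)
pop_b = io.StringIO(
    "\t".join(FIXED_HEADER + ["HG00096", "HG00097"]) + "\n"
    "chr1\t100\t.\tTT\tT\t.\t.\tVRT=2\tGT\t0/0\t1/1\n"
    "chr1\t300\t.\tC\tT\t.\t.\tVRT=1\tGT\t0/1\t0/0\n"
)
merged = merge_vcf_like([pop_a, pop_b])
print("\n".join(merged))
```

With real files, the `io.StringIO` objects would be replaced by opened file handles. Everything is held in memory, so this only suits files of moderate size, and the output still lacks the `##` meta-information lines that a valid VCF needs.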
Have you already looked into VCFtools? I believe their merging, comparing, and consensus options may be beneficial.
I tried it. But since the file I am using is not a standard VCF file, I am not getting the desired output. The vcf-validator throws lots of errors when I check the files I am using.
More details on which commands you are using and which errors you get would be helpful. I am aware of another post with the same issue: "Merge individual vcf files". There is also another toolkit called vcflib (https://github.com/vcflib/vcflib) if you would care to test your data there.
Yes, sure, I will test the vcflib toolkit. Thanks for the info.
I used the following commands for vcftools:
Error:
Command to validate vcf:
Error:
Then just fix those errors ;-) You will make things a lot easier if you follow the VCF specification. bcftools could also help you, but that tool is quite strict about the VCF specification as well.
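As a first pass before running a real validator, a few of the most basic structural requirements of the VCF specification can be checked by hand: the first line is `##fileformat=VCFv...`, the `#CHROM` header starts with the eight fixed columns, and every data row has as many tab-separated fields as the header. The sketch below is only that, a hypothetical helper (`check_vcf_structure` is not part of any tool) that covers a tiny part of the specification and is no substitute for vcf-validator or bcftools.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_structure(lines):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    lines = [line.rstrip("\n") for line in lines]
    if not lines or not lines[0].startswith("##fileformat=VCFv"):
        problems.append("first line is not a ##fileformat=VCFv... line")
    n_columns = None  # set once the #CHROM header line has been seen
    for number, line in enumerate(lines, start=1):
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        fields = line.split("\t")
        if line.startswith("#"):
            if fields[:8] != FIXED_COLUMNS:
                problems.append(
                    f"line {number}: header does not start with the 8 fixed columns")
            n_columns = len(fields)
            continue
        if n_columns is None:
            problems.append(f"line {number}: data row before the #CHROM header")
            break
        if len(fields) != n_columns:
            problems.append(
                f"line {number}: {len(fields)} fields, header has {n_columns}")
    return problems


# Toy input with made-up records.
good = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tTT\tT\t.\t.\tVRT=2\tGT\t0/0",
]
# Same file without the ##fileformat line, plus a row missing its genotype.
bad = good[1:] + ["chr1\t200\t.\tA\tG\t.\t.\tVRT=1\tGT"]

print(check_vcf_structure(good))
print(check_vcf_structure(bad))
```

To check a file on disk, pass an opened file handle instead of a list of strings.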
Yes, I am trying my best. :)
Could you please tell us what makes your VCF a non-standard VCF file?
fin swimmer
By non-standard VCF I mean that it is in the dbSNP submission VCF format, which carries additional information about the study, methods, etc., along with the reference assembly ID and the INFO and FORMAT fields. Also, I had to remove the INFO and FORMAT fields because the tabidx step was not able to parse that information.
Command and Error:
Could you please post the complete header of the original vcf file and the first few variants?
Thanks.
fin swimmer
I shortened your title to make it readable.