Converting Gvcf Files Into Vcf
5
6
Entering edit mode
11.2 years ago
always_learning ★ 1.1k

Hi All,

Can some one help me in knowing algorithms for converting gVCF file for VCF file. As per https://sites.google.com/site/gvcftools/home/about-gvcf website, it can be done by removing non-vraiants regions from gVCF files but I am just wondering how to accomplish it means how to remove this ? Where to get coordinates for NON Variants region . I can write the scripts if once I know about how to accomplish it.

Thanks

vcf • 28k views
ADD COMMENT
8
Entering edit mode
11.1 years ago

To second @BruceyB -- you can download the latest gvcftools code here:

https://sites.google.com/site/gvcftools/home/download

As of this post the newest version is v0.15. There's a tool called "extract_variants" in this package which can quickly convert the file to a standard variants-only VCF, for example:

gzip -dc sample.genome.vcf.gz | extract_variants | bgzip -c > sample.vcf.gz

You can also extract a bed of the variant regions:

gzip -dc sample.genome.vcf.gz | get_called_regions > sample.variants.bed

You could whip up a script for either of these functions, but the gvcftools utilities can be faster/more convenient if you're dealing with a large number of samples. The code is also MIT licensed on github so you're welcome to fork and customize.

Several other gvcf utilities in the package are described here: https://sites.google.com/site/gvcftools/home/configuration-and-analysis

ADD COMMENT
0
Entering edit mode

Hi Chris, I was using extract_variants but its taking much time as compare to "grep -v 'BLOCKAVG' input.gvcf > output.vcf . So I am just wondering that if I may miss some rows if I used grep option or both are equally fine ?

Thanks Syed

ADD REPLY
1
Entering edit mode

Not all non-variant sites are converted to blocks -- you can have non-variants of length 1 which won't have a BLOCKAVG tag. If this matters to you, then the extract_variants option may be easier.

ADD REPLY
0
Entering edit mode

Thanks for responding to this. @syednajeebashraf: FYI Chris is the author of the gvcftools suite.

ADD REPLY
0
Entering edit mode

Thanks Chris, Ok !! I have already added grep version to my pipeline. But will try with extract_variants again. Will get back to you if need any help.

ADD REPLY
0
Entering edit mode

The only thing the grep command is doing is removing any lines that contain "BLOCKAVG". These lines are invariant regions. The remaining lines should be variant in at least one sample, otherwise they would be part of the block. The grep command is fast because it's not really doing much, where gvcftools is parsing each line in the file in a much more sophisticated way. I don't it's indicative of missing anything, but if you want to check the gvcftools output against grep, just run wc -l < file.vcf and make sure both files have the same number of lines.

ADD REPLY
0
Entering edit mode

Yes !! In fact I tried to use "extract_variants" but it's taking a large amount of time and in fact Its been 12 hour after running that and its still running. No doubt grep is much faster and easier also.

ADD REPLY
0
Entering edit mode

This doesn't sound right. The tool works as a filter so it might be waiting for stdin from you. If you're seeing it actively running for more than a few minutes (eg. by checking in top), then please report the issue if possible.

ADD REPLY
0
Entering edit mode

I tried running the file get_called_regions, I am getting nothing means 0kb file as ab output...I have checked it and repeat it many times still I ma not getting any output. Need some suggestions..and please add proper documentation also. Thanks

ADD REPLY
5
Entering edit mode
11.2 years ago

grep -v 'BLOCKAVG' input.gvcf > output.vcf

ADD COMMENT
0
Entering edit mode

I tried your command but surprisingly I got the vcf file of same size and same number of lines means no filtration. When I check the .g.vcf file, I found that there is not keyword like 'BLOCKAVG' throughout the file. Why so? I generated the .g.vcf file in the same way as mentioned in gatk haplotype documentation. Need some suggestions. Thanks.

ADD REPLY
0
Entering edit mode

I would try Chris's suggestion, which should really be the accepted answer.

ADD REPLY
2
Entering edit mode
11.2 years ago
Adam ★ 1.0k

You could use vcftools to strip out sites without at least 2 alleles.

vcftools --gzvcf <vcf_file> --min-alleles 2 --recode-to-stream --recode-INFO-all | bgzip -c > output.vcf.gz
ADD COMMENT
1
Entering edit mode
11.2 years ago
BruceB ▴ 340

gVCF files are valid VCF files, so you should be able to use them in the same way as a normal VCF file.

Perhaps you could post an example (or link to) the files which you are using, so we can see in more detail exactly how they vary from standard VCF files.

ADD COMMENT
0
Entering edit mode

You may get an example on the link provided in questions. gVCF files have additional information for non-vraiants regions also. I want to remove these information.

ADD REPLY
0
Entering edit mode

You could do it in Perl. Loop through line-by-line outputting only the lines where the alternative allele is not '.'. Probably not the most elegant solution, but it would work nevertheless.

ADD REPLY
0
Entering edit mode

Yes, It is but I am not sure still on this !! I guess their must be some more bullet proof way for this.

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Can you please elaborate or provide some reference for it. I am new to coding.

ADD REPLY
0
Entering edit mode
2.2 years ago
Shicheng Guo ★ 9.5k

gvcftools or glnexus will be the best choice. Here is the solution from glnexus

dx run workflow-glnexus -i common.gvcf_manifest=<manifest_file_id> -i common.config=gatk_unfiltered -i common.targets_bed=<bed_target_ranges> -i unify.shards_bed=<bed_genomic_partition_ranges> -i etl.shards=<num_sample_partitions>
ADD COMMENT

Login before adding your answer.

Traffic: 2650 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6