Making a vcf file from a subset of regions from gvcf files
7.4 years ago
Floris Brenk ★ 1.0k

Dear all,

We have a large exome sequencing cohort (>8000 samples) that we recently processed. Now, at the end of the pipeline, there are some weird results for a few genes, so I would like to recreate some of the final VCF files (all samples combined). However, doing this for all samples will take a really long time and requires a lot of computational power. I was wondering: would extracting just some genes of interest (around 50) from each gVCF, and then continuing with those subsetted gVCF files to speed everything up, raise any objections or introduce any biases? Or do the GATK steps require the whole gVCF to be present? Any other recommendations for the in-between steps are welcome :)

gatk vcf • 3.1k views
7.4 years ago
aays ▴ 180

I think the best approach is indeed to feed in the entire gVCFs as input, but to restrict your GenotypeGVCFs command to specific intervals to include or exclude. My understanding is that GATK works out the requested intervals up front and then processes only those (I may be wrong about this, but I've noticed substantial differences in speed when doing it myself).

If you'd like to specify certain regions, create a flat file with the file extension .intervals and feed it to the -L argument in your GATK command. An .intervals file (let's say this is called myregions.intervals) looks like this:

chromosome_1
chromosome_2:1-100

The above would make GATK only process the entirety of chromosome_1 and positions 1-100 from chromosome_2. Further documentation can be found here. There is also an -XL argument if you'd like to exclude a specified set of intervals, if that's the easier way to go.
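If your genes of interest are already available as a BED file, the interval list described above could be generated rather than typed by hand. This is only a rough sketch: the file name genes.bed and its contents are made up for illustration, and it assumes a plain tab-separated BED file. The one detail to watch is that BED coordinates are 0-based and half-open, while chr:start-end intervals are 1-based and inclusive, so the start is shifted by one.

```shell
# Hypothetical BED file of gene regions (0-based start, end not included).
printf 'chromosome_1\t0\t100\tGENE_A\nchromosome_2\t499\t1000\tGENE_B\n' > genes.bed

# Convert each BED record to chr:start-end, adding 1 to the start because
# interval strings are 1-based and inclusive. Comment lines are skipped.
awk 'BEGIN { FS = "\t" } !/^#/ { print $1 ":" ($2 + 1) "-" $3 }' genes.bed > myregions.intervals

cat myregions.intervals
```

With the two made-up records above, myregions.intervals ends up containing chromosome_1:1-100 and chromosome_2:500-1000, which can then be passed to -L.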

A sample command would look like this:

java -jar GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R reference.fasta \
   -L myregions.intervals \
   --variant sample1.g.vcf \
   --variant sample2.g.vcf \
   -o output.vcf
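With thousands of samples, typing one --variant line per gVCF is not practical. As a sketch only, the argument list could be assembled from a text file of paths. The file name gvcfs.txt and the sample names are placeholders, and the snippet just writes the resulting command to a file for review rather than running GATK.

```shell
# Hypothetical list of per-sample gVCF paths, one per line.
printf '%s\n' sample1.g.vcf sample2.g.vcf sample3.g.vcf > gvcfs.txt

# Collect one "--variant <path>" pair per non-empty line.
set --
while IFS= read -r gvcf; do
    if [ -n "$gvcf" ]; then
        set -- "$@" --variant "$gvcf"
    fi
done < gvcfs.txt

# Write the assembled command to a file instead of executing it.
# Note: paths containing spaces would need extra quoting here.
echo java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta -L myregions.intervals \
    "$@" -o output.vcf > genotype_cmd.txt

cat genotype_cmd.txt
```

Once the printed command looks right, it can be run with something like sh genotype_cmd.txt.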