Question

Why are some WES VCFs larger than others?

0

Entering edit mode

18 months ago

LayneSadler ▴ 90

I'm working with BAMs from TCGA. Some of these exomes are 10x larger than other files: 60GB vs 6GB. I assume that must be depth related.

However, the VCFs derived from these larger BAMs have 1.7M entries (300MB+). The exome itself only has ~ 64M positions (3.2B*0.02). That would mean every 37th nucleotide (64/1.7) is mutated, which doesn't make sense.

Maybe it's a high depth sample and the excess variants are low depth?
It's not a gvcf with site coverage for the entire cohort, I called the variants myself
Maybe it's something to do with mapping variants to multiple transcripts

WES sequencing genomics NGS exome • 2.1k views

ADD COMMENT • link 18 months ago by LayneSadler ▴ 90

0

Entering edit mode

"It's not a gvcf with site coverage for the entire cohort, I called the variants myself" This sentence doesn't make sense to me. gVCFs are generally single-sample. are your vcf(s) single-sample?

ADD REPLY • link 18 months ago by Jeremy Leipzig 22k

score 0 · Answer 1 · 2023-05-27

Dear Kermit Presumably, as these are Human exome data as Indicative of TCGA, a 100x or 119x depth of coverage would yield ideally 6GB data. It also depends on the exome capture one would use, for example Agilent Sureselect V8 with NCV and 5' UTRs would capture more than 68MB of exonic chunks and so ca. 6.8 GB of data is obtained for the said coverage. In your case, it is more.

In regards to BAM, it is again alignment and mapping to your reference. Your SAM is a direct alignment with your reference Index and ideally would be more than 3 times the size of your raw reads and BAM should in principle be 2/3rds or less of SAM

The VCF depends on how the variants are called and by which method and the calls, for example SNPs, Indels, CNVs etc.

I assume you have the data from service provider NOT the one analysed by yourself. If you could elaborate, will be happy to respond

Regards Prash

score 0 · Answer 2 · 2023-05-27

0

Entering edit mode

18 months ago

LayneSadler ▴ 90

vcftools has a critical bug: https://github.com/vcftools/vcftools/issues/134

I was running a depth-based QC filter for variants:

vcftools --vcf [in_file] --out [out_file] --minGQ 15 --minDP 20 --recode --recode-INFO-all

However, the minDP filter was silently not working
While 20 is great for a 30x BAM, it is probably too low for a 100x BAM

ADD COMMENT • link 18 months ago by LayneSadler ▴ 90

0

Entering edit mode

Ohh ok. Even otherwise, vcftools has a glitch. It isn't bam file as an input. It should be a bcf or vcd here.

Pl try using bcftools. Hope this helps!

ADD REPLY • link 18 months ago by Prash ▴ 280

0

Entering edit mode

https://github.com/vcftools/vcftools/issues/134

ADD REPLY • link 18 months ago by LayneSadler ▴ 90

score 0 · Answer 3 · 2023-05-27

Hi Kermit and prash (and others)!

Nice discussion. I wanted to offer some biological rationales in addition to the technical reasons that Prash has already mentioned.

Background:

First, note a VCF file is large in proportion to the number of differences it has relative to the reference genome(s) being used, because only a difference to the reference is recorded in a VCF and thus gets a new line, lengthening the file.

In a constitutional (germline) setting:

Since most are still using GRCh38, the reference is primarily a person of European ancestry. As such, relative to such a reference, persons of African or Asian heritage may have more genetic variants. This also applies to persons having genetic admixture, who tend to have more sites of heterozygosity on average than persons whose ancestry reflects one single population.

In a somatic setting:

Now, you mentioned TCGA, which is a cancer database. Depending on the driver mutations, the somatic mutation rate varies tremendously both within and across cancers. Figure 3 of alexandrov et al., 2020, illustrates this very well. In case that picture is inaccessible, the PMID is 32025018. When I say tremendously, it is not an exaggeration; on one extreme you have for instance Polymerase Q driven hypermutatory phenotype, the other extreme might be certain kinds of Oligodendroglioma, whose somatic mutation rate is most often low, rather, it is thought that epigenetic and other changes do much of the work in "turning the cells malignant" so to speak. The VCF file will, of course, grow in size in proportion to the number of genetic variants that present. Therefore, the VCF for a PolQ driven malignancies may be larger than many of the oligodendroglioma tumors, and also larger than a paired normal VCF file.

Returning to the prior answer; technical reasons:

More specifically, the VCF file will grow in proportion to the number of variants that can be called accurately, which typically means having an appreciable allele frequency. This returns us to Prash's ideas, relating to the many technical reasons why they differ in size. As already indicated, the higher the read depth of the genome (i.e., 10x, 100x, 1000x, 10000x) sequencing, the lower the variant allele fraction that can be called at a given level of confidence. This is important in a somatic setting, because intratumoral heterogeneity is in some cases predictive of prognosis, treatment failure, etc. Therefore, it can be important to capture the variants found in a subclone of a tumor even if, presently, it does not account for a large percentage of the cells in the tumor. Other important technical factors include, but are certainly not limited to: sample quality (which might vary depending on formalin fixation, age, etc), sample processing pipeline, and the tumor percentage of the sample (which directly relates to the VAF detection problem, above).

Big picture:

Of these factors, the somatic changes will tend to affect the size of the VCF more than the differences in germline variation relating to ancestry or what have you, because the somatic mutation rate can get so high. Also, of note, these factors apply not only to WES, but WGS or other designs as well.