Question

Exome sequencing dataset for SKAT

6 months ago

aurina.arnatk ▴ 10

Hello,

I would appreciate some feedback on sequencing data I'm analysing with SKAT. After a bioinformatician finished processing the data, I was given a .vcf file and a VEP annotation file in tab-delimited format, containing the variants that PASS all filters in GATK.

The dataset contains 1,194 subjects and 1,069,064 annotated variants (288,547 with MAF < 0.01). I then filtered variants further based on annotations, resulting in:

84,474 deleterious missense variants (SIFT != "tolerated", PolyPhen > 0.908), of which 5,456 have MAF < 0.01;
31,611 loss-of-function variants (stop_gained, frameshift_variant, stop_lost, start_lost, splice_acceptor_variant, splice_donor_variant), of which 1,178 have MAF < 0.01.
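For readers who want to reproduce this kind of annotation filter, here is a minimal Python sketch. It assumes the SIFT and PolyPhen criteria are combined with AND, that SIFT is compared as an exact string, and that consequence terms, the SIFT prediction and the PolyPhen score have already been parsed out of the VEP file. The function names and input layout are hypothetical, not part of VEP or SKAT.

```python
# Loss-of-function consequence terms listed in the post.
LOF_TERMS = {
    "stop_gained", "frameshift_variant", "stop_lost", "start_lost",
    "splice_acceptor_variant", "splice_donor_variant",
}

def classify_variant(consequences, sift_prediction, polyphen_score):
    """Return 'lof', 'deleterious_missense', or None for one annotated variant.

    consequences    -- iterable of consequence terms for the variant
    sift_prediction -- e.g. "deleterious" or "tolerated" (None if absent)
    polyphen_score  -- PolyPhen score as a float (None if absent)
    """
    terms = set(consequences)
    if terms & LOF_TERMS:
        return "lof"
    # Assumption: both criteria must hold (SIFT != "tolerated" AND PolyPhen > 0.908).
    if (
        "missense_variant" in terms
        and sift_prediction is not None
        and sift_prediction != "tolerated"
        and polyphen_score is not None
        and polyphen_score > 0.908
    ):
        return "deleterious_missense"
    return None

def is_rare(maf, threshold=0.01):
    """True when a minor allele frequency is known and below the threshold."""
    return maf is not None and maf < threshold
```

Because the SIFT test is an exact string comparison, a prediction such as "tolerated_low_confidence" would pass it; whether that is intended depends on how the original filter was written.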

When I create SNP sets for all available genes (combining both loss of function and deleterious missense variants) the number of variants in a gene ranges from only 1 to 50 with a median of 5.

It's my first time working with this type of data, but ~1 million variants seems relatively low for an exome sequencing dataset. My concern is that the sets of deleterious missense and loss-of-function variants will not contain enough rare variants for association testing. For example, when running SKAT-O on a larger SNP set made up of variants in genes associated with a disorder (ADHD in this case), the set contains 475 variants, of which only 370 are used for association testing (the rest show no variation in the sample).
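The drop from 475 to 370 variants can be checked independently of SKAT by counting how many variants in a set actually vary across the analysed samples. The sketch below assumes genotypes are stored as alternate-allele counts per sample (0, 1, 2, or None for missing); the function name and data layout are hypothetical.

```python
def polymorphic_variants(genotypes):
    """Return IDs of variants with more than one distinct observed genotype.

    genotypes -- dict mapping variant ID to a list of per-sample
                 alternate-allele counts (0, 1, 2, or None for missing)
    """
    kept = []
    for variant_id, calls in genotypes.items():
        observed = {g for g in calls if g is not None}
        if len(observed) > 1:  # at least two different genotype values seen
            kept.append(variant_id)
    return kept

# Toy example: v2 is monomorphic in this sample and is dropped.
example = {
    "v1": [0, 0, 1, 0],
    "v2": [0, 0, 0, 0],
    "v3": [None, 0, 2, 0],
}
kept_ids = polymorphic_variants(example)
```

A variant that is present in the full cohort can still be monomorphic after subsetting to the subjects with phenotype data, which is one possible reason for the difference.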

Could anyone please comment on the suitability of these data for analysis with SKAT-O? Is this dataset OK and, if not, which processing step should be investigated?

Thank you very much!

Kind regards, Aurina

SKAT VEP exome-sequencing • 426 views

ADD COMMENT • link 5 months ago by aurina.arnatk ▴ 10

score 0 · Answer 1 · 2024-05-30

6 months ago

Pierre Lindenbaum 164k

In a burden test, what matters is the number of samples (cases and controls) that carry at least one rare variant in the gene. So one variant per gene is low, but you might still find interesting things in the other genes. Just try.
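As a rough illustration of that quantity, the following Python sketch counts, for each gene, how many cases and controls carry at least one rare allele. It assumes a binary case/control phenotype and genotypes coded as alternate-allele counts; the function name and data layout are hypothetical, and this is only a descriptive tally, not the SKAT or burden test itself.

```python
def carrier_counts(gene_to_variants, genotypes, is_case):
    """Count carriers of at least one rare allele per gene, split by status.

    gene_to_variants -- dict: gene -> list of rare variant IDs in that gene
    genotypes        -- dict: variant ID -> per-sample alt-allele counts
                        (0, 1, 2, or None for missing)
    is_case          -- list of bool, one per sample (True = case)
    Returns dict: gene -> (n_case_carriers, n_control_carriers)
    """
    counts = {}
    for gene, variant_ids in gene_to_variants.items():
        cases = controls = 0
        for i, case in enumerate(is_case):
            # Missing calls (None) are treated as non-carriers here.
            carrier = any((genotypes[v][i] or 0) > 0 for v in variant_ids)
            if carrier and case:
                cases += 1
            elif carrier:
                controls += 1
        counts[gene] = (cases, controls)
    return counts

# Toy example with four samples: two cases followed by two controls.
toy_genotypes = {"v1": [1, 0, 0, 0], "v2": [0, 1, 0, 1], "v3": [0, 0, 0, 0]}
toy_sets = {"GENE_A": ["v1", "v2"], "GENE_B": ["v3"]}
toy_counts = carrier_counts(toy_sets, toy_genotypes, [True, True, False, False])
```

Genes with very few carriers overall are unlikely to be informative, however many distinct variants they contain.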

Thanks for the response, I'll have a go at gene-level analyses. Following up on the dataset: do you think it is worth exploring the data with more lenient filtering criteria (rather than keeping only variants that pass all filters)? If so, what would be an appropriate set of filters to expand the dataset while keeping noise at a reasonable level?

Any thoughts are really appreciated.

ADD REPLY • link 5 months ago by aurina.arnatk ▴ 10