Question

ExAC variant data

1

8.9 years ago

amruta.bn ▴ 10

Hi ,

I am a bioinformatics enthusiast from India. It would be really helpful if you could answer the questions below.

I have downloaded the VCF files from the following link: ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/. Is this the latest version? I can also see a directory named current (ftp://ftp.broadinstitute.org/pub/ExAC_release/current). Is current the latest one?
There are several different VCF files, mainly: ExAC.r0.3.nonTCGA.sites.vep.vcf.gz, ExAC.r0.3.nonpsych.sites.vcf.gz, ExAC.r0.3.sites.vep.vcf.gz and Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz.
- What is the difference between all these files?
- Can I say that nonTCGA.sites.vep.vcf.gz contains healthy individuals?
- Is the ExAC.r0.3.sites.vep.vcf.gz file a consolidated file?
- What type of data do the other two files contain?
- Do you have a variant count for all these files?

Thanks,
Amruta Nambiar

snp sequencing • 5.3k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by amruta.bn ▴ 10

Ram · Answer 1 · 2016-01-07

2

8.9 years ago

DG 7.3k

Hi Amruta,

In an FTP directory with releases (this is a general convention, not specific to ExAC), "current" is essentially always a pointer to the most recent version.

I believe the ExAC site describes this somewhere, but the samples come from a range of cohorts. The cohorts and data sources are described in the FAQ section of their website. Many of the individuals do come from a disease cohort, but typically not from rare, pediatric-onset Mendelian disease cohorts. The TCGA data that is included is germline sequencing (not tumour); given the scale of the data, they acknowledge that sample/label swaps are always possible, but this shouldn't be a major issue.

The non-TCGA VCF file has these samples removed. The non-psych file has quite a few samples removed, from the psychiatric (Psych) cohorts listed in the FAQ at the bottom. So you can't consider the cohorts "healthy" per se, but for most purposes they are suitable "healthy" controls. Yes, ExAC.r0.3.sites.vep.vcf.gz should contain all of the data.
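On the variant-count question: I don't have official numbers to hand, but you can count the records yourself once a file is downloaded. Below is a minimal Python sketch, assuming a standard tab-delimited, gzip-compressed VCF; the function name `count_variants` is just an illustrative helper, not part of any ExAC tooling. It reports both the number of records (lines) and the number of ALT alleles, since multi-allelic sites list several alleles on one line.

```python
import gzip


def count_variants(vcf_path):
    """Return (records, alt_alleles) for a gzip-compressed VCF file."""
    records = 0
    alt_alleles = 0
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines (##...), the #CHROM header, and blanks.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            records += 1
            # ALT is the fifth column; multi-allelic sites are comma-separated.
            alt_alleles += len(fields[4].split(","))
    return records, alt_alleles
```

Running it on each of the downloaded files, for example `count_variants("ExAC.r0.3.sites.vep.vcf.gz")`, would give comparable counts across the full, non-TCGA and non-psych subsets.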

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by DG 7.3k

0

Thanks a lot for your reply!!!

It is of great help to me.

I have gone through the tables containing cohorts.

Is it possible to subset the variants by cohort? With the 1000 Genomes data I subset the variants by subject ID for a specific population using VCFtools. Is it possible to filter or subset the ExAC data by cohort in the same way? I don't think they provide a panel file like the one for 1000 Genomes.

Thanks,
Amruta Nambiar

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by amruta.nambiar • 0

0

No problem, Amruta. As a suggestion, this should be a comment on my answer rather than a separate answer. The short answer is that I'm not sure, but I don't think it is easy to do at the moment. The 1000 Genomes data was collected purely for population study purposes; the ethnic population each sample comes from, plus relatedness info for the recruited trios, is all the information we have. The ExAC samples, by contrast, come from disease cohorts, and there are usually tighter restrictions on linking phenotypes to samples.

I think you would really need to contact the Consortium if you have specific requests. They might be able to accommodate you and produce VCFs of specific subsets of the data.
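In the meantime, here is a rough Python sketch of what can be checked locally. It rests on two assumptions you should verify against the file header: that the "sites" VCFs carry no per-sample genotype columns (so there is nothing for a sample-ID filter to keep), and that the INFO column carries per-population counts under keys such as `AC_AFR` and `AN_AFR` (look for the matching `##INFO` lines). The helper names are illustrative only. Note that this gives a population-level breakdown, not a cohort-level one.

```python
import gzip


def sample_ids(vcf_path):
    """Return sample IDs from the #CHROM header line (empty for a sites-only VCF)."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\n").split("\t")
                # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
                return columns[9:]
    return []


def parse_info(info_field):
    """Turn 'AC=3;AN=10;DB' into {'AC': '3', 'AN': '10', 'DB': True}."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flag entries have no '='
    return info


def population_frequencies(info_field, population):
    """Per-ALT-allele frequency for one population.

    Assumes keys named AC_<population> and AN_<population> exist in INFO.
    """
    info = parse_info(info_field)
    allele_number = int(info["AN_" + population])
    if allele_number == 0:
        return []  # no called alleles in this population at this site
    counts = info["AC_" + population].split(",")
    return [int(count) / allele_number for count in counts]
```

For an INFO string such as `AC_AFR=2,0;AN_AFR=8`, `population_frequencies(info, "AFR")` returns `[0.25, 0.0]`, one value per ALT allele.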

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by DG 7.3k

0

Thanks once again for your reply!!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by amruta.nambiar • 0