There are several different VCF files available, mainly: ExAC.r0.3.nonTCGA.sites.vep.vcf.gz, ExAC.r0.3.nonpsych.sites.vcf.gz, ExAC.r0.3.sites.vep.vcf.gz and Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz.
What is the difference between all these files?
Can I say that nonTCGA.sites.vep.vcf.gz contains healthy individuals?
Is the ExAC.r0.3.sites.vep.vcf.gz file a consolidated file?
In an FTP directory with releases (not specific to ExAC, this is a general thing), "current" is essentially always a pointer to the most recent version.
I believe the ExAC site describes this somewhere, but the samples come from a range of cohorts. The cohorts and the data sources are described in the FAQ section of their website. Many of the individuals do come from disease cohorts, but typically not rare pediatric-onset Mendelian disease cohorts. The TCGA data that is included is germline sequencing (not tumour); because of the scale of the data they acknowledge that sample/label swaps are always possible, but it shouldn't be a major issue. The non-TCGA VCF file has these samples removed. The non-psych file has quite a few samples removed, namely those from the Psych cohorts listed at the bottom of the FAQ. So you can't consider the cohorts "healthy" per se, but for most purposes they are suitable "healthy" controls. Yes, ExAC.r0.3.sites.vep.vcf.gz should be all of the data.
Thanks a lot for your reply!!!
It is of great help to me.
I have gone through the tables containing cohorts.
Is it possible to subset the variants based on cohort? For the 1000 Genomes data I subset the variants by subject ID for a specific population using VCFtools. Is it possible to filter or subset the ExAC data by cohort in the same way? I don't think they provide a panel file like the one for 1000 Genomes.
Thanks,
Amruta Nambiar
No problem, Amruta. As a suggestion, this should have been a comment on my answer rather than submitted as another answer. The short answer is that I'm not sure, but I don't think it is easy to do at the moment. The 1000 Genomes data was collected purely for population study purposes; the ethnic population a sample comes from, plus the relatedness info for the trios that were recruited, is all the info we have. The ExAC samples come from disease cohorts, and there are usually tighter restrictions on linking phenotypes to samples.
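For comparison, the population-based subsetting that works for 1000 Genomes can be sketched roughly as below. This is only an illustration: it assumes a tab-separated panel file with `sample` and `pop` columns (the layout I remember from the 1000 Genomes panel files, so check yours), and the sample IDs, the function name `samples_for_population` and the output file name are made up.

```python
import csv
import io


def samples_for_population(panel_lines, population):
    """Return sample IDs whose 'pop' column equals the given population code."""
    reader = csv.DictReader(panel_lines, delimiter="\t")
    return [row["sample"] for row in reader if row["pop"] == population]


# Tiny made-up panel excerpt (hypothetical sample IDs, for illustration only).
example_panel = io.StringIO(
    "sample\tpop\tsuper_pop\tgender\n"
    "S0001\tGBR\tEUR\tmale\n"
    "S0002\tYRI\tAFR\tfemale\n"
    "S0003\tGBR\tEUR\tfemale\n"
)

gbr_samples = samples_for_population(example_panel, "GBR")

# One sample ID per line, which is the format a "keep" list is expected to have.
keep_file_text = "\n".join(gbr_samples) + "\n"
print(keep_file_text, end="")
```

Written to a file (say `gbr_samples.txt`, a hypothetical name), a list like this is what you would hand to something like `vcftools --gzvcf input.vcf.gz --keep gbr_samples.txt --recode --out gbr_subset`. The point is that this only works because the VCF has per-sample genotype columns and a sample-to-population table exists, which is exactly what is missing for ExAC cohorts.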
You would really need to contact the Consortium I think if you have any specific requests. They might be able to accommodate you and make specific VCFs of subsetted data.
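One quick thing you can check yourself before contacting them: the "sites" in the ExAC file names suggests these are sites-only VCFs (aggregate counts per variant, no per-sample genotype columns). I am treating that as an assumption here, so verify it on the file you downloaded. A minimal sketch of the check, with hypothetical function names and made-up example lines:

```python
import gzip
import io


def sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line ([] if sites-only)."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 1-8 are CHROM..INFO, column 9 is FORMAT, samples follow.
            return columns[9:]
        if not line.startswith("#"):
            break  # reached variant records without seeing a #CHROM line
    return []


def sample_names_from_file(path):
    """Same check on a real file; handles plain and gzip-compressed VCFs."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sample_names(handle)


# Made-up headers for illustration only.
sites_only = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\tAC=1\n"
)
with_samples = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS0001\tS0002\n"
)

print(sample_names(sites_only))
print(sample_names(with_samples))
```

If the list comes back empty for the ExAC file, there are no sample columns to subset at all, so a `--keep`-style filter by subject ID has nothing to act on, and a cohort-specific VCF would have to come from the Consortium.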
Thanks once again for your reply!!