I apologise if these questions are obvious to many, but the following is unclear to me:
- Does variant calling not necessarily retrieve all variants in the genome? I ask because I have tested some sample WES data, using the VCF files provided on a website. I am interested in only a subset of the genome's variants (200k). When I filter the WES VCFs by rsID for that subset, only a small fraction is retrieved (15k).
- Does the number of retrieved variants depend on the coverage? That is, the higher the coverage, the higher the chance that the variants I am interested in get sequenced?
- I also wonder whether there could be naming mismatches between the rsIDs in the VCFs and my rsIDs. Would it therefore be better to subset the VCFs by position?
Thanks a lot for your time.
Thank you very much! This has been helpful.
May I ask which tools would be most efficient for filtering the desired variants? I was thinking of bedtools, but I am not sure, as I have never used it with VCFs.
bcftools, gatk SelectVariants, ...
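For large files the dedicated tools are the better choice (bcftools view, for instance, accepts a regions file via `-R` and can match IDs from a file with an expression like `-i 'ID=@ids.txt'`; check the manual for your version). To illustrate the logic of subsetting by position as well as by rsID, here is a minimal Python sketch using only the standard library. The function name `filter_vcf` and the tiny demo VCF are made up for illustration, and it assumes an uncompressed, tab-separated VCF whose chromosome names (e.g. `1` vs `chr1`) and genome build match those of your target list.

```python
import io


def filter_vcf(lines, positions=None, ids=None):
    """Yield header lines plus records matching a (chrom, pos) pair or an ID.

    Hypothetical helper for illustration; not part of any library.
    """
    positions = positions or set()
    ids = ids or set()
    for line in lines:
        if line.startswith("#"):
            # Keep meta-information and the column header untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, var_id = fields[0], int(fields[1]), fields[2]
        # The ID column may hold several semicolon-separated identifiers,
        # or "." when the variant has no identifier at all.
        record_ids = set(var_id.split(";"))
        if (chrom, pos) in positions or record_ids & ids:
            yield line


# Made-up demo data: the second record has no rsID, so only a
# position-based lookup can find it.
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\trs111\tA\tG\t50\tPASS\t.\n"
    "1\t2000\t.\tC\tT\t50\tPASS\t.\n"
    "2\t3000\trs333\tG\tA\t50\tPASS\t.\n"
)
wanted_ids = {"rs111"}
wanted_positions = {("1", 2000)}

kept = list(filter_vcf(io.StringIO(demo_vcf), wanted_positions, wanted_ids))
records = [l for l in kept if not l.startswith("#")]
print(len(records))
```

Matching on position catches records whose ID column is empty or uses a different identifier, which is one reason position-based subsetting can recover more variants than rsID matching alone.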