Hello,
I'm trying to help a colleague who wants to add the ClinVar database's clinical significance column to the VCF files of samples that she analysed. More specifically, we want to annotate overlapping/common variants: if a variant exists in both the patient and ClinVar, the ClinVar annotation should be carried into the new VCF/BED output file. We tried solutions from Google Bard and ChatGPT, which mainly rely on bedtools intersect, but we had problems. I think it is possible and shouldn't be that complicated, but we aren't bash terminal experts. Could anyone suggest code that would work for the situation above using the bash terminal? It should just be a simple case of overlap to carry over the corresponding annotation from the ClinVar VCF to the patient VCF. The VCF format we are using is VCF 4.2.
Here's the link for ClinVar HG19 - wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz
Thank you
-C.J
Consider looking into "annotating" a VCF or variant effect prediction. Doing a bedtools intersect of your patient VCF with a ClinVar VCF isn't necessarily a bad approach, but there are purpose-built tools for this that will likely give better results. Example thread: Is there a way to annotate existing VCF file with known disease-causing mutations?
Example tools: VEP, ANNOVAR, SnpEff, etc.
In my experience, ChatGPT will give you a better starting point from a vague initial idea than a Google search will, but it is in no way a place for inexperienced people to start doing serious, sensitive work. Please do not use ChatGPT unless you can verify everything it says or recommends. Use it to refresh your memory or to give you minor places to start, but DO NOT rely on it.
For example, with my experience, I was able to frame the right question for ChatGPT: "How can I annotate a VCF using data from the ClinVar VCF?" And here is what it says:
And when I asked it if it could show me an example using bcftools, it gave me this pretty good code (pretty close to Pierre's excellent solution that covers all bases). You may want to limit which INFO attributes are carried over in the -c parameter to bcftools annotate and use --pair-logic like he does:
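As a rough sketch of that approach (not Pierre's exact commands, and not the ChatGPT output), something along these lines should work. It assumes a recent bcftools release that supports --pair-logic, a bgzip-compressed patient file, and that only the CLNSIG and CLNDN tags from ClinVar are wanted. The file names and the function name annotate_with_clinvar are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; change them to match your own data.
PATIENT="${PATIENT:-patient.vcf.gz}"        # bgzip-compressed patient VCF
CLINVAR="${CLINVAR:-clinvar.vcf.gz}"        # ClinVar VCF downloaded from NCBI
OUT="${OUT:-patient.clinvar.vcf.gz}"        # annotated output

annotate_with_clinvar() {
    local patient="$1" clinvar="$2" out="$3"

    # The annotation file must be indexed before bcftools annotate can use it.
    bcftools index -f -t "$clinvar"

    # Copy only the listed INFO tags from ClinVar into the patient VCF.
    # "--pair-logic exact" asks for matching alleles, not just matching positions.
    bcftools annotate \
        -a "$clinvar" \
        -c INFO/CLNSIG,INFO/CLNDN \
        --pair-logic exact \
        -O z -o "$out" \
        "$patient"

    # Index the result so downstream tools can query it by region.
    bcftools index -f -t "$out"
}

# Only run when bcftools and both input files are actually present.
if command -v bcftools >/dev/null 2>&1 && [[ -f "$PATIENT" && -f "$CLINVAR" ]]; then
    annotate_with_clinvar "$PATIENT" "$CLINVAR" "$OUT"
else
    echo "bcftools or input files not found; nothing done" >&2
fi
```

One thing to check first: the ClinVar VCF names chromosomes 1, 2, ... without a "chr" prefix. If the patient VCF uses chr1, chr2, ..., nothing will match until one of the two files is renamed, for example with bcftools annotate --rename-chrs.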