Hi dear community,
I don't have any experience in variant calling, and I have to solve this problem:
Using the most recent VCF file describing ClinVar variants and a BED/GFF file of the coding sequence of curated RefSeq genes, write a script that outputs all the pathogenic and likely pathogenic variants that are found inside genes and have coverage less than 10x in the BAM file. The script should output a table with each ClinVar variant’s chromosome, genomic position, reference and alternate alleles, coverage in the BAM file and all the RefSeq transcripts that are affected by the variant.
I have access to a BaseSpace project which contains analyses (VCF, BAM...) of some biosamples (s01-NFE-CEX-NA12878-demo...).
I really don't know how to start solving this problem.
I would be glad to get some help.
Thank you very much
Is there a reason why you are not willing to give it a try in the first instance?
I don't know what to do. For instance, I don't know how to get the most recent VCF file describing ClinVar variants or a BED/GFF file of the coding sequence of curated RefSeq genes. I am new to the field.
You can find the ClinVar VCF data here (you probably want the GRCh38 build): https://ftp.ncbi.nlm.nih.gov/pub/clinvar/
The GFF file for the GRCh38 genome build is here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.gff.gz
You can convert the coding regions from the GRCh38 GFF to BED format as described in this post: "all coding regions .bed file hg38 Whole Genome Sequencing"
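If you would rather do the GFF-to-BED step yourself, a minimal Python sketch could look like the one below. It is only an illustration under a few assumptions: that your GFF3 has CDS lines whose Parent attribute looks like rna-NM_... (recent RefSeq releases do this, older ones use other IDs, so check your file), and that you treat NM_ accessions as the curated transcripts. The function names and the example lines are made up.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def gff_cds_to_bed(lines, curated_prefix="rna-NM_"):
    """Yield (chrom, start0, end, transcript) for CDS features.

    Assumption: the Parent attribute holds the transcript accession with an
    'rna-' prefix, and curated transcripts start with NM_.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        parent = parse_attributes(cols[8]).get("Parent", "")
        if not parent.startswith(curated_prefix):
            continue
        # GFF coordinates are 1-based inclusive; BED is 0-based half-open,
        # so only the start shifts by one.
        start0 = int(cols[3]) - 1
        end = int(cols[4])
        transcript = parent[4:] if parent.startswith("rna-") else parent
        yield cols[0], start0, end, transcript


# Made-up example lines, only to show the expected layout.
example_gff = [
    "##gff-version 3\n",
    "NC_000001.11\tBestRefSeq\tCDS\t69091\t70008\t.\t+\t0\t"
    "ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;gene=OR4F5\n",
    "NC_000001.11\tGnomon\tCDS\t100\t200\t.\t+\t0\t"
    "ID=cds-XP_1.1;Parent=rna-XM_1.1\n",
]

bed_rows = list(gff_cds_to_bed(example_gff))
for chrom, start0, end, transcript in bed_rows:
    print(chrom, start0, end, transcript, sep="\t")
```

Note that the RefSeq GFF names chromosomes by accession (NC_000001.11 and so on), while the ClinVar VCF uses 1, 2, ..., X, Y, MT, and your BAM may use yet another convention, so the names have to be harmonized before the files can be compared.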
Thank you very much! I have my VCF and GFF files and have converted them to dataframes. I also have a BAM file from an analysis, but I don't know how to link them together in order to find pathogenic variants. I need to find pathogenic variants that are found inside genes and have coverage less than 10x in the BAM file.
Do you have an idea?
Thank you
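One possible way to link the three pieces is sketched below in plain Python. It rests on these assumptions: the ClinVar significance is read from the CLNSIG key of the INFO column; the coding regions are already available as (chrom, start0, end, transcript) tuples in BED-style coordinates; the per-base depth has been loaded into a dictionary, for example from the three-column output (chromosome, 1-based position, depth) of samtools depth run on your BAM; and chromosome names are already the same in all three inputs. All function names and example values are hypothetical, and only the POS of each variant is checked, so longer indels are handled only approximately.

```python
from collections import defaultdict

PATHOGENIC = {"Pathogenic", "Likely_pathogenic"}


def is_pathogenic(info):
    """True if CLNSIG in a VCF INFO string names a (likely) pathogenic class.

    Terms are compared exactly, so values such as
    'Conflicting_interpretations_of_pathogenicity' do not match.
    """
    for item in info.split(";"):
        if item.startswith("CLNSIG="):
            value = item[len("CLNSIG="):]
            terms = value.replace("|", "/").replace(",", "/").split("/")
            return any(term in PATHOGENIC for term in terms)
    return False


def build_index(cds):
    """Group BED-style CDS intervals by chromosome."""
    index = defaultdict(list)
    for chrom, start0, end, transcript in cds:
        index[chrom].append((start0, end, transcript))
    return index


def transcripts_at(index, chrom, pos):
    """Transcripts whose CDS covers the 1-based position pos.

    A linear scan keeps the sketch short; for full-size files an interval
    tree or a tool such as bedtools intersect is likely a better fit.
    """
    return sorted({tx for start0, end, tx in index.get(chrom, [])
                   if start0 < pos <= end})


def low_coverage_pathogenic(vcf_lines, index, depth, min_depth=10):
    """Rows for pathogenic variants inside CDS with depth below min_depth."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt, _qual, _filt, info = (
            line.rstrip("\n").split("\t")[:8])
        pos = int(pos)
        if not is_pathogenic(info):
            continue
        transcripts = transcripts_at(index, chrom, pos)
        coverage = depth.get((chrom, pos), 0)  # missing position = no reads
        if transcripts and coverage < min_depth:
            rows.append((chrom, pos, ref, alt, coverage, ",".join(transcripts)))
    return rows


# Made-up inputs, only to show how the pieces fit together.
cds = [("1", 69090, 70008, "NM_001005484.2")]
depth = {("1", 69100): 4, ("1", 69200): 35, ("1", 69300): 2}
vcf = [
    "##fileformat=VCFv4.1\n",
    "1\t69100\t1\tA\tG\t.\t.\tALLELEID=1;CLNSIG=Pathogenic\n",
    "1\t69200\t2\tC\tT\t.\t.\tCLNSIG=Likely_pathogenic\n",
    "1\t69300\t3\tG\tA\t.\t.\tCLNSIG=Conflicting_interpretations_of_pathogenicity\n",
    "1\t80000\t4\tT\tC\t.\t.\tCLNSIG=Pathogenic/Likely_pathogenic\n",
]

rows = low_coverage_pathogenic(vcf, build_index(cds), depth)
print("chrom\tpos\tref\talt\tcoverage\ttranscripts")
for row in rows:
    print(*row, sep="\t")
```

In this toy example only the first variant is reported: the second has enough coverage, the third is not classified as pathogenic, and the fourth lies outside the coding intervals. Since you already have dataframes, the same three filters could equally be written as dataframe operations.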
Thank you very much for the help!