Hello, I am currently working on optimizing a variant calling pipeline for short-read RNA-Seq data. I have been searching for a gold-standard benchmarking dataset that comes with VCF results, but I could not find any.
I know the GIAB project provides Google-Illumina short-read RNA-Seq datasets, but there is no curated VCF for that data to compare my final results against. If anyone has an idea of what I can do, it would be really helpful.
Thank you all in advance.
As long as it is the same GIAB sample, you could compare your SNPs with the SNPs available for the whole-genome set.
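To make the comparison concrete, here is a minimal Python sketch of what "compare your SNPs with the truth set" means at its simplest. The helper names (`snp_keys`, `compare_snps`) and the toy VCF records are made up for illustration, and the matching is a naive exact match on chromosome, position, REF and ALT. It ignores genotypes and differences in variant representation, which a dedicated tool such as hap.py is designed to handle, so treat it as a sanity check rather than a replacement.

```python
def snp_keys(vcf_text):
    """Collect (chrom, pos, ref, alt) keys for SNPs from VCF-formatted text."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            # keep single-base substitutions only (no indels, no '.' or '*')
            if len(ref) == 1 and len(allele) == 1 and allele not in ".*":
                keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_snps(query, truth):
    """Naive exact-match comparison of two SNP key sets."""
    tp = len(query & truth)   # called and in the truth set
    fp = len(query - truth)   # called but not in the truth set
    fn = len(truth - query)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy records, invented for illustration (not real GIAB data).
TRUTH_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.",
])
QUERY_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.",
    "chr1\t400\t.\tT\tC\t.\tPASS\t.",
])

result = compare_snps(snp_keys(QUERY_VCF), snp_keys(TRUTH_VCF))
print(result)
```

Note that exact matching like this also fails silently if the two files use different contig naming (for example `chr1` versus `1`), so it is worth checking that first.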
Thank you so much for answering! I actually found some studies doing it the way you mentioned.
I ran the GATK best practices pipeline on the RNA-Seq reads and compared the calls to the high-confidence variants using hap.py, but the results do not make sense: the F1 scores were about 0.04, which suggests I am doing something wrong in my analysis.
I tried every troubleshooting step I could think of, such as checking my references and tool parameters, but could not pin down the cause of the problem. Do you have any idea what I could be doing wrong?
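One possible explanation, offered as an assumption rather than a diagnosis of this particular run: RNA-Seq reads only cover expressed regions, so if recall is computed against the truth variants for the whole genome, most truth variants are counted as false negatives and F1 collapses even when the calls themselves are good. The short Python sketch below shows the arithmetic with purely hypothetical counts (none of these numbers come from GIAB or from the run described above).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the effect.
true_positives = 18_000
false_positives = 2_000
truth_whole_genome = 900_000     # truth variants across the whole genome
truth_in_covered_regions = 20_000  # truth variants where RNA-Seq has coverage

precision = true_positives / (true_positives + false_positives)

# Recall against the whole-genome truth set: uncovered variants all count as misses.
recall_whole = true_positives / truth_whole_genome
f1_whole = f1_score(precision, recall_whole)

# Recall against truth variants restricted to regions the RNA-Seq data covers.
recall_covered = true_positives / truth_in_covered_regions
f1_covered = f1_score(precision, recall_covered)

print(f"whole genome:    recall={recall_whole:.3f}  F1={f1_whole:.3f}")
print(f"covered regions: recall={recall_covered:.3f}  F1={f1_covered:.3f}")
```

If the precision reported by hap.py is reasonable while recall is close to zero, restricting the comparison to regions with adequate RNA-Seq coverage would be the first thing to try. If precision is also very low, a reference or contig naming mismatch is the more likely suspect.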