Hello, I have previously posted that i was trying to find a benchmark RNA-Seq dataset for variant calling of SNPs and Indels. Someone recommended using the GIAB RNA-Seq reads and comparing it with the truth set of the same sample.
I tried running the GATK Best practices pipeline for short variant discovery on the data and comparing my results with theirs using hap.py, but got very unexpected bad results of F1 Scores 0.04, which to me indicates that i seem to be doing something wrong.
I tried troubleshooting by re-checking my references, tool parameters, even running the variant calling step on the GIAB Bam provided with the reads, but still nothing. Keeping in mind that Google's Deepvariant has used the same approach during their model evaluation and had acquired F1 Scores for the gatk results around >0.7.
So I would be really grateful if anyone could please help me identify the problem with my run, as i seem to have run out of ideas to what could be the possible cause of those results.
My references
Genome: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
VCF: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020/HG002_GRCh38_1_22_v4.2_benchmark.vcf
Reads: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_RNAseq/AshkenazimTrio/HG002_NA24385_son/Google_Illumina/mRNA/reads/ (hg002_gm24385.mrna.R1.fastq.gz and hg002_gm24385.mrna.R2.fastq.gz)
GTF: gencode.v45.basic.annotation.gtf
My workflow
1. Index preparation
STAR --runThreadN 15 --runMode genomeGenerate --genomeDir ./HG002/Index --genomeFastaFiles ./HG002/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta --sjdbGTFfile ./HG002/gencode.v31.basic.annotation.gtf --sjdbOverhang 149 --genomeSAsparseD 2
2. Mapping
STAR --runThreadN 15 --readFilesIn hg002_gm24385.mrna.R1.fastq.gz hg002_gm24385.mrna.R2.fastq.gz --genomeDir Index/ --outSAMtype BAM SortedByCoordinate --outFileNamePrefix mapped --outSAMunmapped Within --outSAMattrRGline ID:HG002 LB:lib1 SM:ONE PL:ILLUMINA --outSAMmapqUnique 60 --twopassMode Basic --outSAMattributes XS --readFilesCommand gunzip -c
3. Mark Duplicates
gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" MarkDuplicates -I mappedAligned.sortedByCoord.out.bam -O marked_duplicates.bam -M marked_dup_metrics.txt
4. SplitNCigarReads
gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" SplitNCigarReads -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta -I ./HG002/marked_duplicates.bam -O marked_split.bam
5. HaplotypeCaller
gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" HaplotypeCaller -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta -I ./HG002/marked_split.bam -O gatk.vcf --native-pair-hmm-threads 14
Note: It is also worth mentioning that i have not done the BQSR step from the pipeline for personal study reasons, but i have searched and do not think it could be the thing causing this huge gap in results.
Have you accounted for A>I RNA editing?
Yes, i specifically ran the RNA-seq best practices, and no i have not taken RNA editing into account considering this is not mentioned in the actual pipeline. Also, Deepvariant's publication tried the GATK pipeline as it is without any improvements and still got good results.