Question

Problem with Calling Variants from RNA-Seq data

0

Entering edit mode

13 months ago

Esraa ▴ 10

Hello, I have previously posted that i was trying to find a benchmark RNA-Seq dataset for variant calling of SNPs and Indels. Someone recommended using the GIAB RNA-Seq reads and comparing it with the truth set of the same sample.

I tried running the GATK Best practices pipeline for short variant discovery on the data and comparing my results with theirs using hap.py, but got very unexpected bad results of F1 Scores 0.04, which to me indicates that i seem to be doing something wrong.

I tried troubleshooting by re-checking my references, tool parameters, even running the variant calling step on the GIAB Bam provided with the reads, but still nothing. Keeping in mind that Google's Deepvariant has used the same approach during their model evaluation and had acquired F1 Scores for the gatk results around >0.7.

So I would be really grateful if anyone could please help me identify the problem with my run, as i seem to have run out of ideas to what could be the possible cause of those results.

vcf gatk benchmark giab rna-seq • 830 views

ADD COMMENT • link 13 months ago by Esraa ▴ 10

0

Entering edit mode

My references

Genome: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta

VCF: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020/HG002_GRCh38_1_22_v4.2_benchmark.vcf

Reads: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_RNAseq/AshkenazimTrio/HG002_NA24385_son/Google_Illumina/mRNA/reads/ (hg002_gm24385.mrna.R1.fastq.gz and hg002_gm24385.mrna.R2.fastq.gz)

GTF: gencode.v45.basic.annotation.gtf

My workflow

1. Index preparation

STAR --runThreadN 15 --runMode genomeGenerate --genomeDir ./HG002/Index --genomeFastaFiles ./HG002/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta --sjdbGTFfile ./HG002/gencode.v31.basic.annotation.gtf --sjdbOverhang 149 --genomeSAsparseD 2

2. Mapping

STAR --runThreadN 15 --readFilesIn hg002_gm24385.mrna.R1.fastq.gz hg002_gm24385.mrna.R2.fastq.gz --genomeDir Index/ --outSAMtype BAM SortedByCoordinate --outFileNamePrefix mapped --outSAMunmapped Within --outSAMattrRGline ID:HG002 LB:lib1 SM:ONE PL:ILLUMINA --outSAMmapqUnique 60 --twopassMode Basic --outSAMattributes XS --readFilesCommand gunzip -c

3. Mark Duplicates

gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" MarkDuplicates -I mappedAligned.sortedByCoord.out.bam -O marked_duplicates.bam -M marked_dup_metrics.txt

4. SplitNCigarReads

gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" SplitNCigarReads -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta -I ./HG002/marked_duplicates.bam -O marked_split.bam

5. HaplotypeCaller

gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" HaplotypeCaller -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta -I ./HG002/marked_split.bam -O gatk.vcf --native-pair-hmm-threads 14

Note: It is also worth mentioning that i have not done the BQSR step from the pipeline for personal study reasons, but i have searched and do not think it could be the thing causing this huge gap in results.

ADD REPLY • link 13 months ago by Esraa ▴ 10

0

Entering edit mode

Have you accounted for A>I RNA editing?

ADD REPLY • link 13 months ago by i.sudbery 21k

0

Entering edit mode

Yes, i specifically ran the RNA-seq best practices, and no i have not taken RNA editing into account considering this is not mentioned in the actual pipeline. Also, Deepvariant's publication tried the GATK pipeline as it is without any improvements and still got good results.

ADD REPLY • link 13 months ago by Esraa ▴ 10