Question

Somatic mutations from RNA-Seq data of tumor vs. normal

9.5 years ago

gil.hornung ▴ 100

Hi,

I have RNA-Seq data from a "control" cell line and from the same cell line after an oncogenic transformation was induced (3 samples of each).

I want to look for SNVs that are characteristic of the transformed cells. I have followed the GATK Best Practices for calling variants in RNA-Seq, which include alignment with STAR 2-pass followed by GATK's Split'N'Trim step (SplitNCigarReads, which splits reads into exon segments).

I now want to perform the variant calling step. I believe that using GATK's HaplotypeCaller (as detailed in the workflow) is inappropriate because, first, I have a mixture of cells with possibly different somatic mutations, and second, it does not compare the control and transformed lines. Tools such as MuTect seem more appropriate. Nevertheless, I don't know whether such tools can be used on RNA-Seq data. For example, the read depth at a given site may differ greatly between the "control" and "tumor" samples, because there are large differences in gene expression between the two.

Do any of you know of tools that can take RNA-Seq as input for calling variants in the tumor vs. normal setting?

Thank you, Gil

RNA-Seq • 5.5k views

score 1 · Answer 1 · 2016-01-11

9.3 years ago

Amitm ★ 2.3k

Hi,

If you have used the GATK Best Practices for calling variants in RNA-seq, then the final BAM files can be used with MuTect (or any other caller you want). I did not have paired RNA-seq data, only tumor samples. I used HaplotypeCaller and VarScan (on the mpileup output from the GATK-processed BAMs). The results were fine.

Two caveats:

1) In RNA-seq, unlike WES or WGS, your read depth depends not only on the sequencing coverage but also on RNA expression. So many variants present in lowly expressed mRNAs can be missed.

2) Because sequencing depth is tied to RNA expression, the coverage filters used by callers (for example, calling only where at least 10 reads are present) are also not optimal. Especially once PCR duplicates have been removed, many loci are left with <10 reads.
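To get a feel for the second caveat, one can check what fraction of sites survives a given minimum-depth cutoff. This is only a rough sketch: the function name `fraction_passing` and the depth values are made up for illustration, and in practice the depths would come from your own pileup or VCF.

```python
def fraction_passing(depths, min_depth):
    """Return the fraction of sites whose total read depth is >= min_depth."""
    if not depths:
        return 0.0
    passing = sum(1 for d in depths if d >= min_depth)
    return passing / len(depths)


# Hypothetical per-site depths; real values depend on expression level
# and library size, so treat these numbers as placeholders only.
example_depths = [3, 8, 12, 45, 2, 9, 150, 6]

for cutoff in (2, 5, 10):
    print(f"min depth {cutoff}: {fraction_passing(example_depths, cutoff):.3f} of sites kept")
```

Running something like this over a few cutoffs shows how quickly lowly expressed loci drop out as the depth requirement rises.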

This is just my experience with polyA+ RNA-seq, with read counts in the range of 40-70 million.

I'm also facing the problem of calling somatic mutations from RNA-seq data without a normal tissue: how did you filter the resulting VCF? I mean, lots of the variants will be SNPs. Also, how did you account for tumour heterogeneity? I'm asking because the real question here is "are we able to describe clonal composition from RNA-seq only?"

Reply • 8.5 years ago by cittaro.davide • 0

Hi,

This is what I did to filter the calls from RNA-seq. First, annotate the calls with read depth information (the number of reads supporting the variant and reference alleles). If you then look at the read depth for the variant allele, you will see that most calls have only ~2 supporting reads. Caveat: this is what I see with the library sizes stated above, aligned with STAR and then processed following the GATK guidelines for RNA-seq. The median variant-allele read depth is ~2 for most samples I can recall.

So I select only those SNVs that have >10 reads supporting the variant allele. Only 3-5% of SNVs are retained, but I find this a reliable set for 'discovery'. I can always check whether any specific variant was detected in the full set.
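The filter described above could be sketched roughly like this. It assumes a GATK-style VCF where the per-sample `AD` field holds comma-separated `ref,alt` read counts; other callers may encode allele depths differently, so check the header of your own VCF first. The function names and the single-sample layout are my assumptions for illustration, not part of any tool.

```python
def alt_read_support(vcf_line, sample_index=0):
    """Return the read count for the first ALT allele, or None if AD is unavailable.

    Assumes AD is 'ref,alt[,alt2...]' as written by GATK-style callers.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    if "AD" not in format_keys:
        return None
    ad_position = format_keys.index("AD")
    if ad_position >= len(sample_values):
        return None
    allele_depths = sample_values[ad_position].split(",")
    if len(allele_depths) < 2 or not allele_depths[1].isdigit():
        return None
    return int(allele_depths[1])


def filter_by_alt_support(vcf_lines, alt_reads_cutoff=10):
    """Keep header lines and records with MORE than alt_reads_cutoff ALT-supporting reads."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        alt_reads = alt_read_support(line)
        if alt_reads is not None and alt_reads > alt_reads_cutoff:
            kept.append(line)
    return kept
```

Usage would be along the lines of `filter_by_alt_support(open("calls.vcf"))`, where `calls.vcf` is a placeholder file name. Records without a usable `AD` value are dropped here, which is a design choice you may want to change.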

I haven't addressed heterogeneity using variant calls from RNA. As I noted in my previous post, which variants one calls from RNA-seq is subject to multiple confounders (the biggest being expression level), so it is a very grey area. I think the same applies to inferring clonal composition from RNA, at least for RNA sequencing library preps that involve a PCR amplification step (like Illumina).

Reply • 8.5 years ago by Amitm ★ 2.3k