I noticed a high percentage of 3'UTR mutations in human tumor RNA-seq data. Here is a paper (Fig. 3) that reports < 10% UTR mutations in its RNA-seq data.
Is such a high percentage of UTR mutations unusual? What could explain it? I would greatly appreciate your feedback.
Here are my methods.
Sample Prep
Tumor sample RNA was extracted with TRIzol (Invitrogen) and the RNeasy kit (Qiagen), then sequenced on an Illumina HiSeq 2000, 75 bp paired-end
Mapping
GATK RNAseq Best Practice - STAR-2pass with Gencode hg19 transcripts, SplitN'Trim, Indel Realignment, base recalibration.
Mutation calling
Mutect2 with default params, keep "PASS" mutations, annotated with SNPEFF
Mutation Distributions from 32 tumor samples
3'Flank 10.20%
3'UTR 42.79%
5'Flank 1.60%
5'UTR 1.49%
Frame_Shift_Del 0.29%
Frame_Shift_Ins 3.99%
IGR 0.43%
In_Frame_Del 0.05%
In_Frame_Ins 0.16%
Intron 12.62%
Missense_Mutation 16.36%
Nonsense_Mutation 0.24%
Nonstop_Mutation 0.05%
Silent 7.86%
Splice_Site 0.37%
Targeted_Region 1.49%
Translation_Start_Site 0.03%
RNA-editing isn't that prevalent in humans though. Some of the early papers from a few years ago that showed high levels of RNA editing were later shown to be very flawed. A high proportion of variants seen in RNA-seq are false positives and artefacts introduced during RNA -> cDNA conversion and PCR.
Agreed that RNA-editing detection from sequencing data is fraught with false positives.
But the main point of the question here was to determine the distribution of RNA-seq variants across CDS, UTR and intronic regions. In this context, normalization by callable bases is imperative.
Absolutely agree with you on that
Picard's CollectRnaSeqMetrics shows that 43% of my aligned bases are coding and 27% are in UTRs. Mapping alone doesn't seem to explain the majority of the 3'UTR mutations.
PCT_CODING_BASES 0.436
PCT_UTR_BASES 0.276
PCT_INTRONIC_BASES 0.067
PCT_INTERGENIC_BASES 0.231
I'll try CallableLoci to see if that will help.
Do you think filtering for false calls in duplicated regions, in homopolymeric regions, or close to splice junctions would be helpful?
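As a rough check on the Picard numbers above, the share of mutations in each region can be divided by the share of aligned bases in that region. This is only a sketch: the grouping of annotation categories into "utr", "coding" and "intronic" is my own assumption, and flank, splice-site and intergenic calls are left out because it is unclear how they map onto Picard's categories.

```python
# Share of PASS mutations (%) per category, copied from the distribution above.
mutation_pct = {
    "3'UTR": 42.79, "5'UTR": 1.49, "Intron": 12.62,
    "Missense_Mutation": 16.36, "Silent": 7.86, "Nonsense_Mutation": 0.24,
    "Nonstop_Mutation": 0.05, "Frame_Shift_Del": 0.29, "Frame_Shift_Ins": 3.99,
    "In_Frame_Del": 0.05, "In_Frame_Ins": 0.16, "Translation_Start_Site": 0.03,
}

# Share of aligned bases (%) per region, from the Picard metrics above.
aligned_pct = {"coding": 43.6, "utr": 27.6, "intronic": 6.7}

# ASSUMPTION: this mapping of annotation categories to Picard regions is
# illustrative; it is not defined anywhere in the thread.
region_of = {
    "3'UTR": "utr", "5'UTR": "utr", "Intron": "intronic",
    "Missense_Mutation": "coding", "Silent": "coding",
    "Nonsense_Mutation": "coding", "Nonstop_Mutation": "coding",
    "Frame_Shift_Del": "coding", "Frame_Shift_Ins": "coding",
    "In_Frame_Del": "coding", "In_Frame_Ins": "coding",
    "Translation_Start_Site": "coding",
}

# Sum the mutation share per region.
region_mut_pct = {}
for category, pct in mutation_pct.items():
    region = region_of[category]
    region_mut_pct[region] = region_mut_pct.get(region, 0.0) + pct

# Ratio > 1 means more mutations than the aligned-base share would predict.
enrichment = {r: region_mut_pct[r] / aligned_pct[r] for r in aligned_pct}

for region, ratio in sorted(enrichment.items()):
    print(f"{region}: {ratio:.2f}")
```

With these groupings the UTR ratio comes out around 1.6 and the coding ratio below 1, which is consistent with the point that aligned-base share alone does not account for the UTR excess.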
I intersected my mutation list with the callable loci regions. The following are the percentages of mutations that fall in callable regions. The percentage is high across mutation types.
3'UTR 99.37%
Missense_Mutation 98.91%
Silent 99.50%
Intron 94.8%
Can you elaborate on how to normalize the number of mutations by callable bases?
Hi, the mutations you called come from the GATK pipeline (MuTect), and CallableLoci is also from GATK. So of course all the mutations identified were callable in the first place, which is why MuTect picked them up. Hence ~99% regardless of UTR or missense.
In any case, the Picard metrics you posted say that 43% of the aligned bases in the sample are from CDS and 27% from UTRs. What I meant by using CallableLoci was to use three separate BED files, one each for CDS, UTRs and introns, and calculate the effective (callable) length in each case. The idea was that there might be more aligned bases in the UTR set.
As per Picard, that doesn't seem to be the case. Maybe what you are seeing is an actual biological effect. I'm not quite sure here.
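The normalization described above could be sketched roughly as follows: sum the callable interval lengths per region, then divide the mutation count by that length. The function names are hypothetical, the example numbers are made up, and I am assuming the CallableLoci summary is a BED file with the state (e.g. CALLABLE) in the fourth column.

```python
def callable_length(bed_lines, state="CALLABLE"):
    """Sum interval lengths (end - start) over BED lines with the given state.

    ASSUMPTION: tab-separated BED with the CallableLoci state in column 4.
    """
    total = 0
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) >= 4 and fields[3] != state:
            continue
        total += int(fields[2]) - int(fields[1])
    return total


def mutations_per_mb(mutation_counts, callable_bases):
    """Mutations per megabase of callable sequence, for each region."""
    rates = {}
    for region, count in mutation_counts.items():
        bases = callable_bases.get(region, 0)
        if bases <= 0:
            raise ValueError(f"no callable bases for region {region!r}")
        rates[region] = count / bases * 1e6
    return rates


# Made-up example: one tiny BED per region, restricted beforehand to
# CDS, UTR or intron coordinates (e.g. by intersecting with annotation BEDs).
utr_bed = ["chr1\t100\t200\tCALLABLE", "chr1\t200\t260\tNO_COVERAGE",
           "chr2\t0\t50\tCALLABLE"]
cds_bed = ["chr1\t1000\t1600\tCALLABLE"]

lengths = {"UTR": callable_length(utr_bed), "CDS": callable_length(cds_bed)}
rates = mutations_per_mb({"UTR": 3, "CDS": 3}, lengths)
for region, rate in rates.items():
    print(f"{region}: {rate:.1f} mutations per callable Mb")
```

Comparing these per-Mb rates between regions, instead of raw percentages, is what shows whether UTRs are really enriched or simply better covered.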
I followed the advice of the paper linked in my post and removed mismatches within the first 6 bases of the 5' read ends, which can arise from random-hexamer priming. This cut down a good proportion of the 3'UTR mutations.
Before removing mutations in first 6bp of 5' read
3'UTR 1745 36.18%
Intron 570 11.82%
Missense_Mutation 1195 24.78%
Silent 604 12.52%
After removing mutations in first 6 bp of 5' read
3'UTR 477 18.72%
Intron 308 12.09%
Missense_Mutation 697 27.35%
Silent 364 14.29%
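For anyone wanting to reproduce this kind of filter, here is a minimal sketch of the logic, not the exact procedure from the paper. It assumes you have already collected, for each variant, the alt-supporting reads as (position in the aligned read sequence, read length, reverse-strand flag); the function names and the `min_reads` threshold are hypothetical. Aligners store reverse-strand reads in reference orientation, so for those reads the 5' end is at the last stored position.

```python
def five_prime_offset(query_pos, read_length, is_reverse):
    """0-based distance of a base from the 5' end of the original read.

    query_pos is the 0-based index in the read sequence as stored in the
    alignment (reference orientation), so reverse-strand reads are flipped.
    """
    if is_reverse:
        return read_length - 1 - query_pos
    return query_pos


def keep_variant(supporting_reads, min_offset=6, min_reads=1):
    """Keep a variant only if enough alt-supporting reads carry the mismatch
    outside the first `min_offset` bases of the 5' read end.

    supporting_reads: iterable of (query_pos, read_length, is_reverse).
    """
    informative = [
        read for read in supporting_reads
        if five_prime_offset(*read) >= min_offset
    ]
    return len(informative) >= min_reads


# Variant seen only near 5' ends (forward read at index 2, reverse read
# at stored index 72 of 75, i.e. 2 bases from its 5' end): dropped.
only_near_ends = [(2, 75, False), (72, 75, True)]

# Variant also seen in the middle of a read: kept.
with_internal_support = [(2, 75, False), (30, 75, False)]

print(keep_variant(only_near_ends))
print(keep_variant(with_internal_support))
```

This drops a variant only when all of its support sits in the first 6 bases, which is a more conservative choice than discarding every mismatch-bearing read position outright.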