Is it correct to merge TCGA mutation data from multiple centers
8.9 years ago
vakul.mohanty ▴ 270

Hello,

I recently started working with TCGA protected mutation data. The SNP calls provided are made across several centers using different sequencing protocols and SNP calling algorithms. I was hoping to get pointers to literature that describes the SNP calling protocols across various centers, and if there's a difference in quality across them.

I would also be grateful for comments on whether the mutation calls from different centers can be integrated and, if so, how I can do that.

Thanking You,
Vakul

TCGA SNV • 2.7k views
8.9 years ago
trausch ★ 1.9k

My two cents on this is that it's damn difficult to compare somatic variants from different pipelines/sequencing centers. Have a look at PCAWG (Pan-Cancer Analysis of Whole Genomes): this project aims to get rid of the analysis variability by subjecting all tumor/normal samples to a uniform set of alignment and variant calling algorithms, and all samples must pass a rigorous set of quality control tests. This hopefully ensures comparability (but the differences in sequencing protocols, chemistry, GC-bias, insert size, etc. obviously persist).


Thank you,

This might provide some insight into how I can integrate or filter TCGA's germline level mutation data.

8.9 years ago

It's an area of active research. If you take the intersection of all the centers (or callers), you'll get a very high-specificity set of calls, but will miss large numbers of true positives, especially low-VAF calls. If you go the other way and take the union, you'll be very sensitive, but the FP rate is ridiculous. More intelligent merging algorithms are needed.
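To make the intersection/union trade-off concrete, here is a minimal Python sketch. It assumes each center's calls have already been reduced to a set of `(chrom, pos, ref, alt)` tuples; the center names, the toy variants, and the `merge_calls` helper are all hypothetical and not part of any TCGA tooling. A "seen by at least k centers" rule is shown only as a simple middle ground between the two extremes, not as a recommended merging method.

```python
from collections import Counter


def merge_calls(call_sets, min_support):
    """Return variants reported by at least `min_support` of the call sets.

    call_sets: dict mapping a center/caller name to a set of
               (chrom, pos, ref, alt) tuples (hypothetical input format).
    """
    counts = Counter()
    for calls in call_sets.values():
        # set() guards against a variant being listed twice by one center
        counts.update(set(calls))
    return {variant for variant, n in counts.items() if n >= min_support}


# Toy, made-up calls for illustration only
calls_by_center = {
    "center_A": {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "T")},
    "center_B": {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")},
    "center_C": {("chr1", 100, "A", "T"), ("chr3", 75, "T", "G")},
}

union = merge_calls(calls_by_center, 1)                            # most sensitive
intersection = merge_calls(calls_by_center, len(calls_by_center))  # most specific
majority = merge_calls(calls_by_center, 2)                         # seen by >= 2 centers

print(len(union), len(intersection), len(majority))
```

With real data, the variants would need a consistent representation (same reference build, normalized indels) before exact tuple matching like this means anything.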

This paper we published in 2015 compared a few different callers on a highly-confident set of somatic mutations going down to 1% VAF or lower. Figure 4D and the supplement will give you some guidance on how various combinations of callers perform: http://www.cell.com/cell-systems/abstract/S2405-4712(15)00113-1


Thanks for the pointer.

I was looking at the VCF files from TCGA's protected data and found a large disparity in the number of SNPs identified, as well as in the number of SNPs passing TCGA's own quality control measures. I'm looking to analyse germline mutations and want to start with a set of high-confidence SNPs. In the interest of reducing FP rates, would it be wise to choose a single platform to work with across cancers, or are the results comparable if I use different datasets across cancers? Also, is it possible to further filter the VCF files to remove FP variants from them?

Thanx!


The VCF files are not from germline variants - they're somatic calls. If you want to analyze germline variants, you'll need to access the BAM files and call germline mutations yourself.
