TCGA germline and somatic SNVs
6.2 years ago
susibing ▴ 20

Dear all,

I am looking for already called and annotated germline and somatic SNVs from different TCGA projects, and I have already been approved for data access through dbGaP.

Unfortunately, it seems that in the harmonized data portal all germline mutations are filtered out during variant calling (even in the controlled-access files). Please correct me if I am wrong.

Therefore, as recommended in the post "How do I obtain germline mutation for TCGA samples?", I am planning to switch to the legacy data.

Here, it seems that most patients have been analyzed on two platforms, Illumina HiSeq and Illumina GA. Do you know which of these datasets is of higher quality, and why both platforms were used? Moreover, has anyone found good documentation on how the data in the legacy archive were processed (e.g. variant caller, ...)?

Any help would be much appreciated!

TCGA snv legacy harmonized sequencing • 2.7k views
6.2 years ago

There are indeed separate tumour and normal VCF files in the GDC Legacy Archive. Indel and SNV calls appear to be split across different files.

Once you download these, you can look up the TCGA barcode via the UUID or filename using these functions which have relatively recently been posted on Biostars:

Regarding the Genome Analyser versus the HiSeq, this reflects the fact that the samples were sequenced at different institutions. The VCFs are not large, so why not just download both sets separately and then determine which samples each set relates to? For TCGA DNA-seq generally, I believe the GA was used more than the HiSeq. You could also just merge all of the files together after you have curated both groups of VCFs separately. The final data point is just a boolean of whether the variant is present or not, after all.
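To illustrate the merge idea, here is a minimal Python sketch that reduces several VCFs to a presence/absence table. It is only a sketch under assumptions: the sample names, the toy VCF contents, and the function names `read_variant_keys` and `presence_matrix` are all made up for illustration, and a variant is identified simply by chromosome, position, REF and ALT, with no normalisation or quality filtering.

```python
import io

def read_variant_keys(handle):
    """Collect (chrom, pos, ref, alt) keys from a VCF-formatted text stream."""
    keys = set()
    for line in handle:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Multi-allelic records list several ALT alleles separated by commas.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys

def presence_matrix(variants_by_sample):
    """Map each variant key to a {sample: present?} dictionary."""
    all_keys = sorted(set().union(*variants_by_sample.values()))
    return {
        key: {sample: key in keys for sample, keys in variants_by_sample.items()}
        for key in all_keys
    }

# Toy inputs (hypothetical): one GA-derived and one HiSeq-derived VCF.
HEADER = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
GA_VCF = HEADER + "1\t100\t.\tA\tG\t.\tPASS\t.\n1\t200\t.\tC\tT\t.\tPASS\t.\n"
HISEQ_VCF = HEADER + "1\t200\t.\tC\tT\t.\tPASS\t.\n2\t300\t.\tG\tA,C\t.\tPASS\t.\n"

# sample_A was (hypothetically) sequenced on both platforms, so its calls are
# the union of the two files; sample_B only has HiSeq calls.
calls = {
    "sample_A": read_variant_keys(io.StringIO(GA_VCF))
    | read_variant_keys(io.StringIO(HISEQ_VCF)),
    "sample_B": read_variant_keys(io.StringIO(HISEQ_VCF)),
}
matrix = presence_matrix(calls)

for key, row in matrix.items():
    print(key, row)
```

In practice you would open the downloaded files instead of using `io.StringIO`, and you would probably want to keep the GA and HiSeq call sets apart until you have checked how well they agree.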

You can also just obtain all of the BAMs, which I am currently doing for one of the TCGA cancers. However, it's another minefield to deal with due to the data load and the fact that BAMs were seemingly aligned to different genomes within the same genome release.

Kevin

Thank you very much!

6.0 years ago
susibing ▴ 20

Just as an addition: after being in contact with NCI GDC support, it turns out that germline SNVs are hidden in the controlled-access aggregated-somatic_mutation files; they are just not annotated as germline. I was told to overlap the open-access and controlled-access files to retrieve the germline SNVs. However, I believe that some of the mutations have already been filtered out by the pipeline due to bad quality.
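For anyone wanting to try the overlap, a rough Python sketch follows. It assumes the files are tab-separated MAFs with the usual `Tumor_Sample_Barcode`, `Chromosome`, `Start_Position`, `Reference_Allele` and `Tumor_Seq_Allele2` columns, which you should check against your own downloads. The function names, barcodes and toy rows are invented for illustration. Calls that appear only in the controlled-access file are candidates at best, since that difference can also contain low-quality calls.

```python
import csv
import io

# Columns used to identify a call (assumed MAF column names).
KEY_COLUMNS = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)

def read_maf_keys(handle):
    """Return the set of call keys in a tab-separated MAF text stream."""
    # MAF files may start with '#' comment lines before the column header.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {tuple(row[col] for col in KEY_COLUMNS) for row in reader}

def controlled_only(controlled_handle, open_handle):
    """Calls present in the controlled-access file but absent from the open one."""
    return read_maf_keys(controlled_handle) - read_maf_keys(open_handle)

# Toy data with made-up barcodes and positions.
MAF_HEADER = "#version 2.4\n" + "\t".join(KEY_COLUMNS) + "\n"
CONTROLLED_MAF = (
    MAF_HEADER
    + "TCGA-XX-0001-01A\tchr1\t100\tA\tG\n"
    + "TCGA-XX-0001-01A\tchr1\t200\tC\tT\n"
    + "TCGA-XX-0002-01A\tchr2\t300\tG\tA\n"
)
OPEN_MAF = MAF_HEADER + "TCGA-XX-0001-01A\tchr1\t200\tC\tT\n"

candidates = controlled_only(io.StringIO(CONTROLLED_MAF), io.StringIO(OPEN_MAF))
for call in sorted(candidates):
    print(call)
```

With real files, replace the `io.StringIO` objects with opened file handles (for gzipped MAFs, `gzip.open(path, "rt")`).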
