Question

Targeted cancer sequencing: can't find a database with raw data and related publications

0

5.4 years ago

mariafirulevabio ▴ 40

Dear all,

I'm trying to find raw targeted sequencing data (BAM or FASTQ, plus VCF/gVCF) for any cancer type. I also want the publications associated with these data, because I need a description of how the allele frequencies of the variants found in the VCF were confirmed (e.g., by digital PCR). However, the databases I know of don't provide biological validation of the stored NGS data.

Hope you can help me.

target sequencing database cancer open data • 1.4k views

ADD COMMENT • link 5.4 years ago by mariafirulevabio ▴ 40

0

I am not sure whether these raw data are available, because of privacy policies.

ADD REPLY • link 5.4 years ago by Benn 8.3k

0

These data can be under controlled access (e.g., TCGA).

ADD REPLY • link 5.4 years ago by mariafirulevabio ▴ 40

0

Yes, indeed. Do you have access? If not, you can download raw FASTQ data from cancer studies at the SRA, process them, and then produce your own BAMs and VCFs.

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

I need annotated VCF files (or BAMs with a description of the variant calling, if it was done) as an in silico control for a bioinformatics pipeline, and I also need wet lab confirmation of the observed variant allele frequencies. I guess my question is similar to this post.

ADD REPLY • link 5.4 years ago by mariafirulevabio ▴ 40

0

For NGS data, you may struggle to find a normal sample for which the variants have been confirmed in the wet lab. As you can imagine, validating all variants would be a costly and time-consuming task. GIAB (Genome in a Bottle) has samples for which variants have been confirmed in parallel by multiple variant calling methods, but these are not confirmed in the wet lab either.

If you search the online repositories (mainly the SRA, the Sequence Read Archive), you may find what you need.

What in the other post (by Cyriac) is not in line with what you need? Or does that post fully address your question?

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

Is biological validation a costly and time-consuming task for variants from targeted sequencing? Can validation be performed only for the set of variants of interest (e.g., hot spots)?

Cyriac pointed to NCI's GDC Legacy Archive for validated BAM files; however, that is a bioinformatic validation. I found another post by Cyriac, but I can't find the files related to the second point under the "How TCGA MAFs are made" header.

ADD REPLY • link 5.4 years ago by mariafirulevabio ▴ 40

score 0 · Answer 1 · 2019-06-24

0

5.4 years ago

mariafirulevabio ▴ 40

Since I haven't found an answer, I guess these links (post, paper) will be useful for someone with the same aims. I've decided to use a mixture of two Genome in a Bottle samples (a truth set is available) to validate somatic variant calling. It is possible to choose a desired gene panel and filter the variants in both the truth set and the output of the alignment and variant calling pipeline.
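For anyone trying the same strategy, here is a minimal Python sketch of the idea, not a tested pipeline. It assumes the variants have already been parsed out of the VCFs into plain dictionaries keyed by (chrom, pos, ref, alt), that the panel is given as BED-style intervals, and that both samples are diploid. The function names (`expected_vaf`, `in_panel`, `compare_to_truth`) and the 0.05 tolerance are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, 1-based pos, ref, alt), as in a VCF
Region = Tuple[str, int, int]        # BED-style: chrom, 0-based start, exclusive end


def expected_vaf(alt_copies_a: int, alt_copies_b: int,
                 fraction_a: float, ploidy: int = 2) -> float:
    """Expected alt allele fraction when sample A is fraction_a of the DNA mix.

    alt_copies_* is the number of alt alleles in each sample's genotype
    (0, 1 or 2 for a diploid site). Assumes equal ploidy in both samples.
    """
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must be between 0 and 1")
    mixed = fraction_a * alt_copies_a + (1.0 - fraction_a) * alt_copies_b
    return mixed / ploidy


def in_panel(variant: Variant, panel: List[Region]) -> bool:
    """True if the variant position falls inside any panel interval."""
    chrom, pos, _, _ = variant
    # BED start is 0-based and end is exclusive, VCF pos is 1-based.
    return any(c == chrom and start < pos <= end for c, start, end in panel)


def compare_to_truth(truth: Dict[Variant, float], calls: Dict[Variant, float],
                     panel: List[Region], tolerance: float = 0.05) -> dict:
    """Restrict both sets to the panel, then count matches and VAF outliers."""
    truth_in = {v: f for v, f in truth.items() if in_panel(v, panel)}
    calls_in = {v: f for v, f in calls.items() if in_panel(v, panel)}
    shared = truth_in.keys() & calls_in.keys()
    return {
        "true_positives": len(shared),
        "false_negatives": len(truth_in.keys() - calls_in.keys()),
        "false_positives": len(calls_in.keys() - truth_in.keys()),
        # Shared variants whose observed VAF is far from the expected one.
        "vaf_outliers": sorted(v for v in shared
                               if abs(truth_in[v] - calls_in[v]) > tolerance),
    }
```

For example, a variant that is heterozygous in the spiked-in sample and absent from the background sample, mixed at 10%, gives `expected_vaf(1, 0, 0.1)`, i.e. an expected VAF of about 5%. Real VCFs would also need normalisation (left-alignment, splitting multi-allelic sites) before keys like these can be compared reliably.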

I would appreciate any advice related to my original question and to the strategy I described in this post.

UPD: another useful link.

ADD COMMENT • link 5.4 years ago by mariafirulevabio ▴ 40

2

Just keep in mind that, although Genome in a Bottle calls its datasets 'truth sets', they most likely still contain false-positive and false-negative calls. These 'truth' sets were defined by processing the same samples multiple times with different sequencers; however, each sequencer has its own associated error.

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

1

Thanks, Kevin! I suppose it is better for me to choose a target panel that doesn't overlap difficult regions (repeats, high-GC-content sequences, etc.).
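A rough way to screen candidate panel regions could look like the sketch below. It assumes the reference sequence of each region has already been extracted as a plain string, and the helper names and thresholds (`gc_low`, `gc_high`, `max_run`) are arbitrary values picked for illustration, not recommended cut-offs. It only catches GC extremes and homopolymer runs, so published difficult-region BED files would likely be a better filter in practice.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the A/C/G/T bases of seq (N and others ignored)."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base in seq."""
    longest = run = 0
    previous = ""
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def is_difficult(seq: str, gc_low: float = 0.30, gc_high: float = 0.65,
                 max_run: int = 8) -> bool:
    """Flag a region with extreme GC content or a long homopolymer run.

    The thresholds are illustrative assumptions only.
    """
    gc = gc_fraction(seq)
    return gc < gc_low or gc > gc_high or longest_homopolymer(seq) >= max_run
```

Regions flagged by `is_difficult` could then be dropped from the panel BED before filtering the truth set and the pipeline output.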

ADD REPLY • link 5.4 years ago by mariafirulevabio ▴ 40

0

Indeed, particularly repeat sequences and regions with sequence similarity (there are many!).

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k