Hi
Where can I find publicly available cancer data sets with matched tumor/normal genome pairs?
I would like to test some variant detection tools against them.
I have already found the Complete Genomics data set, but it doesn't come in SAM format. They provide tools to convert it to SAM, but so far I haven't managed to compile or run them.
So please let me know if there are any other data sets available.
Regards
I guess here you want pairs with known/expected somatic mutations and frequencies. Did you find anything in this format?
Yes, it would be interesting to have that too. But first I would like to have alignments of short reads from a tumor and from the matching "normal" genome. I "think" that this is what I need.
It depends on the approach, but I would assume you start with 2 BAM files, one for normal and one for tumor. Then call variants for each and compare. What program do you plan to test?
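To make the "call variants for each and compare" step concrete, here is a minimal sketch of the comparison, assuming each sample has already been called separately and the calls are available as VCF-style text lines. The function names (load_variants, tumor_only) and the toy records are made up for illustration and do not come from any particular tool. This is only the naive set-subtraction approach; it ignores coverage, genotype quality and allele fraction.

```python
def load_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-formatted text lines."""
    variants = set()
    for line in vcf_lines:
        # Skip blank lines and header lines (they start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several ALT alleles separated by commas.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def tumor_only(normal_lines, tumor_lines):
    """Return variants called in the tumor but not in the normal sample."""
    return sorted(load_variants(tumor_lines) - load_variants(normal_lines))


# Toy records (hypothetical, not real data).
normal = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr2\t500\t.\tC\tT",
]
tumor = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr2\t500\t.\tC\tT",
    "chr3\t42\t.\tG\tA,T",
]

candidates = tumor_only(normal, tumor)
print(candidates)
```

With real files you would pass open file handles instead of lists. The tumor-only set is just a list of candidates: a site missing from the normal calls may simply have been under-covered there.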
Precisely: 2 BAM files, one for normal and one for tumor. I'm benchmarking variant detection methods and I want to include a cancer dataset in my study. I'm trying to test as many algorithms as I can (e.g. GATK, Dindel, BreakDancer, plus one home-made prototype). Do you think this is a good idea?
I think that for normal vs tumor somatic mutation detection you need to use a specialised program that takes the heterogeneity of the tumor sample into account. Have a look at Somatic Sniper. Calling polymorphisms is a different game and is better suited to Dindel, GATK, SAMtools, etc.
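As a rough illustration of why heterogeneity matters, here is a small sketch of the expected variant allele fraction for a somatic mutation in a sample that mixes tumor and normal cells. The assumptions are mine, not from any specific caller: the mutation is present in every tumor cell, the copy numbers are as given by the parameters, and the function and parameter names (expected_vaf, purity, mutant_copies, and so on) are hypothetical.

```python
def expected_vaf(purity, mutant_copies=1, tumor_ploidy=2, normal_ploidy=2):
    """Expected fraction of reads carrying a clonal somatic mutation.

    purity        -- fraction of cells in the sample that are tumor cells
    mutant_copies -- copies of the mutant allele per tumor cell
    tumor_ploidy  -- total copies of the locus per tumor cell
    normal_ploidy -- total copies of the locus per normal cell
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    mutant = purity * mutant_copies
    total = purity * tumor_ploidy + (1.0 - purity) * normal_ploidy
    return mutant / total


# Heterozygous mutation in a diploid region at decreasing tumor purity.
for purity in (1.0, 0.4, 0.1):
    print(purity, round(expected_vaf(purity), 3))
```

A germline heterozygous variant sits near 50% of reads, while the same kind of mutation in a sample with few tumor cells falls far below that, which is the range where callers that expect diploid genotypes tend to miss or reject it.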
A few points on the CG data: it is from cell lines, so it is not applicable if you're planning to benchmark callers designed for variable allele fractions. In addition, converting CG data to BAM is a pain, and it would require variant calling algorithms that are tuned to the characteristics of CG's mated gapped reads. Finally, the CG datasets are at double the normal coverage.