What are the most recommended / state-of-the-art whole genome FASTQ datasets for benchmarking purposes?
9.7 years ago

What whole genome datasets would you prefer to be used for benchmarking purposes (such as for benchmarking aligners, callers, etc.)?

While NA12878 (as well as NA12891 and NA12892) has multiple datasets available and has been used extensively for benchmarking, I wanted to see if the community had recommendations for other whole genome datasets that may have been sequenced using more state-of-the-art technology. Please also provide the URL to the dataset(s), if available. Thanks!

FASTQ Variant-calling conversions Alignment BAM • 4.4k views
9.7 years ago
lh3 33k

For whole-genome germline variant calling, the two benchmarking data sets I use are Genome In A Bottle (GIAB) and the CHM1-NA12878 pair (hapdip). For the former, you can use any NA12878 reads, e.g. from Platinum Genomes. I would recommend NA12878 data from BaseSpace over AllSeq. BaseSpace (free registration required) hosts NA12878 data produced on all kinds of Illumina machines, with both PCR+ and PCR-free prep. In addition, AllSeq said that its data was intended to be available through 09/30/2014, so we don't know when it will be taken down. I guess Illumina is big enough to host its data longer (via S3). For the latter, you can find the links to the raw data here. That repo also provides evaluation scripts.

GIAB and hapdip are complementary to each other. GIAB is a "typical" benchmark: it provides truth data and you compare your calls against the truth. However, GIAB is biased towards easy regions. GIAB is also "excessively" clean in that it excludes potential CNVs in NA12878. With a new sample, identifying CNVs is itself non-trivial, so the error rate measured on GIAB is frequently an underestimate. In comparison, hapdip is largely unbiased, but it is more complicated, as you have to deal with all kinds of tricky artifacts in variant calling. The data available for this type of benchmark is also limited. I would recommend using both benchmarks if you want to get a more complete picture.
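To make the "compare your calls against the truth" step concrete, here is a minimal Python sketch of a truth-set comparison restricted to confident regions. It is not the GIAB or hapdip evaluation code. The function names, the tuple layout `(chrom, pos, ref, alt)` and the toy data are all assumptions made for illustration, and it matches variants by exact string equality, whereas real evaluations also have to normalize variant representations.

```python
from bisect import bisect_right

def in_confident_regions(chrom, pos, regions):
    """Return True if a 1-based position lies in a 0-based, half-open interval.

    `regions` maps chrom -> sorted, non-overlapping (start, end) pairs,
    i.e. BED-style coordinates (an assumption of this sketch).
    """
    intervals = regions.get(chrom, [])
    starts = [s for s, _ in intervals]
    i = bisect_right(starts, pos - 1) - 1  # last interval starting at or before pos
    return i >= 0 and pos - 1 < intervals[i][1]

def evaluate(calls, truth, regions):
    """Count TP/FP/FN inside the confident regions only."""
    calls = {v for v in calls if in_confident_regions(v[0], v[1], regions)}
    truth = {v for v in truth if in_confident_regions(v[0], v[1], regions)}
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# Toy data (hypothetical): variants as (chrom, 1-based pos, ref, alt).
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 900, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 300, "T", "C"), ("chr1", 950, "G", "C")}
regions = {"chr1": [(50, 400)]}

print(evaluate(calls, truth, regions))
```

Note that the call at position 950 is never counted as an error because it falls outside the confident regions. That is the mechanism behind the underestimated error rate: anything in the hard, excluded regions is simply not scored.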

9.7 years ago

I would suggest the Illumina Platinum Genomes WGS data: http://www.illumina.com/platinumgenomes/


Thank you for the reminder about the Platinum Genomes from Illumina. I've accessed them and will be using the NA12878 FASTQs for benchmarking.

9.7 years ago
donfreed ★ 1.6k

Although access has supposedly closed, HiSeq X Ten WGS data is still available from AllSeq.

http://allseq.com/x-ten-test-data

Previously mentioned in this thread:

Free HiSeq X Ten human genome fastq test data


Thank you - I was able to access the FASTQ files and I'll use this as a benchmark!

9.7 years ago

For benchmarking, synthetic data - for which you know the correct answer - is much better than real data, for which the truth is subjective.
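As a rough illustration of what "knowing the correct answer" means, the Python sketch below plants SNVs in a random reference, samples reads from the mutated sequence and keeps the planted variants as the truth set. Everything in it (the `simulate` function, its parameters, the read naming) is hypothetical, and it is deliberately simplistic: reads are error-free, single-end and haploid, with a constant quality string. Dedicated read simulators model sequencing errors and should be preferred for real benchmarks.

```python
import random

def simulate(ref_len=1000, n_snvs=5, read_len=50, n_reads=200, seed=0):
    """Return (reference, truth, reads) for a toy haploid sample.

    truth: list of (1-based pos, ref base, alt base) for the planted SNVs.
    reads: FASTQ records as strings, sampled without sequencing errors.
    """
    rng = random.Random(seed)
    reference = "".join(rng.choice("ACGT") for _ in range(ref_len))

    # Plant SNVs at distinct positions and record them as the truth set.
    sample = list(reference)
    truth = []
    for pos in sorted(rng.sample(range(ref_len), n_snvs)):
        alt = rng.choice([b for b in "ACGT" if b != reference[pos]])
        sample[pos] = alt
        truth.append((pos + 1, reference[pos], alt))
    sample = "".join(sample)

    # Sample fixed-length reads uniformly from the mutated sequence.
    reads = []
    for i in range(n_reads):
        start = rng.randrange(ref_len - read_len + 1)
        seq = sample[start:start + read_len]
        qual = "I" * read_len  # constant Phred 40 in Phred+33 encoding
        reads.append(f"@read{i}_pos{start + 1}\n{seq}\n+\n{qual}")
    return reference, truth, reads

reference, truth, reads = simulate()
print(truth)
print(reads[0])
```

Because the read name records where each read came from, the same data can be used to score an aligner as well as a caller.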

7.0 years ago
jackyen • 0

Hi, I'm curious whether anyone knows of any publicly available NA12878 RNA-seq data. Thanks
