Question

Where Can I Find NGS/Array Sample Datasets?

7

15.3 years ago

Jarretinha 3.5k

Hi everyone,

I'm looking for NGS/array sample datasets for teaching purposes. Most people I know who have this kind of data are somewhat protective of it. I'm interested in the whole data pipeline, so raw data is very much wanted too. It doesn't need to be a huge dataset, just a very illustrative one. If you know where I can find any, please let me know!!!

-- Edit --

Just to clarify: I don't need a complete dataset from each pipeline step. A small real sample is enough. The real part is crucial. Simulations are welcome, too.

next-gen sequencing array dataset • 10.0k views

ADD COMMENT • link updated 4.6 years ago by Biostar 20 • written 15.3 years ago by Jarretinha 3.5k

3

One thing to note is that many of the raw data processing tools are not easy to install and require a full directory structure to operate correctly. Here is a page with information and official guides for the Illumina pipeline that I prepared for our group.

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 15.3 years ago by Istvan Albert 103k

1

How "raw" do you want it? Take the Illumina system. Data processing starts with images. Image analysis transforms those into intensities; intensities are transformed into basecalls; basecalls are then mapped to the genome.
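
To make the intensities-to-basecalls step concrete for a teaching audience, here is a minimal sketch in Python. It is a toy illustration only, not how the Illumina software works: it assumes one cluster with four channel intensities per cycle, picks the brightest channel as the base, and reports the share of signal in that channel as a crude confidence. The function name `call_bases` and all numbers are made up for illustration.

```python
# Toy base caller for a single cluster.
# Each cycle is a tuple of four channel intensities in A, C, G, T order.
# The values below are invented for illustration, not real instrument output.
CHANNELS = "ACGT"


def call_bases(intensities):
    """Return (sequence, purities) for a list of per-cycle intensity tuples."""
    bases = []
    purities = []
    for cycle in intensities:
        total = sum(cycle)
        # The brightest channel wins the call for this cycle.
        best = max(range(4), key=lambda i: cycle[i])
        bases.append(CHANNELS[best])
        # Purity: share of the total signal found in the brightest channel.
        purities.append(cycle[best] / total if total > 0 else 0.0)
    return "".join(bases), purities


example = [
    (900.0, 50.0, 30.0, 20.0),   # clear A
    (40.0, 60.0, 850.0, 50.0),   # clear G
    (300.0, 280.0, 20.0, 10.0),  # ambiguous: A barely beats C
]
seq, purity = call_bases(example)
print(seq)
print([round(p, 2) for p in purity])
```

The third cycle shows why real base callers need more than a simple maximum: the call is made, but the low purity signals that it should not be trusted much.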

ADD REPLY • link 15.3 years ago by Istvan Albert 103k

0

Would anyone provide big image data? The rawest format I have seen in archives so far is FASTQ.

ADD REPLY • link 15.3 years ago by Michael 56k

0

I was only asking because Jarretinha mentioned that he wants to learn about the "whole pipeline". The images from a single run total around 2 terabytes; for that much data, the fastest way to transfer it is shipping hard disks.
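
A quick back-of-the-envelope check of the 2 terabyte figure, as a sketch in Python. The sustained link speeds are assumptions chosen for illustration, and the helper name `transfer_hours` is hypothetical; plug in whatever bandwidth you actually have.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Hours needed to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second / 3600


run_size = 2e12  # 2 terabytes, the figure quoted above

# Assumed sustained throughputs, not measurements.
for label, speed in [("10 Mbit/s", 10e6), ("100 Mbit/s", 100e6)]:
    print(f"{label}: {transfer_hours(run_size, speed):.1f} hours")
```

Even at a sustained 100 Mbit/s the transfer takes almost two days, and at 10 Mbit/s more than two weeks.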

ADD REPLY • link 15.3 years ago by Istvan Albert 103k

0

Yeah, when I say raw I mean as raw as you can get. Raw as in fresh meat. I do know that a whole run on any NGS machine is too much data for any connection. But a dozen images are more than enough to illustrate image processing concepts on real NGS data. And it needs to be real because my target audience is mainly composed of people who are not computer/biology geeks. They really must feel the complexity of the task. By the way, surface mail is allowed in answers :)

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

No snail mail necessary ;) just googled it out.

ADD REPLY • link 15.3 years ago by Michael 56k

Ram · Answer 1 · 2010-04-09

7

15.3 years ago

Michael 56k

How about the NCBI Sequence Read Archive?

And for arrays and RNA-seq (it's a bit hard to find exactly that):

- GEO
- ArrayExpress

Here's an illustrative RNA-seq example which is also not too big.

Edit: a link to Illumina raw image data and the SWIFT software:

Found the SWIFT software for primary analysis of Illumina data. There's a link to example tile data. Here's the article.

Maybe the authors know how to get more sample data.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 15.3 years ago by Michael 56k

1

What do you mean by "raw data are very hard to find"? Are you talking about actual images from the sequencers? Most newer pipelines only store those temporarily, as the terabytes of storage necessary just aren't worth it. The fastq files in the SRA, with unmapped reads and quality scores, are about as close to raw data as you'll readily find. Same goes for the CEL files from microarrays. Showing a representative image can be useful while explaining concepts, but very few analyses that we do actually start from the raw images.
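
For anyone showing students what "unmapped reads and quality scores" look like in practice, here is a minimal sketch of reading FASTQ records in Python. It assumes the simple four-lines-per-record layout and the Phred+33 quality encoding; files from older Illumina pipelines may use a different offset, so treat the `offset` default as an assumption. The helper names `read_fastq` and `phred_scores` are made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (offset is assumed)."""
    return [ord(ch) - offset for ch in quality]


# A tiny hand-written record standing in for a real file.
toy = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(toy))
for name, seq, qual in records:
    print(name, seq, phred_scores(qual))
```

To run it on a real file, replace the `io.StringIO` object with `open("your_file.fastq")`.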

ADD REPLY • link 15.3 years ago by Chris Miller 22k

0

That's the spirit!!! Traces are a very good start. And array tables too. NCBI/EBI have a huge number of datasets. But raw data are really difficult to find. If you know someone who can help with this, that would be great!

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

I know that most seq-related pipelines start rather far from raw data. But my purpose is teaching. So I think that showing the bare bones of a microarray/sequencing machine can be very motivational, for better or worse. And data found on vendors' sites are too clean to be real. Besides that, my bioinformatics training started with x-ray crystallography. Microarray images are not so different from diffraction patterns. I think they're both funny!!!

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

Ram · Answer 2 · 2010-04-09

3

15.3 years ago

Paulo Nuin ★ 3.7k

Simulate your own and be merry.

http://linuxjunk.blogspot.com/2009/08/attempt-to-use-bowtie-on-simulared.html
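
If you want to simulate your own data without any external tool, a minimal sketch in Python follows. It is not the method from the linked post; it simply assumes a random reference, uniformly placed reads, and independent substitution errors, and it writes a constant placeholder quality string. The names `simulate_reads` and `to_fastq` and all parameter values are chosen for illustration.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads uniformly from reference and add random substitution errors."""
    rng = random.Random(seed)  # fixed seed so a class can reproduce the data
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((f"sim_{i}_pos{start}", "".join(bases)))
    return reads


def to_fastq(reads, quality_char="I"):
    """Format reads as FASTQ text with a constant, placeholder quality string."""
    return "".join(
        f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n" for name, seq in reads
    )


# A random 200-base "genome" stands in for a real reference sequence.
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice("ACGT") for _ in range(200))
reads = simulate_reads(reference, n_reads=5, read_len=36, error_rate=0.01)
print(to_fastq(reads), end="")
```

Because the true start position is stored in each read name, students can check afterwards whether a mapper put the read back in the right place.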

ADD COMMENT • link updated 6.0 years ago by Ram 45k • written 15.3 years ago by Paulo Nuin ★ 3.7k