Question

Reading Genotyping Data From Illumina GenomeStudio Into R

14.9 years ago

Farrel ▴ 240

We recently genotyped a sample of subjects at 1.1 million SNP/CNV markers using the Infinium assay. The data are currently a project within Illumina GenomeStudio. I have imported some columns containing pedigree, affected status, race, and ethnicity into the project, but I also have those data in a separate table.

How do I read the data from GenomeStudio into R? Are you aware of any published examples or case vignettes?

Is beadarraySNP the package to use?

The data seem to be stored in a directory with the following files:

tabledat.bin, pairtable.bin, seqdata.bin, sd.bin, heredity.bin, Duplicates.bin, projdat.bin, PairedData.bin, ad.bin, ld.bin

How does one go from those files to reading the data into R?

I want to end up with a data frame that has as many rows as I have subjects and as many columns as I have SNP markers + CNV markers + pedigree fields + phenotype fields.

r bioconductor illumina snp genotyping • 25k views


Ram · Answer 1 · 2010-08-25

To import your data into R with beadarraySNP, you first have to create a report in GenomeStudio using the report wizard:

1. From the Analysis menu, choose Reports > Report wizard...
2. Choose Final report.
3. Select the samples you want included and click Next.
4. Choose the Standard radio button at the top and the Tab radio button under General options.
5. Select the fields you want in your report. At the very least, beadarraySNP requires the SNP Name and Sample ID fields. See the BeadStudio Data section of the read.SnpSetIllumina() man page for all the options.
6. Check Create MAP files to get a head start on creating a sample sheet.
7. Click Next and then Finish to create the report files.

The data can now be read into R with a command like

library(beadarraySNP)
myData <- read.SnpSetIllumina(Sample_Map2Samplesheet("Sample_Map.txt"), reportfile = "myData_FinalReport.txt")

Remember to add nochecks=TRUE if you did not include all the required fields in your report.

Data columns are put in matrices in the assayData slot of the resulting object, while annotation fields are put in the featureData slot of the object.

score 2 · Answer 2 · 2010-08-26

As Jan says, R/Bioconductor works best with the reports exported from Illumina's proprietary "Studio" software. There are very few (if any) options for processing raw, binary data files directly using R.

I recently made some notes about Illumina and Bioconductor packages on our (internal) wiki. I've pasted them below, almost "as is" - maybe you can glean something from them. In summary: the best approach is to export from Illumina software to text files and import to R using read.table().
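A minimal base-R sketch of that export-then-read.table() route follows. It assumes the Final Report is tab-delimited, with a `[Header]` block followed by a `[Data]` line and then the column names. The file contents, sample IDs, SNP names, and the phenotype table are all made up for illustration, so adjust them to match your own export.

```r
# Write a tiny mock Final Report so the example is self-contained.
# (Hypothetical content; a real file comes from the GenomeStudio report wizard.)
report_file <- tempfile(fileext = ".txt")
writeLines(c(
  "[Header]",
  "Num SNPs\t2",
  "Num Samples\t2",
  "[Data]",
  "SNP Name\tSample ID\tAllele1 - AB\tAllele2 - AB\tGC Score",
  "rs0001\tS1\tA\tA\t0.95",
  "rs0002\tS1\tA\tB\t0.90",
  "rs0001\tS2\tB\tB\t0.88",
  "rs0002\tS2\tA\tA\t0.91"
), report_file)

# Locate the [Data] marker and read everything after it.
# Reading as character stops R from turning allele letters such as "T" into TRUE.
data_line <- grep("^\\[Data\\]", readLines(report_file))[1]
long <- read.table(report_file, header = TRUE, sep = "\t", skip = data_line,
                   quote = "", comment.char = "", check.names = FALSE,
                   colClasses = "character")
long[["GC Score"]] <- as.numeric(long[["GC Score"]])

# Combine the two allele columns into one genotype call, e.g. "AB".
long$genotype <- paste0(long[["Allele1 - AB"]], long[["Allele2 - AB"]])

# Pivot to one row per sample and one column per SNP.
samples <- unique(long[["Sample ID"]])
snps <- unique(long[["SNP Name"]])
calls <- matrix(NA_character_, nrow = length(samples), ncol = length(snps),
                dimnames = list(samples, snps))
calls[cbind(long[["Sample ID"]], long[["SNP Name"]])] <- long$genotype
geno <- data.frame(sample_id = rownames(calls), calls,
                   stringsAsFactors = FALSE, check.names = FALSE)

# Hypothetical pedigree/phenotype table kept outside GenomeStudio.
pheno <- data.frame(sample_id = c("S1", "S2"),
                    family = c("F1", "F1"),
                    affected = c(1L, 0L),
                    stringsAsFactors = FALSE)

# One row per subject: phenotype fields followed by one column per marker.
wide <- merge(pheno, geno, by = "sample_id")
print(wide)
```

With 1.1 million markers the wide data frame gets very large, so in practice it may be worth subsetting the SNPs of interest before pivoting.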

beadarray

- reads bead-level or bead-summary data
- bead-summary requires at minimum the file SampleProbeProfile.txt
  - data files are generated by Illumina BeadStudio software (gene expression module)
  - method readBeadSummaryData() creates ExpressionSetIllumina object
- bead-level requires txt/csv files and, optionally, TIFFs, targets.txt, annotation and metrics files
  - these are generated by Illumina BeadScan software
  - method readIllumina() creates BeadLevelList object

crlmm

reads binary idat files from the Illumina scanner (+ a CSV description file)
method readIdatFiles() creates NChannelSet object

lumi

reads "the Illumina raw data output of the Illumina Bead Studio toolkit from version 1 to version 3"
the "probe profile" output is preferred
method lumiR() creates a LumiBatch object

beadarraySNP

read.SnpSetIllumina() method notes:

BeadStudio Data

To process experiments that were processed with BeadStudio, only two files are needed: the sample sheet and the Final Report file.
The sample sheet must contain the same columns as for GenCall. The report file should contain the following columns: ‘SNP Name’, ‘Sample ID’, ‘GC Score’, ‘Allele1 - AB’, ‘Allele2 - AB’, ‘GT Score’, ‘X Raw’, and ‘Y Raw’.
‘SNP Name’ and ‘Sample ID’ are used to form rows and columns in the experimental data, ‘GC Score’ is put in the callProbability matrix, ‘Allele1 - AB’ and ‘Allele2 - AB’ are combined into the call matrix, ‘GT Score’ is added to the featureData slot, ‘X Raw’ is put in the R matrix and ‘Y Raw’ in the G matrix.
Other columns in the report file are added as matrices in the assayData slot, or columns in the featureData slot if values are identical for all samples in the reportfile
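To make that column-to-matrix mapping concrete, here is a rough base-R sketch of the same layout. It only mimics the structure described above and is not how beadarraySNP is implemented. The short column names, the helper `to_matrix()`, and all values are hypothetical.

```r
# Mock long-format report rows (hypothetical values, shortened column names).
long <- data.frame(
  snp    = c("rs0001", "rs0002", "rs0001", "rs0002"),  # 'SNP Name'
  sample = c("S1", "S1", "S2", "S2"),                  # 'Sample ID'
  a1     = c("A", "A", "B", "A"),                      # 'Allele1 - AB'
  a2     = c("A", "B", "B", "A"),                      # 'Allele2 - AB'
  gc     = c(0.95, 0.90, 0.88, 0.91),                  # 'GC Score'
  gt     = c(0.80, 0.75, 0.80, 0.75),                  # 'GT Score'
  stringsAsFactors = FALSE
)

snps <- unique(long$snp)
samples <- unique(long$sample)

# Hypothetical helper: spread one long column into a SNP x sample matrix.
to_matrix <- function(values) {
  m <- matrix(NA, nrow = length(snps), ncol = length(samples),
              dimnames = list(snps, samples))
  m[cbind(long$snp, long$sample)] <- values
  m
}

# Per-sample values become matrices, analogous to the assayData slot.
assay <- list(
  call            = to_matrix(paste0(long$a1, long$a2)),
  callProbability = to_matrix(long$gc)
)

# A column whose value is identical across samples for every SNP is
# per-SNP annotation, analogous to a featureData column.
gt <- to_matrix(long$gt)
same_for_all_samples <- all(apply(gt, 1, function(r) length(unique(r)) == 1))
features <- if (same_for_all_samples) {
  data.frame(GTScore = gt[, 1], row.names = snps)
} else {
  NULL
}

print(assay$call)
print(features)
```

Transposing a matrix such as `assay$call` with `t()` gives the subjects-as-rows orientation asked for in the question.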

Ram · Answer 3 · 2010-08-25

14.9 years ago

User 59 13k

crlmm?


This works with data created by scan studio, not GenomeStudio.

14.9 years ago by Jan Oosting ▴ 920

Didn't know that Jan, cheers.

14.9 years ago by User 59 13k

score 1 · Answer 4 · 2010-10-19

The Bonsai report plug-in allows GenomeStudio to export data directly as Rdata suitable for the Bioconductor package snpMatrix. There are other goodies on the SourceForge site as well: http://outmodedbonsai.sourceforge.net/. The author has apparently been working on CNV analysis lately and has managed to run GenomeStudio on Linux. I don't know how that is done, though.