Reading Genotyping Data From Illumina GenomeStudio Into R
14.3 years ago
Farrel ▴ 240

We have recently conducted 1.1 million SNP/CNV genotyping on a sample of subjects using the Infinium assay. The data is currently a project within Illumina GenomeStudio. I have imported some columns containing pedigree, affected status, race and ethnicity into the project, but I also have that data in a separate table.

How do I read the data from GenomeStudio into R? Are you aware of any published examples or case vignettes?

Is beadarraySNP the package to use?

The data seems to be stored in a directory with the following files:

tabledat.bin, pairtable.bin, seqdata.bin, sd.bin, heredity.bin, Duplicates.bin, projdat.bin, PairedData.bin, ad.bin, ld.bin

How does one go from those files to reading the data into R?

I want to end up with a data frame that has one row per subject and one column per SNP marker, CNV marker, pedigree field and phenotype field.

r bioconductor illumina snp genotyping • 24k views
14.3 years ago
Jan Oosting ▴ 920

To import your data into R with beadarraySNP, you'll have to create a report from GenomeStudio through the report wizard:

  • From the Analysis menu choose Reports > Report wizard...
  • Now choose Final report
  • Select the samples you want included, click next
  • Choose the Standard radio-button on top, and the Tab radio-button in General options
  • Now you can select the fields you want in your report. At the very least beadarraySNP requires the SNP Name and Sample ID fields. Read the BeadStudio Data section of the read.SnpSetIllumina() man page to get all options.
  • Check the Create MAP files option to get a head start on creating a sample sheet
  • Click Next and Finish to create the report files

The data can now be read into R with a command like

library(beadarraySNP)
myData <- read.SnpSetIllumina(Sample_Map2Samplesheet("Sample_Map.txt"), reportfile = "myData_FinalReport.txt")

Do not forget to add nochecks=TRUE if you did not put all the required fields in your report.

Data columns are put in matrices in the assayData slot of the resulting object, while annotation fields are put in the featureData slot of the object.
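To get from such an object to the one-row-per-subject table asked for in the question, the genotype matrix has to be transposed (markers are in rows, samples in columns) and joined with the phenotype table. A minimal base R sketch follows. The toy `calls` matrix stands in for something like `assayData(myData)$call`, and the `pheno` table with its `SampleID` and `affected` columns is made up for illustration.

```r
# Toy stand-in for the genotype call matrix: SNPs in rows, samples in columns.
# With a real object this would come from something like assayData(myData)$call
# (assumption: the element is named "call", as the man page describes).
calls <- matrix(c("AA", "AB", "BB",
                  "AB", "AB", "AA"),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("rs001", "rs002"), c("S1", "S2", "S3")))

# Hypothetical phenotype/pedigree table kept outside GenomeStudio.
pheno <- data.frame(SampleID = c("S1", "S2", "S3"),
                    affected = c(1, 0, 1),
                    stringsAsFactors = FALSE)

# Transpose so that each subject becomes a row and each marker a column.
geno <- as.data.frame(t(calls), stringsAsFactors = FALSE)
geno$SampleID <- rownames(geno)

# Join on the sample identifier; all.x = TRUE keeps subjects without genotypes.
subjects <- merge(pheno, geno, by = "SampleID", all.x = TRUE)
print(subjects)
```

With around a million markers a data frame this wide gets heavy, so it may be more practical to keep the genotypes as a matrix and join only the markers of interest.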

14.3 years ago
Neilfws 49k

As Jan says, R/Bioconductor works best with the reports exported from Illumina's proprietary "Studio" software. There are very few (if any) options for processing raw, binary data files directly using R.

I recently made some notes about Illumina and Bioconductor packages on our (internal) wiki. I've pasted them below, almost "as is" - maybe you can glean something from them. In summary: the best approach is to export from Illumina software to text files and import to R using read.table().
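As a rough illustration of the read.table() route, the sketch below reads a tab-delimited Final Report that has a `[Header]` block followed by a `[Data]` marker and a column-header line. The toy file contents are invented, and the exact header layout of a real export is an assumption that should be checked against your own file.

```r
# Write a tiny, made-up Final Report to a temporary file.
report_lines <- c(
  "[Header]",
  "Num SNPs\t2",
  "Num Samples\t2",
  "[Data]",
  "SNP Name\tSample ID\tAllele1 - AB\tAllele2 - AB\tGC Score",
  "rs001\tS1\tA\tA\t0.95",
  "rs001\tS2\tA\tB\t0.90",
  "rs002\tS1\tB\tB\t0.88",
  "rs002\tS2\tA\tB\t0.40"
)
report_file <- tempfile(fileext = ".txt")
writeLines(report_lines, report_file)

# Locate the "[Data]" marker near the top of the file;
# the column names are on the line right after it.
head_lines <- readLines(report_file, n = 50)
data_start <- which(head_lines == "[Data]")[1]

# Skip everything up to and including "[Data]", then read the table.
# check.names = FALSE keeps column names such as "SNP Name" unchanged.
report <- read.table(report_file, skip = data_start, header = TRUE, sep = "\t",
                     check.names = FALSE, stringsAsFactors = FALSE,
                     quote = "", comment.char = "")
str(report)
```

If nucleotide allele columns are exported instead of AB alleles, passing colClasses = "character" for those columns avoids a column consisting only of T being read as logical.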

beadarray

  • reads bead-level or bead-summary data
    • bead-summary requires at minimum the file SampleProbeProfile.txt
    • data files are generated by Illumina BeadStudio software (gene expression module)
    • method readBeadSummaryData() creates ExpressionSetIllumina object
    • bead-level requires txt/csv files and optionally, TIFFs, targets.txt, annotation and metrics files
    • these are generated by Illumina BeadScan software
    • method readIllumina() creates BeadLevelList object

crlmm

  • reads binary idat files from the Illumina scanner (+ a CSV description file)
  • method readIdatFiles() creates NChannelSet object

lumi

  • reads "the Illumina raw data output of the Illumina Bead Studio toolkit from version 1 to version 3"
  • the "probe profile" output is preferred
  • method lumiR() creates a LumiBatch object

beadarraySNP

read.SnpSetIllumina() method notes:

BeadStudio Data

  • To process experiments that were processed with BeadStudio, only two files are needed: the sample sheet and the Final Report file
  • The sample sheet must contain the same columns as for GenCall; the report file should contain the following columns: ‘SNP Name’, ‘Sample ID’, ‘GC Score’, ‘Allele1 - AB’, ‘Allele2 - AB’, ‘GT Score’, ‘X Raw’, and ‘Y Raw’
  • ‘SNP Name’ and ‘Sample ID’ are used to form rows and columns in the experimental data, ‘GC Score’ is put in the callProbability matrix, ‘Allele1 - AB’ and ‘Allele2 - AB’ are combined into the call matrix, ‘GT Score’ is added to the featureData slot, ‘X Raw’ is put in the R matrix and ‘Y Raw’ in the G matrix.
  • Other columns in the report file are added as matrices in the assayData slot, or columns in the featureData slot if values are identical for all samples in the reportfile
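The column-to-matrix mapping in these notes can be mimicked in base R. The sketch below is not the package's implementation; it only reproduces the described layout (SNPs in rows, samples in columns) from a made-up long-format report, and the names `call_matrix` and `call_prob` are illustrative.

```r
# Made-up long-format report: one line per SNP/sample combination.
report <- data.frame(
  "SNP Name"     = c("rs001", "rs001", "rs002", "rs002"),
  "Sample ID"    = c("S1", "S2", "S1", "S2"),
  "Allele1 - AB" = c("A", "A", "B", "A"),
  "Allele2 - AB" = c("A", "B", "B", "B"),
  "GC Score"     = c(0.95, 0.90, 0.88, 0.40),
  check.names = FALSE, stringsAsFactors = FALSE
)

snps    <- unique(report[["SNP Name"]])
samples <- unique(report[["Sample ID"]])

# Empty SNP-by-sample matrices, one per data column of interest.
call_matrix <- matrix(NA_character_, length(snps), length(samples),
                      dimnames = list(snps, samples))
call_prob   <- matrix(NA_real_, length(snps), length(samples),
                      dimnames = list(snps, samples))

# A two-column character matrix (row name, column name) addresses
# exactly one cell per report line.
idx <- cbind(report[["SNP Name"]], report[["Sample ID"]])

# The two allele columns are pasted together to form the genotype call.
call_matrix[idx] <- paste0(report[["Allele1 - AB"]], report[["Allele2 - AB"]])
call_prob[idx]   <- report[["GC Score"]]

print(call_matrix)
print(call_prob)
```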

Or convert it to PLINK, then handle it with GenABEL, etc.: Converting illumina raw genotype data into PLINK PED format
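For orientation, a minimal base R sketch of what writing PED and MAP files from a call matrix could look like. The toy `calls` matrix is invented, family, sex and phenotype fields are filled with missing-value codes, and the chromosome and position in the MAP file are placeholders that would normally come from the array annotation.

```r
# Made-up SNP-by-sample genotype calls, two characters per call; NA = no call.
calls <- matrix(c("AA", "AB",
                  "BB", NA),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("rs001", "rs002"), c("S1", "S2")))

# Split each call into its two alleles; PLINK uses "0" for a missing allele.
a1 <- substr(calls, 1, 1)
a2 <- substr(calls, 2, 2)
a1[is.na(a1)] <- "0"
a2[is.na(a2)] <- "0"

# One row per sample, alleles interleaved per SNP: a1 a2 a1 a2 ...
geno <- t(sapply(colnames(calls),
                 function(s) as.vector(rbind(a1[, s], a2[, s]))))

# PED: family ID, individual ID, father, mother, sex, phenotype, genotypes.
# 0 and -9 are PLINK's codes for unknown parents/sex and missing phenotype.
ped <- data.frame(FID = colnames(calls), IID = colnames(calls),
                  PID = 0, MID = 0, SEX = 0, PHENO = -9,
                  geno, stringsAsFactors = FALSE)

# MAP: chromosome, SNP ID, genetic distance, base-pair position.
# Chromosome and position are placeholders here (assumption).
map <- data.frame(chr = 0, snp = rownames(calls), cm = 0, bp = 0)

ped_file <- tempfile(fileext = ".ped")
map_file <- tempfile(fileext = ".map")
write.table(ped, ped_file, quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(map, map_file, quote = FALSE, row.names = FALSE, col.names = FALSE)
```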

14.3 years ago
User 59 13k

crlmm?


This works with the data that was created by scan studio, not GenomeStudio.


Didn't know that Jan, cheers.

14.1 years ago
Abc ▴ 10

The Bonsai report plug-in allows GenomeStudio to export data directly as Rdata suitable for the Bioconductor package snpMatrix. There are other goodies on the SourceForge web site as well: http://outmodedbonsai.sourceforge.net/. The author is apparently working on CNV analysis lately, and has managed to run GenomeStudio on Linux. I don't know how that is done, though.
