Illumina HumanHT-12 V4.0 expression beadchip
3
3
Entering edit mode
4.7 years ago
zelda ▴ 50

I'm quite new to the bioinformatics world and just started working with the GEO data set GSE119600. My intention is to extract deferentially expressed genes and will start my work using the Lumi package to import ,read and normalize the raw data into R. Then I would use LIMMA package to generate the deferentially expressed genes. My questions are:

1- what file is suitable to import the raw data from? As there are 2 types of files : RAW.tar and non-normalized.txt.gz. 2- what is the general workflow for the lumi pakcage?

Thanks!

R lumi LIMMA • 7.8k views
ADD COMMENT
0
Entering edit mode

hi zelda, i'm having the same problem with raw data from illumina beadchip in GEO. How you solved your problem ?

ADD REPLY
0
Entering edit mode

Hi Zelda, i'm having the same problem... How do you proceed ?

ADD REPLY
0
Entering edit mode

Please elaborate on the problem. Please show what you have already tried, and share any warning and / or error messages that have appeared.

ADD REPLY
0
Entering edit mode

I''m quite new in bioinformatics world and started working with GEO dataset GSE42023. Our goals is to extract deferentially expressed genes . My question is:

The gse42023 provides me 2 types of files : RAW.tar and non-normalized.txt.gz. I know that : i need normalize this data to proceed with dea analisis, but i am really confused with this data.

Specially in non-normalized.txt , i have the genes (rows) , and the samples and p-value detection(collumns), in this case, how to proceed ?

Thanks! Example of non_normalized.txt

ID_REF YTY 82T Detection Pval YTY 84T Detection Pval YTY 88T Detection Pval YTY 78T ILMN_2104295 11777.34 0 15654.76 0 16447.22 0 18152.73 ILMN_1804851 47.92878 0.09337349 64776 0.01506024 50.35888 0.06626506 87.81049 ILMN_2412624 779.1193 0 994.2435 0 752119 0 1202821 ILMN_2402629 66.27144 0.009036144 63.89339 0.01957831 46.61563 0.1325301 62.49742
ADD REPLY
0
Entering edit mode

Unfortunately, you have the same problem as many people.

For some reason, for the Illumina microarray studies, GEO requires that authors upload data in this non-standard format. I am not sure that you can use the standard Bioconductor package, lumi, for this. Instead, you may have to process this manually, and I provide an advanced workflow here: A: illumina Arrays Illumina HumanHT-12 V3.0 expression beadchip reading data

ADD REPLY
0
Entering edit mode

Thank's for your response Kevin!!! I am following your workflow. But i have a question: How normalize this data ? Now i have a matrix with ID ref and p-value detection for all samples. Please forgive me for so many questions.

ADD REPLY
0
Entering edit mode

Hey, you need to remove the detection p-value columns. I explain this in the other post (but I do not provide the code):

"You should then extract out the Detection PVal columns and save them for later, and also set the rownames of the object to be equal to ID_REF. The final x object should be just the expression levels, and it should be a data-matrix."

The normalisation is then performed with the neqc() function, also mentioned in the other post.

Sorry, this is a very non-standard workflow.

ADD REPLY
0
Entering edit mode

So, i normalize this data: ( ID_REF and the counts of genes)

ID_REF                  YTY 82T YTY 84T YTY 88T YTY 78T YTO 68T YTO 92T YTO 70T YTY 94T
ILMN_2104295    11777.34    15654.76    16447.22    18152.73    17644.91    12811.94    14148.09    28912.88
ILMN_1804851    47.92878    64776   50.35888    87.81049    81.91159    81.92358    54.1439 157.5312
ILMN_2412624    779.1193    994.2435    752119  1202821 1129763 951.2262    1179794 3904947
ILMN_2402629    66.27144    63.89339    46.61563    62.49742    65.48244    68.64259    86.67648    147.1232
ILMN_1713732    132.8008    251.0701    184.7917    406.4644    465953  252.8271    206557  610.4857

Or This data: (ID_REF and just the p-value detection)?

ID_REF                 YTY 82T         YTY 84T         YTY88T            YTY 78T     YTO 68T                YTO 92T
ILMN_2104295    0                    0                              0                 0           0                           0
ILMN_1804851    0.09337349  0.01506024  0.06626506  0.03313253  0.01054217  0.007530121
ILMN_2412624    0   0   0   0   0   0
ILMN_2402629    0.009036144 0.01957831  0.1325301   0.1656626   0.05572289  0.02259036
ILMN_1713732    0.000156345 2.80E-06    0.000959368 3.32E-06    2.33E-06    2.05E-05
ILMN_1697220    0.000163673 0.01506584  0.008168896 0.009110582 0.001960087 0.02678343
ILMN_1807610    0   0.006024096 0.001506024 0.004518072 0.003012048 0.02108434
ILMN_2408039    0.07981928  0.1536145   0.2003012   0.128012    0.2289157   0.2153614
ILMN_1802167    0.009036144 0.004518072 0.009036144 0.05421687  0.01054217  0
ADD REPLY
1
Entering edit mode

You need to set ID_REF as rownames, and then remove that first column (ID_REF) from the data (in both cases).

You later use the detection p-values in the following section in my other post: 'filter out control probes, those with no symbol, and those that failed' (these detection p-values will be contained in the object detectionpvalues)

The columns of both objects also should be aligned perfectly.

ADD REPLY
2
Entering edit mode
13 months ago
Winston ▴ 20

I know this is (very) late but I should point out that the manifest file (that is, the annotation list stored in the GSEnnnnnn_RAW.tar) was not annotated anywhere nearly as clearly as one would expect. There is a supplemental table in the manifest file that identifies all control probe_IDs, and many were not indicated as "ILMN_Controls" in annot$source as Kevin had observed.

If you load the manifest file into either RStudio or MS Excel, GPL10558_HumanHT-12_V4_0_R1_15002873_B.txt into either RStudio or MS Excel and scroll to the very bottom of the data frame (just above the column label legend), you can find a complete list of all possible control probes and the corresponding probe IDs. It's incredibly easy to miss but there is a bracketed header labeled as ' [Controls] ' that identifies the list, located at row 47242. I used MS Excel this time because I'm just too sleep deprived to reformat the file to accommodate RStudio's phobia of duplicate row names right now.

Tl;dr the manifest file contains control probe_IDs. you still may be able to query your dataset's probe_ID list and filter accordingly prior to normalization.

ADD COMMENT
1
Entering edit mode
4.7 years ago

Hey,

Like many other Illumina studies on GEO, this study (GSE119600) does not contain the original raw data IDAT files, which is unfortunate.

GSE119600_RAW.tar just contains the probe annotations for GPL10558 Human HT-12 v4 R1 and R2 (2 separate files). GSE119600_non-normalized.txt.gz just contains a single table of expression values. i am not sure that these are suitable for lumi, but you are welcome to try.

As a beginner, the easiest option for you is to use GEO2R to obtain the expression data as an ExpressionSet object. From the main accession page ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119600 ), click on the Analyze with GEO2R button.

Then, click on the R script tab, which will generate R code that you can use to obtain the data.

--------------------

Alternatively, you can adapt a previous answer of mine such that you can still use this raw data with limma:

In this case, the annot object should be one of those annotation files in the GSE119600_RAW.tar file. The targetinfo object you will have to create yourself.

Kevin

ADD COMMENT
1
Entering edit mode

Thank you Kevin! It is much appreciated.

ADD REPLY
0
Entering edit mode

Hi Kevin i had the same problem with GSE140830_raw_data.txt.gz. i created targetinfo and annot matrices and followed your workflow. but i getting an error, when i try to normalize by neqc():

Error in if (alpha <= 0) stop("alpha must be positive") : missing value where TRUE/FALSE needed

It should be great if you can give me any suggestions to fix this error.

Thanks

ADD REPLY
0
Entering edit mode

Hello, you will have to show all of your code so that I can test it. Also, if you are referring to my code here, A: illumina Arrays Illumina HumanHT-12 V3.0 expression beadchip reading data, then you should be aware that this is for Advanced users (apologies for assuming that you may not be advanced).

If possible, try to retrieve the data via GEO2R ( see blue button on the GEO main accession page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140830 )

ADD REPLY
0
Entering edit mode

Thank you Kevin i don't aimed to find DEGs from this dataset, i intend to extract a normalized expression matrix to test the preservation of the main modules of my wgcna. the supplementary file of dataset GSE140830 contains GSE140830_final_normalized_data.txt.gz and GSE140830_raw_data.txt.gz.
i'm following this code:

library(GEOquery)
filePaths = getGEOSuppFiles("GSE140830")
datFTD<-read.delim("GSE140830_raw_data.txt")
dim(datFTD)
#47231  2181
targetinfo<-datFTD[,1:13]
exp<-datFTD[,c(1,14:2181)]
exp<-exp[,order(names(exp))]
names(exp)<-toupper(names(exp))
exp<-exp[,-grep("BEAD",names(exp))]
dim(exp)
#47231  1085
names(exp)<-gsub("X","",names(exp))
rownames(exp)<-exp[,1]
exp<-exp[,-1]
annot<-cbind(targetinfo,exp)
dim(annot)
#47231  1097
Exdata<-exp[,grep("SIGNAL",names(exp))]
Pvdata<-exp[,grep("DETECT",names(exp))]
dim(Exdata)
#47231   542
dim(Pvdata)
#47231   542

project <- new('EListRaw')
project@.Data[[1]] <- 'illumina'
project@.Data[[2]] <- targetinfo
project@.Data[[3]] <- annot
project@.Data[[4]] <- Exdata
project@.Data[[5]] <- NULL
project$E <- Exdata
project$targets <- targetinfo
project$genes <- annot
project$other$Detection <- Pvdata

project.bgcorrect.norm <- neqc(project, offset = 16)

As you assumed, I'm not an advanced user, I am PhD candidate in neuroscience and i know bioinformatics as much as doing my thesis. I will be appreciated if you could help me.

ADD REPLY
0
Entering edit mode

Thanks, I have taken a look and could reproduce the error with this code:

library(GEOquery)
filePaths = getGEOSuppFiles("GSE140830")
datFTD<-read.delim(rownames(filePaths)[2])
dim(datFTD)

targetinfo<-datFTD[,1:13]
exp<-datFTD[,c(1,14:2181)]
exp<-exp[,-grep("BEAD_STDERR",names(exp))]
dim(exp)

names(exp)<-gsub("X","",names(exp))
rownames(exp)<-exp[,1]
exp<-exp[,-1]
annot<-cbind(targetinfo)
dim(annot)

Exdata<-exp[,grep("SIGNAL",names(exp))]
Pvdata<-exp[,grep("DETECT",names(exp))]
dim(Exdata)
dim(Pvdata)

project <- new('EListRaw')
  project@.Data[[1]] <- 'illumina'
  project@.Data[[2]] <- targetinfo
  project@.Data[[3]] <- annot
  project@.Data[[4]] <- Exdata
  project@.Data[[5]] <- NULL
  project$E <- Exdata
  project$targets <- targetinfo
  project$genes <- annot
  project$other$Detection <- Pvdata

project.bgcorrect.norm <- neqc(project, offset = 16)

Looking closer, it seems that the authors have not provided all control probes with the raw data; so, the normalisation cannot function. I can only see 1 control probe included in the data via table(project$genes#SOURCE). It seems, according to here https://support.bioconductor.org/p/42502/, that you can still normalise it without control probes via:

project.bgcorrect.norm <- backgroundCorrect(project, method = 'normexp')

----------------

I checked the data via GEO2R and the rownames are non-sensical characters - not sure what happened there.

Another option may be to load the already-normalised data that is also available on GEO: GSE140830_final_normalized_data.txt.gz This, at least, appears to be in a usable format, and de-necessitates all of the above complicated code. It can be inferred via the GEO record that this data is normalised as "quantile and VST normalized signal"

ADD REPLY
1
Entering edit mode

Thank you Kevin I appreciate it.

ADD REPLY
0
Entering edit mode

Hi, I have been following this thread. Like you, am new to such analysis. Since the data is from Illumina beadchip, it is difficult for mean to perform normalization and post processing. Can you help me with the R script that u finally used?

TIA Shriyansh

ADD REPLY
1
Entering edit mode
13 months ago
Gordon Smyth ★ 7.7k

The GSE119600 Illumina Beadchip data can be read and processed in a straightforward manner using limma package functions.

Read expression data and detection p-values:

> library(limma)
> x <- read.ilmn("GSE119600_non-normalized.txt.gz",probeid="ID_REF")
Reading file GSE119600_non-normalized.txt.gz ... ...

Background correct and quantile normalize using detection p-values. Note that limma does not require the control probe expression values because it is able to infer the mean and variance of the control probes from the detection p-values (functionality that was added to the neqc function in October 2010).

> y <- neqc(x)
Note: inferring mean and variance of negative control probe intensities from the detection p-values.

Parse the sample annotation out of the series matrix file.

> SampleInfo <- sampleInfoFromGEO("GSE119600_series_matrix.txt.gz")$SampleInfo
> Group <- SampleInfo[,"source_name_ch1"]
> table(Group)
Group
                       Control, adult                Crohn’s disease, adult 
                                   47                                    48 
               Crohn’s disease, child    Primary biliary cholangitis, adult 
                                   47                                    90 
Primary sclerosing cholangitis, adult             Ulcerative colitis, adult 
                                   45                                    45 
            Ulcerative colitis, child 
                                   48

The data is now ready for linear modeling.

ADD COMMENT
0
Entering edit mode

Very nice Gordon. Was this function a recent addition to limma or has it always been there? I had not previously seen it.

ADD REPLY
2
Entering edit mode

The sampleInfoFromGEO() function was new earlier this year in the Bioconductor 3.17 release. I wrote it to handle irregular annotation cases that cause other functions to fail.

ADD REPLY

Login before adding your answer.

Traffic: 1097 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6