Hello all
It's just a suggestion meant to help anyone new in the field like me.
Would it be possible of set up an open study group (like a huge, ongoing workshop), using Dr Istvan's platform, if he's agreeable, for reproducing experiments from research papers, with increasing levels difficulty and cover various sub-disciplines (such as expression, variation, structural, etc), and that would involve scripting tools like python, shell and R and the freely-available tools, like EMBOSS, phylip, meme, etc...
The difference from ROSALIND would be that the problems would not be algorithm-centered, but would foster application/ learning of current tools, packages from the tools I mentioned previously. Study groups could use pre-existing tutorials/ learning materials or benefit from courses from COURSERA to complement..
Well, that's just an idea..
Hope to get some comments, or redirection if such a platform already exists.
Thanks
I would be interested in this. In particular I am interested in reproducing a number of research articles based on public databases (A: Can pre-existing microarray datasets be re-used to make new discoveries/ assumptions?), because I find that to be such an interesting area...
edit: I just noticed that that was originally your question that I referred to :)
That's the kind of analysis I'd really like to learn about. It's kind of tough journey to go alone. We could generate research ideas along the way!!
Hi jbinv. Just to let you know. I've already extracted the gene names/ probe ids and uploaded them for annotation at AILUN. The head ache now is to get the mean values for the duplicated probe ids... Interesting, at least for me. I previously thought that the Geo Dataset probeset ids were unique... Where are you at?
I haven't gotten quite that far yet, still reading articles. Thinking of spending some time this weekend working on this specific project. AILUN looks interesting, but I'm not sure I completely understand its value...
It uses the probe ids/ gene names to give annotations in this format:
Probe ID Entrez Gene ID Gene Symbol Description
1000_at 5595 MAPK3 mitogen-activated protein kinase 3
I've extracted the ids/ names with awk. If you have any interesting alternative, please share. Thanks for the reply.
I'm not familiar enough with unix commands to know how to do this. I would do it by importing it into R/BioConductor, followed by extracting the appropriate column. Your method certainly sounds simpler though, would you mind sharing the command that you used?
Update on analysis. I've made several changes - So I'm editing this post itself to avoid confusion.
Preliminary data cleaning in terminal
#Extract data between beginning and end of dataset
grep -n _end GDS2642_full.soft
grep -n _begin GDS2642_full.soft
head -12754 GDS2642_full.soft|tail -12626 > temp_dataset
Run in the following order: script1, script2, script3 (see below)
The final plot and results will go to a folder "R.analysis"
Gives all differentially expressed genes (up & down-regulated mixed).
To do further analysis on cmap, we will need them as different files...but I'm not sure of the steps yet.
Note. my mistake. No need for annotation concatenation in the initial steps. Got to figure out all of these with R.. I can send the files, if the format doesn't show well in the textboxes. I've contacted the author, to clear some doubts/ misunderstandings I have before I go ahead with cmap.
Thanks! Could you share the python script that you used to get to that point as well?
The last column containing the annotation might be a bit complicating, but I think it should be straightforward enough to add that annotation after we've taken the means of such duplicates in R. I wrote the following script for the means (it's not very efficient, I admit, it's just a working solution):
Although, once we have the annotations, wouldn't it be better to work directly from the CEL files?
I am up for it. I guess you will get a huge response from many communities from different countries.
To anyone interested - Please let me know if anyone gave the microarray analysis a try? I'm at the annotation part and got to figure out how to get the mean values... May be we could make constructive comments or formulate questions to ask on the platform here. Thanks.
It seems that I am also at annotation part :) :) Did you able to know about getting Means values ? Can you please post first few lines of your annotation files as I am bit confused for concatenating actually
My approach is very simple till now:
grep -v ^[!\^#] GDS2642_full.soft | awk '/ /{print $1}' > check.txt
Then I use this file for ailun and got this result:
http://ailun.stanford.edu/upload/91_343580876.txt
and one another similar file too
let me know if I am on the right path actually !!
By the way, I did't get any duplicate probe_id here from annotated one with ailun:
awk '/ /{print $1}' 8300_499418979.txt | uniq -d
Its returning no values hence no duplicates also
I redid the mean/ annotation steps. In the paper the author mentions doing the annotation before computing the means. I did this at first, but I believe it would be easier to get the unique probe id list first, before annotating. I did not concatenatete annotations to the expression values - I'm trying Tmev (http://www.tm4.org/mev_manual/sam.html) to do the SAM analysis. Seems the annotation file can be loaded separately there, but somehow I'm not finding options to see where they're found..unless it's a format problem.
Your certainly are a great awk ninja. :-). But, I believe you'll get redundant rows of gene annotations, like I did before, unless it's unimportant. Have you been able to get the means? That's what I got, using my list of unique gene names at AILUN:
Probe ID Entrez Gene ID Gene Symbol Description
1000_at 5595 MAPK3 mitogen-activated protein kinase 3
1001_at 7075 TIE1 tyrosine kinase with immunoglobulin-like and EGF-like domains 1
1002_f_at 1557 CYP2C19 cytochrome P450, family 2, subfamily C, polypeptide 19
1003_s_at 643 CXCR5 chemokine (C-X-C motif) receptor 5
Thanks for your contribution.
I didn't calcuate mean yet !! Will work on this week !!! Was busy today for some other's stuffs !! I shall keep you posted .
The probe ids are unique. It's the genes that are duplicated (some up to 7 times). Have a look at rows 4 and 5 for e.g.
Guys, I contacted Dr Istvan and we might get a new post type for the study group, but it will only happen if lots of people are interested with the idea and are actively participating. We are currently working on this paper focussed on mining public microarray data: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3479650/pdf/nihms408530.pdf
jobinv and I have already tried and posted a few scripts (R, python and bash).
Any ideas are welcome.
Thanks
script1.py
#!/usr/bin/python
import os
''' Removing extraneous annotations, to keep only probe id, gene name and expresion values '''
os.chdir('/home/olivier/bioinformatics/microarray_tests/Crohns-disease-and-ulcerative-colitis/')
script2.py
#Notes:
#The different controls have same names but differing probe ids, but have been grouped together.
#One probe id (first instance) as been retained for duplicate gene entries.
script3.R
load script in R
This the first R script I've written to completion. Grateful if anyone could comment.
Note that I'm not very sure of how the author compared the samples. So I compared all combinations (affected, normal, unaffected) using "2-class unpaired SAM". Nevertheless, I got the gene "IFI30", but not "TRPV1" by fiddling with the delta parameter...
So if anyone has an idea, you're welcome to comment on that too.
setwd('/home/olivier/bioinformatics/microarray_tests/Crohns-disease-and-ulcerative-colitis/')
df <- read.table('cleaned_dataset_with_means',sep='\t',header=TRUE)
rownames(df) <- df[,2] #Specify gene names as row names
df <- df[,3:ncol(df)] #getting rid of probe id and gene name columns
healthy <- c('GSM155564','GSM155565','GSM155566','GSM155567') #specimen
unaffected <- c('GSM155575','GSM155576','GSM155577','GSM155578','GSM155579','GSM155580', 'GSM155581','GSM155582','GSM155583','GSM155584','GSM155585','GSM155586', 'GSM155592','GSM155593','GSM155594','GSM155595','GSM155599') #specimen
affected <- c('GSM155568','GSM155569','GSM155570','GSM155571','GSM155572','GSM155573', 'GSM155574','GSM155587','GSM155588','GSM155589','GSM155590','GSM155591', 'GSM155596','GSM155597','GSM155598') #specimen
library('siggenes')
#Function which selects from healthy, affected or unnaffected states. Returns value of SAM analysis.
#cl is a required sam argument defining column names as a vector
label.and.sam.analysis <- function(x,y) { x <- df[,x] #selecting first argument as columns from df y <- df[,y] x.vs.y <- cbind(x,y) #Place the sample columns in a new dataframe cl.x.vs.y <- c(rep(0, ncol(x)), rep(1, ncol(y))) #converting columns to 0's and 1's in a vector sam.x.vs.y <- sam(x.vs.y, cl.x.vs.y) return(sam.x.vs.y) }
#SAM test to retrieve significanlty differing gene expression data
#Each of the 3 variable names can be plotted, separately to view delta vs FDR & Delta vs sig genes
sam1 <- label.and.sam.analysis(healthy, affected)
sam2 <- label.and.sam.analysis(healthy, unaffected)
sam3 <- label.and.sam.analysis(affected, unaffected)
#Drawing SAM plots to visualize up/ down-regulated genes. Check "./R.analysis/R.analysis.jpg" in current directory
jpeg(filename='./R.analysis/R.analysis2.jpg',width = 1200, height = 800, quality=100, pointsize=17)
par(mfrow=c(2,2)) #All 3 plots go to 1 page
plot(sam1,2.3, main='sam.healthy.vs.affected')
plot(sam2,2.1, main='sam.healthy.vs.unaffected')
plot(sam3,2.2, main='sam.affected.vs.unaffected')
dev.off()
#Shows selected genes and associated statistics for each gene
#The second arguments are the delta values I chose. I hope they're good
s1 <- summary(sam1, 2.3)
s2 <- summary(sam2, 2.1)
s3 <- summary(sam2, 2.2)
#Writing genes (up/down-regulated) to "./R.analysis/"
write.table(rownames(s1@mat.sig),'./R.analysis/sig.genes.sam.healthy.vs.affected.delta2.3',quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(rownames(s2@mat.sig),'./R.analysis/sig.genes.sam.healthy.vs.unaffected.delta2.1',quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(rownames(s3@mat.sig),'./R.analysis/sig.genes.sam.affected.vs.unaffected.delta2.2',quote=FALSE,row.names=FALSE,col.names=FALSE)
Not finished.. Work in progress.
Interesting information:
One of the genes reported by the author and tested by qPCR (TRPV1) is not found in the original dataset (GDS2642).
Here's an insightful comment there: How to retrieve a gene if its name has changed in a GDS?
Connectivity Map: Dealing with incompatible affy ids..
C: Connectivity Map - unable to load probe ids
Hey guys
I need an idea here. I've created a google group, but I fear this way will prevent other biostars members from viewing the work, right? Any ideas as to how we can make that possible? Please send me your email address (private msg) if anyone's interested so I can add you to the group.
Regards.
I have no experience with Google Groups, but isn't it possible to make it into an open group that one can either join directly, or request membership to?