Forum:Bioinformatics Study Group
4
17
Entering edit mode
11.4 years ago
Olivier ▴ 440

Hello all

It's just a suggestion meant to help anyone new in the field like me.

Would it be possible of set up an open study group (like a huge, ongoing workshop), using Dr Istvan's platform, if he's agreeable, for reproducing experiments from research papers, with increasing levels difficulty and cover various sub-disciplines (such as expression, variation, structural, etc), and that would involve scripting tools like python, shell and R and the freely-available tools, like EMBOSS, phylip, meme, etc...

The difference from ROSALIND would be that the problems would not be algorithm-centered, but would foster application/ learning of current tools, packages from the tools I mentioned previously. Study groups could use pre-existing tutorials/ learning materials or benefit from courses from COURSERA to complement..

Well, that's just an idea..

Hope to get some comments, or redirection if such a platform already exists.

Thanks

study-group • 6.5k views
ADD COMMENT
1
Entering edit mode

I would be interested in this. In particular I am interested in reproducing a number of research articles based on public databases (A: Can pre-existing microarray datasets be re-used to make new discoveries/ assumptions?), because I find that to be such an interesting area...

edit: I just noticed that that was originally your question that I referred to :)

ADD REPLY
0
Entering edit mode

That's the kind of analysis I'd really like to learn about. It's kind of tough journey to go alone. We could generate research ideas along the way!!

ADD REPLY
0
Entering edit mode

Hi jbinv. Just to let you know. I've already extracted the gene names/ probe ids and uploaded them for annotation at AILUN. The head ache now is to get the mean values for the duplicated probe ids... Interesting, at least for me. I previously thought that the Geo Dataset probeset ids were unique... Where are you at?

ADD REPLY
0
Entering edit mode

I haven't gotten quite that far yet, still reading articles. Thinking of spending some time this weekend working on this specific project. AILUN looks interesting, but I'm not sure I completely understand its value...

ADD REPLY
0
Entering edit mode

It uses the probe ids/ gene names to give annotations in this format:

Probe ID Entrez Gene ID Gene Symbol Description
1000_at 5595 MAPK3 mitogen-activated protein kinase 3

I've extracted the ids/ names with awk. If you have any interesting alternative, please share. Thanks for the reply.

ADD REPLY
0
Entering edit mode

I'm not familiar enough with unix commands to know how to do this. I would do it by importing it into R/BioConductor, followed by extracting the appropriate column. Your method certainly sounds simpler though, would you mind sharing the command that you used?

ADD REPLY
1
Entering edit mode

Update on analysis. I've made several changes - So I'm editing this post itself to avoid confusion.
Preliminary data cleaning in terminal
#Extract data between beginning and end of dataset
grep -n _end GDS2642_full.soft
grep -n _begin GDS2642_full.soft
head -12754 GDS2642_full.soft|tail -12626 > temp_dataset

Run in the following order: script1, script2, script3 (see below)
The final plot and results will go to a folder "R.analysis"
Gives all differentially expressed genes (up & down-regulated mixed).
To do further analysis on cmap, we will need them as different files...but I'm not sure of the steps yet.
Note. my mistake. No need for annotation concatenation in the initial steps. Got to figure out all of these with R.. I can send the files, if the format doesn't show well in the textboxes. I've contacted the author, to clear some doubts/ misunderstandings I have before I go ahead with cmap.

ADD REPLY
1
Entering edit mode

Thanks! Could you share the python script that you used to get to that point as well?

The last column containing the annotation might be a bit complicating, but I think it should be straightforward enough to add that annotation after we've taken the means of such duplicates in R. I wrote the following script for the means (it's not very efficient, I admit, it's just a working solution):

# Import data
dataset.raw <- read.table("testdata",sep="\t")

# Naming first two columns (should ideally name the other ones as well)
names(dataset.raw)[1:2] <- c("probeid","geneid")

# Finding the names of the unique genes
genes <- unique(dataset.raw$geneid)

# We want to make a new empty dataset containing one row for each individual
# gene name, and one column for each sample. just individual gene names. However,
# to make it generic for later use, I'll choose not to make assumptions about the
# number of columns in our dataset
n.samples <- dim(dataset.raw)[2]-2
n.genes <- length(genes)

dataset <- matrix(nrow=n.genes,ncol=n.samples+1)
dataset <- as.data.frame(dataset)
names(dataset) <- names(dataset.raw)[-1]

# Next, we fill in the gene names in the first column
dataset$geneid <- genes

# Finally, we fill in the expression values, taking into account that there could
# be duplicates, and taking the means of these
for (sample in 1:n.samples) {
    for (gene in genes) {
            dataset[dataset$geneid==gene,sample+1] <- mean(dataset.raw[dataset.raw$geneid==gene,sample+2])
    }
}

Although, once we have the annotations, wouldn't it be better to work directly from the CEL files?

ADD REPLY
1
Entering edit mode

I am up for it. I guess you will get a huge response from many communities from different countries.

ADD REPLY
0
Entering edit mode

To anyone interested - Please let me know if anyone gave the microarray analysis a try? I'm at the annotation part and got to figure out how to get the mean values... May be we could make constructive comments or formulate questions to ask on the platform here. Thanks.

ADD REPLY
0
Entering edit mode

It seems that I am also at annotation part :) :) Did you able to know about getting Means values ? Can you please post first few lines of your annotation files as I am bit confused for concatenating actually

My approach is very simple till now:

grep -v ^[!\^#] GDS2642_full.soft | awk '/ /{print $1}' > check.txt

Then I use this file for ailun and got this result:

http://ailun.stanford.edu/upload/91_343580876.txt

and one another similar file too

let me know if I am on the right path actually !!

By the way, I did't get any duplicate probe_id here from annotated one with ailun:

awk '/ /{print $1}' 8300_499418979.txt | uniq -d

Its returning no values hence no duplicates also

ADD REPLY
0
Entering edit mode

I redid the mean/ annotation steps. In the paper the author mentions doing the annotation before computing the means. I did this at first, but I believe it would be easier to get the unique probe id list first, before annotating. I did not concatenatete annotations to the expression values - I'm trying Tmev (http://www.tm4.org/mev_manual/sam.html) to do the SAM analysis. Seems the annotation file can be loaded separately there, but somehow I'm not finding options to see where they're found..unless it's a format problem.
Your certainly are a great awk ninja. :-). But, I believe you'll get redundant rows of gene annotations, like I did before, unless it's unimportant. Have you been able to get the means? That's what I got, using my list of unique gene names at AILUN:
Probe ID Entrez Gene ID Gene Symbol Description
1000_at 5595 MAPK3 mitogen-activated protein kinase 3
1001_at 7075 TIE1 tyrosine kinase with immunoglobulin-like and EGF-like domains 1
1002_f_at 1557 CYP2C19 cytochrome P450, family 2, subfamily C, polypeptide 19
1003_s_at 643 CXCR5 chemokine (C-X-C motif) receptor 5

Thanks for your contribution.

ADD REPLY
0
Entering edit mode

I didn't calcuate mean yet !! Will work on this week !!! Was busy today for some other's stuffs !! I shall keep you posted .

ADD REPLY
0
Entering edit mode

The probe ids are unique. It's the genes that are duplicated (some up to 7 times). Have a look at rows 4 and 5 for e.g.

ADD REPLY
0
Entering edit mode

Guys, I contacted Dr Istvan and we might get a new post type for the study group, but it will only happen if lots of people are interested with the idea and are actively participating. We are currently working on this paper focussed on mining public microarray data: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3479650/pdf/nihms408530.pdf
jobinv and I have already tried and posted a few scripts (R, python and bash).
Any ideas are welcome.
Thanks

ADD REPLY
0
Entering edit mode

script1.py
#!/usr/bin/python
import os

''' Removing extraneous annotations, to keep only probe id, gene name and expresion values '''

os.chdir('/home/olivier/bioinformatics/microarray_tests/Crohns-disease-and-ulcerative-colitis/')

`#Remove extraneous annotations from the temp_dataset generated by awk, write cleaned_dataset_with_duplicate_gene_names`

`with open('temp_dataset','r') as temp_dataset_handle:  
    records = temp_dataset_handle.readlines()  
    records = [i.split('\t') for i in records] `  

`#Selecting columns of gene names and expresion values only, from each record`  
`def pid_gene_and_value_selection(x):   
    return '\t'.join(x[:38])`  

`cleaned_records = '\n'.join(map(pid_gene_and_value_selection, records)) #Add a newline after each record parsed`  

`with open('GDS_cleaned','w') as GDS_cleaned_handle: 
    GDS_cleaned_handle.write(cleaned_records)`  

`#Creates file 'GDS_cleaned`
ADD REPLY
0
Entering edit mode

script2.py

`#!/usr/bin/python`  

`import os`  
`import numpy as np`  

'`''Get the means for duplicate genes from GDS_cleaned'''`  

`os.chdir('/home/olivier/bioinformatics/microarray_tests/Crohns-disease-and-ulcerative-colitis')`

`with open('GDS_cleaned', 'r') as gds_cleaned_handle:  
    header = gds_cleaned_handle.readline()  
    records = gds_cleaned_handle.readlines()`  

`#Parse records elements, comprising probe_id, gene name and expression values into numpy arrays`  
`parsed_record_elements = np.array([elem.rstrip('\n').split('\t') for elem in records]) #Create an array for each record, i.e. probe id, gene name, expression values`  

`#Extract list of all genes`  
`def extract_gene(rec):  
    return rec[1]`  
`gene_list =map(extract_gene, parsed_record_elements)`  

`#Create list of unique genes`  
`unique_genes = list(set(gene_list))`  

`if os.path.exists('cleaned_dataset_with_means'):  
    os.remove('cleaned_dataset_with_means')`

`#Testing whether to take means or not and writing output`  
`with open('cleaned_dataset_with_means','a') as output:  
    output.write(header)  
    for gene in unique_genes:  
        if gene_list.count(gene) == 1:   
            rec = parsed_record_elements[parsed_record_elements[:,1] == gene]#select rows (from gene name to expression values) where column==unique gene.  
            output.write('\t'.join(rec[0]) + '\n') #write down record`  

`        else:  
            temp_rec = parsed_record_elements[parsed_record_elements[:,1] == gene] #select record from records where gene is duplicated. Append to temp list  
            expr_vals = [map(float,i) for i in temp_rec[:,2:]] #Converting expression values to floats, so means can be taken  
            probe_id_and_name = temp_rec[:,:2][0].tolist() #temp_rec[:,:2][0] selects first 2 elements (probe id, gene name) from all rows (records)   
            mean_expr_vals = np.mean(expr_vals, axis=0) #carry out column means for duplicated genes  
            new_record = '\t'.join(map(str,probe_id_and_name) + map(str,mean_expr_vals.tolist()))+'\n' #Prepare/format record with new values  
            output.write(new_record)`  

`#removing temp file`  
`os.remove('GDS_cleaned')`  
`os.remove('temp_dataset')`

#Notes:
#The different controls have same names but differing probe ids, but have been grouped together.
#One probe id (first instance) as been retained for duplicate gene entries.

ADD REPLY
0
Entering edit mode

script3.R
load script in R
This the first R script I've written to completion. Grateful if anyone could comment.
Note that I'm not very sure of how the author compared the samples. So I compared all combinations (affected, normal, unaffected) using "2-class unpaired SAM". Nevertheless, I got the gene "IFI30", but not "TRPV1" by fiddling with the delta parameter...
So if anyone has an idea, you're welcome to comment on that too.

setwd('/home/olivier/bioinformatics/microarray_tests/Crohns-disease-and-ulcerative-colitis/')
df <- read.table('cleaned_dataset_with_means',sep='\t',header=TRUE)

rownames(df) <- df[,2] #Specify gene names as row names
df <- df[,3:ncol(df)] #getting rid of probe id and gene name columns

healthy <- c('GSM155564','GSM155565','GSM155566','GSM155567') #specimen

unaffected <- c('GSM155575','GSM155576','GSM155577','GSM155578','GSM155579','GSM155580', 'GSM155581','GSM155582','GSM155583','GSM155584','GSM155585','GSM155586', 'GSM155592','GSM155593','GSM155594','GSM155595','GSM155599') #specimen

affected <- c('GSM155568','GSM155569','GSM155570','GSM155571','GSM155572','GSM155573', 'GSM155574','GSM155587','GSM155588','GSM155589','GSM155590','GSM155591', 'GSM155596','GSM155597','GSM155598') #specimen

library('siggenes')
#Function which selects from healthy, affected or unnaffected states. Returns value of SAM analysis.
#cl is a required sam argument defining column names as a vector

label.and.sam.analysis <- function(x,y) { x <- df[,x] #selecting first argument as columns from df y <- df[,y] x.vs.y <- cbind(x,y) #Place the sample columns in a new dataframe cl.x.vs.y <- c(rep(0, ncol(x)), rep(1, ncol(y))) #converting columns to 0's and 1's in a vector sam.x.vs.y <- sam(x.vs.y, cl.x.vs.y) return(sam.x.vs.y) }

#SAM test to retrieve significanlty differing gene expression data
#Each of the 3 variable names can be plotted, separately to view delta vs FDR & Delta vs sig genes
sam1 <- label.and.sam.analysis(healthy, affected)
sam2 <- label.and.sam.analysis(healthy, unaffected)
sam3 <- label.and.sam.analysis(affected, unaffected)

#Drawing SAM plots to visualize up/ down-regulated genes. Check "./R.analysis/R.analysis.jpg" in current directory
jpeg(filename='./R.analysis/R.analysis2.jpg',width = 1200, height = 800, quality=100, pointsize=17)
par(mfrow=c(2,2)) #All 3 plots go to 1 page
plot(sam1,2.3, main='sam.healthy.vs.affected')
plot(sam2,2.1, main='sam.healthy.vs.unaffected')
plot(sam3,2.2, main='sam.affected.vs.unaffected')
dev.off()

#Shows selected genes and associated statistics for each gene
#The second arguments are the delta values I chose. I hope they're good
s1 <- summary(sam1, 2.3)
s2 <- summary(sam2, 2.1)
s3 <- summary(sam2, 2.2)

#Writing genes (up/down-regulated) to "./R.analysis/"
write.table(rownames(s1@mat.sig),'./R.analysis/sig.genes.sam.healthy.vs.affected.delta2.3',quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(rownames(s2@mat.sig),'./R.analysis/sig.genes.sam.healthy.vs.unaffected.delta2.1',quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(rownames(s3@mat.sig),'./R.analysis/sig.genes.sam.affected.vs.unaffected.delta2.2',quote=FALSE,row.names=FALSE,col.names=FALSE)

Not finished.. Work in progress.

ADD REPLY
0
Entering edit mode

Interesting information:
One of the genes reported by the author and tested by qPCR (TRPV1) is not found in the original dataset (GDS2642).
Here's an insightful comment there: How to retrieve a gene if its name has changed in a GDS?

ADD REPLY
0
Entering edit mode

Connectivity Map: Dealing with incompatible affy ids..
C: Connectivity Map - unable to load probe ids

ADD REPLY
0
Entering edit mode

Hey guys

I need an idea here. I've created a google group, but I fear this way will prevent other biostars members from viewing the work, right? Any ideas as to how we can make that possible? Please send me your email address (private msg) if anyone's interested so I can add you to the group.
Regards.

ADD REPLY
0
Entering edit mode

I have no experience with Google Groups, but isn't it possible to make it into an open group that one can either join directly, or request membership to?

ADD REPLY
1
Entering edit mode
11.4 years ago
vijay ★ 1.6k

Kudos for this idea!!! I would love to be a part of this forum as well. If it is going to be a study group then I feel we shall start a new Google group, which also makes file sharing easy. This would indeed be more beneficial for new entrants in the field.

Hope this is a good mode to keep ourselves updated with the happenings in the field!!

ADD COMMENT
0
Entering edit mode
11.4 years ago

Hi olivier,

I am into the group.

Its a good idea to start a group. why dont we try to start with googlegroups or something. :)

ADD COMMENT
0
Entering edit mode

I was thinking about doing something, from this platform itself as we would be able to get help along the way, if that is possible.

ADD REPLY
0
Entering edit mode
11.4 years ago
Olivier ▴ 440

Could we get a few research themes of interest to you guys (and may be your abilities), so that we can have a little idea of how to plan, if we get the chance to go ahead:
"Mining public microarray data" - I have some working bio/python, shell, and a little R knowledge.

I hope this type of discussion is allowed on the forum.

ADD COMMENT
1
Entering edit mode

My abilities: quite adept with R, basic python and shell knowledge. The inverse of you, in other words :)

ADD REPLY
0
Entering edit mode

To anyone interested: could we collect this info (theme and/or skill) up to the 15th of July so that we can get a rough idea of where to start. Thanks.

ADD REPLY
0
Entering edit mode

we can start with Microarray datasets from GEO

ADD REPLY
0
Entering edit mode

By all means, stay around, though the group is not yet formal. We'll try the paper proposed by jobinv. Refer to later comments. thanks

ADD REPLY
0
Entering edit mode
11.4 years ago
jobinv ★ 1.1k

I would love to ultimately be able to reproduce the kind of method used in the papers from Atul Butte's lab, for example this one: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3479650/pdf/nihms408530.pdf

It's a brilliant study really, where they collected gene expression profiles for the disease (using GEO), and gene expression profiles for different drugs (using something called Connectivity Map). Then they assumed that if a drug led to the inverse gene expression pattern to that of the disease, then it could potentially be used therapeutically. They were then able to validate this in various ways later. It's worth reading, and in my opinion, definitely worth attempting to reproduce.

Like said, while I would love to be able to reproduce this at some point, I can't be certain that it is the best starting point. It looks straightforward enough, but it's easy to underestimate such things :)

ADD COMMENT
0
Entering edit mode

I will be interested in such group

ADD REPLY
0
Entering edit mode

Let's start with that then. Connectivity map seems to be very interesting - sad it's limited to cultured human cells only. I'll give the paper a good reading for a week or so. We could gather the data/ learn about the tools meanwhile. (Note - Being a molecular biologist, I hope you understand my limited stats abilities, but will give it my best shot). If you're interested there's a very interesting course on computational molecular evolution, I'm following from coursera. The knowledge might come in handy sooner or later... I've contacted Dr Istvan about the possibility of a study group, and we might get an answer mid July.

ADD REPLY
1
Entering edit mode

I am also interested. Basically I am a Bioinformatician with good skills in Perl, Python, Shell Scripting and Java !!

ADD REPLY
0
Entering edit mode

I'm interested in joining the study group too if it gets going. Really nice idea, thanks!

By the way, it'd be good to use some online project management tool that works smoothly. Let's look around for that too

ADD REPLY
1
Entering edit mode

You're welcome and are absolutely right on the management part. The objective is to keep the posts accessible for other biostar members to have a look at, and Dr Istvan proposed we use the forum posts for the moment to see how well it goes, before deciding whether to allocate us a slot on the platform. As I've said, the success of the group will depend on the visibility of our work on this platform via the number of members and number of likes, and may be the themes chosen. Thanks.

ADD REPLY
0
Entering edit mode

I am interested in joining this study group. I have followed some coursera (among which that on computational molecular evolution), but I am interested in practical use of bioinformatic tools; I am using python, and R.

ADD REPLY
0
Entering edit mode

You're more than welcome :) Olivier, who started the group, has gotten a bit further than myself with the analysis. I'm not sure how far the other people who were interested have gotten. Have you picked up on the details about which paper we're currently trying to replicate?

ADD REPLY
0
Entering edit mode

I have started reading the paper "Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease" and trying to start reproducing it on my own, together with looking at what you've done so far.

ADD REPLY

Login before adding your answer.

Traffic: 1710 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6