Merging Datasets Together Using Excel
5
0
Entering edit mode
10.7 years ago
Vanceed ▴ 30

Hello all,

I am currently a PhD student and I am looking at datasets from GEO and trying to merge them together and starting to analyze overlapping genes. Does anyone have input onto how to do this?

microarray • 5.8k views
ADD COMMENT
12
Entering edit mode

enter image description here

ADD REPLY
1
Entering edit mode

You will have to show us some lines from the files and tell us what exactly you are trying to accomplish.

ADD REPLY
0
Entering edit mode

You need to provide far more detail about what, precisely, you want to do and the type of data involved.

ADD REPLY
1
Entering edit mode
10.7 years ago
Irsan ★ 7.8k

If you want to use excel for this, google for VLOOKUP function. If you are working in a unix-like environment, use join command. Have a look at this paper why you shouldn't use excel. Personally I think you can use excel for certain bio-informatics tasks but you have to be very careful with data types. Unfortunately the people that want to use excel are usually not so careful/aware of all the things that can go wrong

ADD COMMENT
0
Entering edit mode

I think pasting gene names as "text" in excel should take care of most of the problems. Sometimes, excel is handy if you do it carefully.

ADD REPLY
2
Entering edit mode

But as Irsan says, vast majority of users do not do it carefully. The point-and-click mentality does not encourage thoughtful reflection.

ADD REPLY
0
Entering edit mode
10.7 years ago
Biojl ★ 1.7k

I would export them as tsv or csv and run the command join from a linux/mac command line.

ADD COMMENT
0
Entering edit mode
10.7 years ago
fatstrat389 ▴ 30

I think it would be prudent for you to pick up MySQL. SQL is (kinda) like a spreadsheet but it enforces data types and may help you avoid the aforementioned pitfalls of excel.

ADD COMMENT
0
Entering edit mode
10.6 years ago
alaincoletta ▴ 170

InSilico DB has a "merging" R-Bioconductor package to combine public datasets from GEO and their clinical annotations. If you are not using R you can also combine data from the online platform (See this short step-by-step tutorial

Example:

# Retrieve 2 datasets
eset1 = getDataset(gse="GSE10072", gpl="GPL96", norm="ORIGINAL", genes=TRUE);
eset2 = getDataset(gse="GSE7670", gpl="GPL96", norm="ORIGINAL", genes=TRUE);

#combine them
esets = list(eset1, eset2);
eset = merge(esets, method="NONE");

#plot them
plotMDS(eset, targetAnnot="Disease", batchAnnot="Study");

InSilico DB packaged various batch removal effects methods so line 4 could be replaced with:

eset = merge(esets, method="XPN");

or

eset = merge(esets, method="COMBAT");

Hope this helps.

For more info Bioinformatics paper reference; InSilico DB and InSIlico Merging packages links, and blog link.

-Tutorial example: https://insilicodb.org/the-impact-of-batch-effects-when-merging-different-data-sets/

R-Bioconductor packages: http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoDb.html and http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoMerging.html

ADD COMMENT
0
Entering edit mode
10.6 years ago

Save your Excel sheet from each sample as a comma-separated file in a given folder within a "Main_Folder" folder. The contents of the file should be [Feature],[value] and folder name be the sample name. All csv file should share a common string as a name (e.g. *_data.csv) for us to grep them.

Then use my http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/collateResults.pl in the "Main_Folder" as

perl collateResults.pl -f . -p _data.csv

and it should merge the data together

So say you have

BEFORE:
Main_Folder/Folder_1/*_data.csv:
species_A,2
species_B,4
species_C,5

Main_Folder/Folder_2/*_data.csv:
species_A,3
species_D,5

AFTER:
Samples,Folder_1,Folder_2
species_A,2,3
species_B,4,0
species_C,5,0
species_D,0,5

Alternatively If you can produce records of this form: [Sample_Name]\t[Feature]\t[Value] then use my GENERATEtable.sh script

$ cat test.tsv
contig1 F1 12.2
contig1 F2 34.2
contig1 F3 45.2
contig2 F2 56.3
contig2 F3 56.2
contig3 F1 45.4
contig3 F2 56.3
contig4 F1 23.5
contig5 F1 24.5
$ cat GENERATEtable.sh
#!/bin/bash
less <&0| \
perl -ane '$r{$F[0].":".$F[1]}=$F[2];
  unless($F[0]~~@s){
   push @s,$F[0];}
  unless($F[1]~~@m){
   push @m,$F[1];}
END{
print "Contigs\t".join("\t",@s)."\n";
for($i=0;$i<@m;$i++){
  print $m[$i];
  for($j=0;$j<@s;$j++){
   (not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}
  print "\n";}}' 
$ cat test.tsv | ./GENERATEtable.sh
Contigs contig1 contig2 contig3 contig4 contig5
F1 12.2 0 45.4 23.5 24.5
F2 34.2 56.3 56.3 0 0
F3 45.2 56.2 0 0 0

Best Wishes,
Umer

ADD COMMENT

Login before adding your answer.

Traffic: 1750 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6