Question

Merging Datasets Together Using Excel

0

10.8 years ago

Vanceed ▴ 30

Hello all,

I am a PhD student working with datasets from GEO. I am trying to merge them and then analyze the overlapping genes. Does anyone have input on how to do this?

microarray • 5.8k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 10.8 years ago by Vanceed ▴ 30

12

[image reply; image content not preserved]

ADD REPLY • link 10.8 years ago by Pierre Lindenbaum 164k

1

You will have to show us some lines from the files and tell us what exactly you are trying to accomplish.

ADD REPLY • link 10.8 years ago by Ashutosh Pandey 12k

0

You need to provide far more detail about what, precisely, you want to do and the type of data involved.

ADD REPLY • link 10.8 years ago by Neilfws 49k

score 1 · Answer 1 · 2014-03-19

1

10.8 years ago

Irsan ★ 7.8k

If you want to use Excel for this, look up the VLOOKUP function. If you are working in a Unix-like environment, use the join command. Have a look at this paper on why you shouldn't use Excel. Personally, I think you can use Excel for certain bioinformatics tasks, but you have to be very careful with data types. Unfortunately, the people who want to use Excel are usually not so careful about, or aware of, all the things that can go wrong.
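As a rough illustration of what a VLOOKUP-style lookup does, here is a minimal command-line sketch using awk. The file names (expr.tsv, annot.tsv), the gene values and the "NA" placeholder are all made up for the example; this is just one way to do it, not a recommendation from the answer above.

```shell
#!/bin/sh
# Hypothetical example data: an expression table and an annotation table,
# both tab-separated with the gene symbol in column 1.
work=$(mktemp -d)
printf 'TP53\t2.1\nBRCA1\t0.7\nEGFR\t1.4\n' > "$work/expr.tsv"
printf 'EGFR\tchr7\nTP53\tchr17\n' > "$work/annot.tsv"

# First file: load the lookup table into memory.
# Second file: print each row plus the looked-up value ("NA" if missing).
looked_up=$(awk -F '\t' -v OFS='\t' '
  NR == FNR { annot[$1] = $2; next }
  { print $1, $2, ($1 in annot ? annot[$1] : "NA") }
' "$work/annot.tsv" "$work/expr.tsv")

printf '%s\n' "$looked_up"
rm -rf "$work"
```

Unlike join, this does not need sorted input, and every value stays plain text, so nothing gets silently converted to a date or a number.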

ADD COMMENT • link 10.8 years ago by Irsan ★ 7.8k

0

I think pasting gene names as "text" in Excel should take care of most of the problems. Sometimes Excel is handy, if you use it carefully.

ADD REPLY • link 10.8 years ago by Ashutosh Pandey 12k

2

But as Irsan says, the vast majority of users do not do it carefully. The point-and-click mentality does not encourage thoughtful reflection.

ADD REPLY • link 10.8 years ago by Neilfws 49k

score 0 · Answer 2 · 2014-03-19

0

10.8 years ago

Biojl ★ 1.7k

I would export them as TSV or CSV and run the join command from a Linux/Mac command line.
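A minimal sketch of that approach, assuming two exported tab-separated files with the gene symbol in the first column. The file names and values below are invented for illustration.

```shell
#!/bin/sh
# Hypothetical exports: gene<TAB>value, one file per dataset.
tab=$(printf '\t')
workdir=$(mktemp -d)
printf 'TP53\t2.1\nBRCA1\t0.7\nEGFR\t1.4\n' > "$workdir/dataset1.tsv"
printf 'EGFR\t3.3\nMYC\t0.9\nTP53\t1.8\n' > "$workdir/dataset2.tsv"

# join expects both inputs sorted on the join field (column 1 by default).
# LC_ALL=C keeps sort and join agreeing on the collation order.
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/dataset1.tsv" > "$workdir/d1.sorted.tsv"
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/dataset2.tsv" > "$workdir/d2.sorted.tsv"

# Inner join: keep only genes present in both files.
merged=$(LC_ALL=C join -t "$tab" "$workdir/d1.sorted.tsv" "$workdir/d2.sorted.tsv")

printf '%s\n' "$merged"
rm -rf "$workdir"
```

By default only the overlapping genes are printed (here EGFR and TP53). The -a option of join can be added to also keep unmatched lines from one or both files.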

ADD COMMENT • link 10.8 years ago by Biojl ★ 1.7k

score 0 · Answer 3 · 2014-03-19

0

10.8 years ago

fatstrat389 ▴ 30

I think it would be prudent for you to pick up MySQL. SQL is (kinda) like a spreadsheet, but it enforces data types and may help you avoid the aforementioned pitfalls of Excel.

ADD COMMENT • link 10.8 years ago by fatstrat389 ▴ 30

Ram · Answer 4 · 2014-04-22

InSilico DB has a "merging" R-Bioconductor package to combine public datasets from GEO and their clinical annotations. If you are not using R, you can also combine data from the online platform (see this short step-by-step tutorial).

Example:

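# Load the two Bioconductor packages linked below
library(inSilicoDb)
library(inSilicoMerging)

# Retrieve 2 datasets
eset1 = getDataset(gse="GSE10072", gpl="GPL96", norm="ORIGINAL", genes=TRUE)
eset2 = getDataset(gse="GSE7670", gpl="GPL96", norm="ORIGINAL", genes=TRUE)

# Combine them without batch-effect correction
esets = list(eset1, eset2)
eset = merge(esets, method="NONE")

# Plot them, colored by disease and marked by study of origin
plotMDS(eset, targetAnnot="Disease", batchAnnot="Study")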

InSilico DB packages various batch-effect removal methods, so the merge call above could be replaced with:

eset = merge(esets, method="XPN");

or

eset = merge(esets, method="COMBAT");

Hope this helps.

For more info, see the Bioinformatics paper reference, the InSilico DB and InSilico Merging package links, and the blog link.

Tutorial example: https://insilicodb.org/the-impact-of-batch-effects-when-merging-different-data-sets/

R-Bioconductor packages: http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoDb.html and http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoMerging.html

Ram · Answer 5 · 2014-04-22

Save the Excel sheet for each sample as a comma-separated file in its own folder inside a "Main_Folder" folder. Each file should contain lines of the form [Feature],[Value], and the folder name should be the sample name. All CSV files should share a common string in their names (e.g. *_data.csv) so that the script can find them.

Then run my collateResults.pl script (http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/collateResults.pl) from inside "Main_Folder":

perl collateResults.pl -f . -p _data.csv

It should merge the data together.

So say you have:

BEFORE:
Main_Folder/Folder_1/*_data.csv:
species_A,2
species_B,4
species_C,5

Main_Folder/Folder_2/*_data.csv:
species_A,3
species_D,5

AFTER:
Samples,Folder_1,Folder_2
species_A,2,3
species_B,4,0
species_C,5,0
species_D,0,5
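For readers who want to see the idea without downloading the script, here is a rough awk sketch that reproduces the BEFORE/AFTER example above. It is not collateResults.pl and makes no claim about how that script works internally; the file names s1_data.csv and s2_data.csv are made up, and the sketch assumes exactly one level of sample folders.

```shell
#!/bin/sh
# Recreate the example layout in a temporary "Main_Folder".
work=$(mktemp -d)
mkdir -p "$work/Folder_1" "$work/Folder_2"
printf 'species_A,2\nspecies_B,4\nspecies_C,5\n' > "$work/Folder_1/s1_data.csv"
printf 'species_A,3\nspecies_D,5\n' > "$work/Folder_2/s2_data.csv"

# Read every <sample>/*_data.csv file, remember samples and features in
# order of first appearance, then print a feature-by-sample table,
# filling missing combinations with 0.
collated=$(cd "$work" && awk -F, '
  {
    split(FILENAME, parts, "/"); sample = parts[1]   # folder name is the sample name
    if (!(sample in seen_s)) { seen_s[sample] = 1; samples[++ns] = sample }
    if (!($1 in seen_f))     { seen_f[$1] = 1; features[++nf] = $1 }
    value[$1, sample] = $2
  }
  END {
    line = "Samples"
    for (j = 1; j <= ns; j++) line = line "," samples[j]
    print line
    for (i = 1; i <= nf; i++) {
      line = features[i]
      for (j = 1; j <= ns; j++)
        line = line "," ((features[i], samples[j]) in value ? value[features[i], samples[j]] : 0)
      print line
    }
  }' */*_data.csv)

printf '%s\n' "$collated"
rm -rf "$work"
```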

Alternatively, if you can produce records of the form [Sample_Name]\t[Feature]\t[Value], use my GENERATEtable.sh script:

$ cat test.tsv
contig1 F1 12.2
contig1 F2 34.2
contig1 F3 45.2
contig2 F2 56.3
contig2 F3 56.2
contig3 F1 45.4
contig3 F2 56.3
contig4 F1 23.5
contig5 F1 24.5
$ cat GENERATEtable.sh
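#!/bin/bash
# Pivot sample/feature/value records read from stdin into a feature-by-sample table.
perl -ane '$r{$F[0].":".$F[1]}=$F[2];
  push @s,$F[0] unless $seen_s{$F[0]}++;   # samples (columns), in order of first appearance
  push @m,$F[1] unless $seen_m{$F[1]}++;   # features (rows), in order of first appearance
END{
print "Contigs\t".join("\t",@s)."\n";
for $f (@m){
  print $f;
  for $c (@s){
   # missing sample/feature combinations are printed as 0
   print "\t".(defined $r{$c.":".$f} ? $r{$c.":".$f} : 0);}
  print "\n";}}'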
$ cat test.tsv | ./GENERATEtable.sh
Contigs contig1 contig2 contig3 contig4 contig5
F1 12.2 0 45.4 23.5 24.5
F2 34.2 56.3 56.3 0 0
F3 45.2 56.2 0 0 0

Best Wishes,
Umer