Hello all,
I am currently a PhD student and I am looking at datasets from GEO and trying to merge them together and starting to analyze overlapping genes. Does anyone have input onto how to do this?
Hello all,
I am currently a PhD student and I am looking at datasets from GEO and trying to merge them together and starting to analyze overlapping genes. Does anyone have input onto how to do this?
If you want to use excel for this, google for VLOOKUP function. If you are working in a unix-like environment, use join command. Have a look at this paper why you shouldn't use excel. Personally I think you can use excel for certain bio-informatics tasks but you have to be very careful with data types. Unfortunately the people that want to use excel are usually not so careful/aware of all the things that can go wrong
I think it would be prudent for you to pick up MySQL. SQL is (kinda) like a spreadsheet but it enforces data types and may help you avoid the aforementioned pitfalls of excel.
InSilico DB has a "merging" R-Bioconductor package to combine public datasets from GEO and their clinical annotations. If you are not using R you can also combine data from the online platform (See this short step-by-step tutorial
Example:
# Retrieve 2 datasets
eset1 = getDataset(gse="GSE10072", gpl="GPL96", norm="ORIGINAL", genes=TRUE);
eset2 = getDataset(gse="GSE7670", gpl="GPL96", norm="ORIGINAL", genes=TRUE);
#combine them
esets = list(eset1, eset2);
eset = merge(esets, method="NONE");
#plot them
plotMDS(eset, targetAnnot="Disease", batchAnnot="Study");
InSilico DB packaged various batch removal effects methods so line 4 could be replaced with:
eset = merge(esets, method="XPN");
or
eset = merge(esets, method="COMBAT");
Hope this helps.
For more info Bioinformatics paper reference; InSilico DB and InSIlico Merging packages links, and blog link.
-Tutorial example: https://insilicodb.org/the-impact-of-batch-effects-when-merging-different-data-sets/
R-Bioconductor packages: http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoDb.html and http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoMerging.html
Save your Excel sheet from each sample as a comma-separated file in a given folder within a "Main_Folder" folder. The contents of the file should be [Feature],[value] and folder name be the sample name. All csv file should share a common string as a name (e.g. *_data.csv) for us to grep them.
Then use my http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/collateResults.pl in the "Main_Folder" as
perl collateResults.pl -f . -p _data.csv
and it should merge the data together
So say you have
BEFORE:
Main_Folder/Folder_1/*_data.csv:
species_A,2
species_B,4
species_C,5
Main_Folder/Folder_2/*_data.csv:
species_A,3
species_D,5
AFTER:
Samples,Folder_1,Folder_2
species_A,2,3
species_B,4,0
species_C,5,0
species_D,0,5
Alternatively If you can produce records of this form: [Sample_Name]\t[Feature]\t[Value]
then use my GENERATEtable.sh script
$ cat test.tsv
contig1 F1 12.2
contig1 F2 34.2
contig1 F3 45.2
contig2 F2 56.3
contig2 F3 56.2
contig3 F1 45.4
contig3 F2 56.3
contig4 F1 23.5
contig5 F1 24.5
$ cat GENERATEtable.sh
#!/bin/bash
less <&0| \
perl -ane '$r{$F[0].":".$F[1]}=$F[2];
unless($F[0]~~@s){
push @s,$F[0];}
unless($F[1]~~@m){
push @m,$F[1];}
END{
print "Contigs\t".join("\t",@s)."\n";
for($i=0;$i<@m;$i++){
print $m[$i];
for($j=0;$j<@s;$j++){
(not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}
print "\n";}}'
$ cat test.tsv | ./GENERATEtable.sh
Contigs contig1 contig2 contig3 contig4 contig5
F1 12.2 0 45.4 23.5 24.5
F2 34.2 56.3 56.3 0 0
F3 45.2 56.2 0 0 0
Best Wishes,
Umer
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You will have to show us some lines from the files and tell us what exactly you are trying to accomplish.
You need to provide far more detail about what, precisely, you want to do and the type of data involved.