Hey, this should get you started on aggregating the counts. It may be a little quick and dirty, since I am using cbind and not merge by genes, but it should be enough to get going.
First, download the supplementary file here:
Download the file "GSE162562_RAW.tar" and extract it to your Desktop so that you have a GSE162562_RAW folder there. The following commands should then aggregate the counts together.
#Un GZIP the count files
system("gunzip ~/Desktop/GSE162562_RAW/*.gz")
#get the list of sample names
GSMnames <- list.files("~/Desktop/GSE162562_RAW", full.names = FALSE)
#remove the .txt extension from file/sample names (escaped so the dot is literal)
GSMnames <- sub(pattern = "\\.txt$", replacement = "", GSMnames)
#make a vector of the list of files to aggregate
files <- list.files("~/Desktop/GSE162562_RAW", full.names = TRUE)
#check if there is the same number of rows in all samples
system("wc -l ~/Desktop/GSE162562_RAW/*.txt")
#there are 26369 rows so by extension there should be 26369 genes
#load the gene names up (the files are tab-separated, so the first column is the gene)
genes <- read.table(files[1], header=FALSE, sep="\t", stringsAsFactors=FALSE)[,1]
#make the raw aggregated data frame of all the counts
df <- do.call(cbind,lapply(files,function(fn)read.table(fn,header=FALSE, sep="\t")[,2]))
#convert to a data frame so the counts stay numeric (cbind with the gene names would turn everything into text)
df <- as.data.frame(df)
#change row names to gene names
row.names(df) <- genes
#change column names to sample names
colnames(df) <- as.character(GSMnames)
#cleanup
rm(files, genes, GSMnames)
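If you would rather not trust cbind, here is a minimal sketch of the merge-by-genes approach. It assumes each file has two tab-separated columns (gene, count) and no header, same as above. The toy files, the `read_counts` helper and the sample names are made up for illustration and are not part of the real dataset.

```r
# Toy stand-ins for the per-sample count files (hypothetical names and values)
tmp <- file.path(tempdir(), "toy_counts")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("geneA\t10", "geneB\t0", "geneC\t5"), file.path(tmp, "GSM1_sample.txt"))
writeLines(c("geneC\t7", "geneA\t3", "geneB\t1"), file.path(tmp, "GSM2_sample.txt"))

files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Read one file into a two-column data frame; the count column is named after the sample
read_counts <- function(fn) {
  sample_name <- sub("\\.txt$", "", basename(fn))
  read.table(fn, header = FALSE, sep = "\t", col.names = c("gene", sample_name),
             check.names = FALSE, stringsAsFactors = FALSE)
}
tables <- lapply(files, read_counts)

# cbind is only safe if every file lists the same genes in the same order
same_order <- all(vapply(tables, function(x) identical(x$gene, tables[[1]]$gene), logical(1)))

# merge() matches rows on the gene column, so the order within each file does not matter
merged <- Reduce(function(a, b) merge(a, b, by = "gene"), tables)

# Move the gene column into the row names and keep only the counts
counts <- as.matrix(merged[, -1, drop = FALSE])
rownames(counts) <- merged$gene
```

If `same_order` comes back TRUE for the real files, the cbind version gives the same result and is faster. Note that merge() with default settings keeps only genes present in every file.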
Then you can plug these counts into DESeq2 or edgeR. You may have to build appropriate metadata (a sample table) so that you can set up your comparisons and generate a list of differentially expressed genes by following the DESeq2 workflow.
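As a rough sketch of what that metadata could look like, the snippet below builds a sample table from the column names of a toy count matrix. It assumes the group label is the text after the last underscore in the sample name (as in GSM4954457_A_1_Asymptom). The "Healthy" label, the sample names and the counts are invented for the example, so check the real group labels on GEO before using this.

```r
# Toy count matrix (made-up values); sample names mimic the GSM..._<group> pattern
set.seed(1)
samples <- c("GSM1_A_1_Asymptom", "GSM2_A_2_Asymptom", "GSM3_H_1_Healthy", "GSM4_H_2_Healthy")
counts <- matrix(rpois(40, lambda = 20), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), samples))

# Assumption: the condition is whatever follows the last underscore in the sample name
coldata <- data.frame(
  condition = factor(sub(".*_", "", colnames(counts))),
  row.names = colnames(counts)
)

# DESeq2 expects the metadata rows in the same order as the count matrix columns
stopifnot(identical(rownames(coldata), colnames(counts)))

# Build the DESeq2 object only if the package is installed
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)
  # On the real data the next steps would be:
  #   dds <- DESeq2::DESeq(dds)
  #   res <- DESeq2::results(dds)
}
```

The DESeq() and results() calls are left as comments because the toy matrix is too small and too uniform for a meaningful fit.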
Ideally though, you may want to disregard everything I typed above this, because I think it could be in your best interest to do what rpolicastro was mentioning in the first comment here:
Alternatively, GEO provides links to the accompanying SRA entry
containing the fastq files for those samples. With the fastq files you
can run through a workflow such as Salmon + DESeq2 to find
differentially expressed genes.
which is to download the raw FASTQ files and then plug them into Salmon + DESeq2. You have so much more control over everything that way, in my opinion. I, personally, like to be in control... This may require jumping through a few more hoops, like installing conda and also snakemake, if you use the tutorial rpolicastro linked. It's not too bad though.
To download the fastq files, I, personally, use the sra-explorer website and aspera (aspera allows you to download fastq files much faster): sra-explorer : find SRA and FastQ download URLs in a couple of clicks
You could google how to download and install aspera... or check out just Step 1 of this tutorial: [Deprecated] Fast download of FASTQ files from the European Nucleotide Archive (ENA) (mind you, there is a newer version of aspera out now, so some of the steps might be a bit different, but if you want to go down this route, this should get you started; remember, you only need Step 1 of this tutorial).
EDIT 08.27.2021: Basically, if you want to go the fastq file route, you should download and install aspera, add it to your $PATH, and then use sra-explorer to get the aspera download links/commands. You can save the aspera download commands to a .sh file and run it in the terminal to quickly download all of the fastq files you need.
The count files are provided for each sample as a tsv in supplementary files, which you can aggregate and use as input to DESeq2 or edgeR. Alternatively, GEO provides links to the accompanying SRA entry containing the fastq files for those samples. With the fastq files you can run through a workflow such as Salmon + DESeq2 to find differentially expressed genes.
I am not seeing any count file.
Go to a sample page (such as this), go to the bottom where it says "Supplementary file" and there should be a text file corresponding to the sample (e.g. GSM4954457_A_1_Asymptom.txt.gz).
But how can I aggregate them and use them as input to DESeq2 or edgeR? I have tried this:
Can I proceed with this?
When I click on SRA Run Selector, I do not find any fastq files. Please help me.
For sra-explorer, check out that link again: sra-explorer : find SRA and FastQ download URLs in a couple of clicks
You have to use SRA-explorer to search your SRA number for your dataset of interest.
So if you go to your dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE162562&fbclid=IwAR0iZQhttG8HzGhFIIMWbFgNszQrVDgiyVChYzQ_ypCx_d-1pn_tm7STjGs
and then scroll to where it says SRA (above Download family). You will see
SRP295561
This is the SRA number you search for, and then I think you can follow the tutorial using the links above. There may be a better way, such as using the sra-toolkit or something, but I kind of like this way because you can use aspera, which allows you to download the large fastq files faster. What sra-explorer.info will provide are commands for you to put into a bash/shell script and run to download the files. Make sure you have enough storage on your computer/server/etc.
After finding DEGs with DESeq2, how can I apply topGO to do functional enrichment analysis?