Question

Differential Gene Expression analysis in Bulk RNA Seq - using Count Matrix as input

16 months ago

applepie ▴ 10

Hello everyone, I am going to run a differential gene expression (DEG) analysis on bulk RNA-seq data. The samples are NAFLD samples downloaded from the NCBI Gene Expression Omnibus (GEO) (link to the dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135251). When I tried to download the dataset, I noticed that a large number of count matrices are provided (see the attached photo). I have two questions:

1) Is it normal to have so many count matrices?
2) If yes, which count matrix should I use for downstream DEG analysis with DESeq2? Or should I use all of the count matrices?

Thank you!


DESeq2 BulkRNASeq • 1.7k views

ADD COMMENT • link updated 13 months ago by ATpoint 85k • written 16 months ago by applepie ▴ 10

score 2 · Answer 1 · 2023-08-01

Each file contains one column with the counts for that sample. You can load them all into R and combine them into a single matrix of raw counts. To do this, download the files via Select All, which gives you a tarball (.tar). Unpack the tarball with tar xf that.tar, then use this snippet in R:

# list all files from the tarball (unpack tarball in bash with tar xf tarball.tar)
listed <- list.files("/Users/atpoint/Downloads/data/", pattern="^GSM", full.names=TRUE)
listed <- grep("txt.gz$", listed, value=TRUE)

# load every single file
raw.counts <- lapply(listed, function(x){

  # first column holds the gene IDs (used as row names), second column the counts
  r <- read.delim(x, header=FALSE, row.names=1)
  # name the single count column after its own file
  colnames(r) <- gsub("\\.counts.*", "", basename(x))
  r

})

# combine column-wise into one genes x samples table
raw.counts <- do.call(cbind, raw.counts)
raw.counts[1:3,1:3]
                GSM3998167_017-Ann-Daly_S1 <2nd sample> <3rd sample>
ENSG00000000003                       2565         2400         2391
ENSG00000000005                          0           14            0
ENSG00000000419                        605          525          709

You can then use this matrix for DE analysis with DESeq2, edgeR, or limma.
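As a rough illustration of that last step, here is a minimal sketch of handing a combined count matrix to DESeq2. Everything in it is assumed for the example: the counts are simulated rather than taken from GSE135251, and the sample names, the `condition` column and its two levels (`control`, `NAFLD`) are hypothetical stand-ins for the real sample annotation, which would have to come from the GEO series metadata. The DESeq2 part only runs if the package is installed.

```r
set.seed(1)

# Simulated stand-in for raw.counts: 500 genes x 6 samples (hypothetical names)
n.genes <- 500
samples <- paste0("sample", 1:6)
mu <- 2^runif(n.genes, 3, 10)          # per-gene mean count
size <- 1 / (0.1 + 4 / mu)             # per-gene inverse dispersion
counts <- matrix(rnbinom(n.genes * length(samples), mu = mu, size = size),
                 nrow = n.genes,
                 dimnames = list(paste0("gene", seq_len(n.genes)), samples))

# Hypothetical sample annotation; rows must be in the same order as the count columns
coldata <- data.frame(condition = factor(rep(c("control", "NAFLD"), each = 3),
                                         levels = c("control", "NAFLD")),
                      row.names = samples)
stopifnot(identical(colnames(counts), rownames(coldata)))

# Run the standard DESeq2 workflow only if the package is available
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)
  dds <- DESeq2::DESeq(dds)
  res <- DESeq2::results(dds, contrast = c("condition", "NAFLD", "control"))
  print(head(res[order(res$padj), ]))
}
```

With the real data you would pass `as.matrix(raw.counts)` as `countData` and build `coldata` from the actual sample annotation, keeping its rows in the same order as the columns of the count matrix.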

score 1 · Answer 2 · 2023-08-01

16 months ago

Pei ▴ 220

I guess that each counts.txt.gz file is just one sample, so you will find a total of 216 counts.txt.gz files. Each file can be used as one column in your count matrix for the downstream DE analysis. Am I right?

score 0 · Answer 3 · 2023-08-01

16 months ago

Ayeh • 0

As I understand it, every counts.txt.gz file corresponds to one sample, and you can create the count matrix for DEG analysis by combining the txt files column-wise.
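One thing worth checking before combining column-wise is that every file lists the same gene IDs in the same order, since cbind pairs rows by position. Below is a minimal sketch of that check on three small made-up files written to a temporary directory. The file names, gene IDs and counts are invented for the example and only mimic a two-column, header-less layout; they are not taken from the dataset.

```r
# Write three tiny, made-up per-sample count files (gzip-compressed, no header)
tmp <- file.path(tempdir(), "demo_counts")
dir.create(tmp, showWarnings = FALSE)
genes <- c("geneA", "geneB", "geneC")
demo <- list(GSM0000001_s1 = c(10, 0, 5),
             GSM0000002_s2 = c(12, 3, 7),
             GSM0000003_s3 = c(9, 1, 4))
for (nm in names(demo)) {
  con <- gzfile(file.path(tmp, paste0(nm, ".counts.txt.gz")), "w")
  write.table(data.frame(genes, demo[[nm]]), con, sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = FALSE)
  close(con)
}

files <- list.files(tmp, pattern = "^GSM.*\\.txt\\.gz$", full.names = TRUE)

# Read each file into a one-column data frame named after its file
per.sample <- lapply(files, function(x) {
  r <- read.delim(x, header = FALSE, row.names = 1)
  colnames(r) <- sub("\\.counts.*", "", basename(x))
  r
})

# All files must list the same gene IDs in the same order
same.genes <- vapply(per.sample,
                     function(r) identical(rownames(r), rownames(per.sample[[1]])),
                     logical(1))
stopifnot(all(same.genes))

# Combine column-wise and make sure every sample has its own column name
combined <- as.matrix(do.call(cbind, per.sample))
stopifnot(!anyDuplicated(colnames(combined)))
dim(combined)   # genes x samples
```

If the gene-order check fails for real files, merging on the gene ID instead of using cbind would be the safer route.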
