Question

Analyzing log2 RPKM Data to Find DEGs (Differentially Expressed Genes)

5.1 years ago

naseerkhan861 ▴ 10

I have log2 RPKM data that I want to analyze to find differentially expressed genes. I am working with the GEO102741 dataset, which contains both control and subject samples.

My questions are the following:

1) The data are given as log2 RPKM values. Do I need to convert them to some other format before running a gene expression analysis, and is there a free online tool that can do this conversion?

2) How can I find DEGs (Differentially Expressed Genes) in this dataset? Is there a free, state-of-the-art online tool that I can use to find DEGs?

I am new to DEGs and RNA-seq datasets, so I apologize if this question is too naive.

What is my GOAL?

My end goal is to cluster the genes in the control and autism groups separately and see how many clusters form. I will then dig deeper into these clusters, for example how the cluster sizes differ between the two groups. After that I will cluster the DEGs as well, repeat these operations on other datasets, and compare the DEGs across datasets. If somebody has useful suggestions, please guide me.

Update

At this link a huge SRA dataset is available, but downloading it and running the suggested software on the data is not possible for me due to limited bandwidth, storage, and computational power.

Regards

RNA-Seq DEG • 1.8k views

ADD COMMENT • link updated 5.1 years ago by ATpoint 85k • written 5.1 years ago by naseerkhan861 ▴ 10

Check whether the dataset comes with raw read counts. I would suggest using edgeR, limma, or DESeq2 for differential gene expression analysis. EdgeR / limma normalizes the count matrix based on the library size. It is not wise to use the normalized expression data you mentioned (RPKM, which is additionally normalized to gene length); find the raw count matrix and then run the tools mentioned above. If you don't have the count matrix, download the raw SRA files and run an alignment and quantification pipeline to generate it.
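To illustrate why RPKM values are hard to turn back into counts, here is a minimal Python sketch of the standard RPKM formula and its inverse. The function names, the example numbers, and the `pseudocount` parameter are assumptions for illustration; whether the deposited table used log2(RPKM) or log2(RPKM + 1) is not stated here.

```python
import math


def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)


def count_from_log2_rpkm(log2_rpkm, gene_length_bp, library_size, pseudocount=0.0):
    """Invert log2(RPKM + pseudocount) back to an approximate read count.

    Needs the gene length and the sample's library size, neither of
    which is part of a bare log2 RPKM table.
    """
    rpkm_value = 2 ** log2_rpkm - pseudocount
    return rpkm_value * gene_length_bp * library_size / 1e9


# Hypothetical gene: 500 reads, 2 kb long, 20 million mapped reads in the sample.
value = rpkm(500, 2000, 20_000_000)
recovered = count_from_log2_rpkm(math.log2(value), 2000, 20_000_000)
print(value, recovered)
```

The inverse only works when the gene length and the per-sample library size are known, and any rounding in the deposited values is lost for good, which is one reason the count-based tools are run on raw counts instead.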

ADD REPLY • link 5.1 years ago by c.chakraborty ▴ 180

The site has a link to SRA, but that dataset is about 650 GB in size. That is impossible for me to download, as I don't have that much storage or computing power.

ADD REPLY • link 5.1 years ago by naseerkhan861 ▴ 10

You can do it chunk-wise: download 10 samples, quantify them, delete the fastq files, and repeat until finished. I am not sure what your download bandwidth is, but this is doable in principle; see my answer on how to efficiently download fastq files from ENA.
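The chunk-wise loop could be organized roughly as in the Python sketch below. The `download`, `quantify`, and `cleanup` callables are hypothetical placeholders that you would implement yourself (for example by calling your download tool and quantifier); only the batching logic is shown.

```python
def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_in_chunks(run_ids, download, quantify, cleanup, chunk_size=10):
    """Download a chunk, quantify each run, then delete its fastq files.

    `download`, `quantify`, and `cleanup` are user-supplied functions
    taking one run accession; they are placeholders, not a real API.
    """
    finished = []
    for chunk in batched(run_ids, chunk_size):
        for run_id in chunk:
            download(run_id)
        for run_id in chunk:
            quantify(run_id)
            cleanup(run_id)  # free the disk space before the next chunk
            finished.append(run_id)
    return finished
```

With this layout, peak disk usage is bounded by one chunk of fastq files plus the small quantification outputs, rather than the full 650 GB.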

ADD REPLY • link 5.1 years ago by ATpoint 85k

EdgeR / limma normalizes the count matrix based on the library size.

I am not very familiar with limma, but the edgeR default is TMM, where the library size is further corrected with a scaling factor that takes the library composition into account. DESeq2's RLE approach is similar.
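To make the difference concrete, here is a small Python sketch of a median-of-ratios size factor calculation in the spirit of the RLE approach, next to plain library sizes. It is a simplified illustration with made-up counts, not the actual edgeR or DESeq2 code, and it does not implement TMM.

```python
import math
from statistics import median


def library_sizes(counts):
    """Total reads per sample; `counts` is a list of gene rows."""
    return [sum(column) for column in zip(*counts)]


def median_of_ratios_size_factors(counts):
    """Per-sample size factors from the median ratio to a pseudo-reference.

    The pseudo-reference is the per-gene geometric mean across samples;
    genes with a zero count in any sample are skipped.
    """
    kept = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in kept]
    n_samples = len(counts[0])
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(row[s]) - ref for row, ref in zip(kept, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: four unchanged genes plus one gene that is 10x higher in sample 2.
toy_counts = [[100, 100], [100, 100], [100, 100], [100, 100], [100, 1000]]
print(library_sizes(toy_counts))
print(median_of_ratios_size_factors(toy_counts))
```

In this toy example the library sizes differ (500 vs 1400), so per-million scaling would make the four unchanged genes look lower in the second sample, whereas the median-based size factors are equal for both samples because most genes did not change.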

ADD REPLY • link 5.1 years ago by ATpoint 85k

Can you please explain your point? I did not fully get it.

ADD REPLY • link 5.1 years ago by naseerkhan861 ▴ 10

Wanted to point out that it is not a naive per-million scaling that edgeR does.

ADD REPLY • link 5.1 years ago by ATpoint 85k

Is there some online tool or resource where I can process that huge SRA dataset, quantify gene expression across samples, and finally find DEGs?

ADD REPLY • link 5.1 years ago by naseerkhan861 ▴ 10

Not that I know of for RNA-seq. For arrays there is GEO2R within the NCBI GEO environment.

ADD REPLY • link 5.1 years ago by ATpoint 85k

Since they did not deposit raw counts, you had better reanalyze the deposited FASTQ files if you want good results. I had a look at the dataset; it is very interesting. But I have a small question: since this is an RNA-seq study, I would expect the samples to have been collected within a certain post-mortem interval (PMI) so that RNA degradation would not have occurred (also, their PCA in supplementary figures 2 & 4 does not look convincing). Neither in the article nor in the GEO repository did the authors provide any details on PMI. Verify these details with the authors before you start the analysis.

ADD REPLY • link 5.1 years ago by EagleEye 7.6k

Thanks for your reply. For this dataset the authors have provided a file "GSE28521_RAW.tar" and a non-normalized file, of size 3.9 MB and 22 MB respectively. Extracting the RAW file gave a file with the .bgx extension. I opened it in Notepad++, but it looked like a metadata file rather than the counts you suggested a RAW file would contain. It was also not clear what the non-normalized file is for. What can I do with these RAW and non-normalized files, or are they useless? Please suggest.

ADD REPLY • link 5.1 years ago by naseerkhan861 ▴ 10

score 1 · Answer 1 · 2019-10-15

I will never understand why authors do not simply upload the raw count matrix. These RPKM-whatever data are utterly useless. Anyway, you can download the fastq files directly from ENA, see Fast download of FASTQ files from the European Nucleotide Archive (ENA)

From there on I suggest you use a lightweight quantifier such as salmon to get transcript abundance estimates for each sample. This is computationally inexpensive and very fast. Then use tximport to summarize these to the gene level (that is, to get a count matrix, see https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html ), followed by a DEG tool of your choice for the differential analysis. For inspiration see e.g. https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
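As a rough picture of what "summarize to the gene level" means, the Python sketch below sums salmon's estimated read counts (the `NumReads` column of `quant.sf`) over the transcripts of each gene. This is a simplified stand-in and not what tximport does internally, since tximport also handles transcript lengths and abundance-based scaling; the transcript and gene IDs are made up.

```python
import csv
import io
from collections import defaultdict


def gene_level_counts(quant_sf_text, tx2gene):
    """Sum NumReads per gene from the text of a salmon quant.sf file.

    `tx2gene` maps transcript IDs to gene IDs; transcripts that are
    missing from the mapping are skipped.
    """
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    totals = defaultdict(float)
    for row in reader:
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue
        totals[gene] += float(row["NumReads"])
    return dict(totals)


# Toy quant.sf content with hypothetical transcript IDs.
toy_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1300.0\t10.0\t10.5\n"
    "tx2\t900\t700.0\t5.0\t4.5\n"
    "tx3\t2000\t1800.0\t20.0\t30.0\n"
)
toy_map = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(gene_level_counts(toy_quant, toy_map))
```

For a real analysis the tximport route described above is the better choice, because the DEG tools can then use the length information it carries along.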