Hello,
How can I extract useful information from RNA-seq data hosted on the BioStudies ArrayExpress website? https://www.ebi.ac.uk/biostudies/arrayexpress/studies
I want to create a machine learning model to correctly classify LTBI and active TB patients using this data: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7830
The raw data is: https://www.ebi.ac.uk/ena/browser/view/PRJEB31975
I mean, using R or Python, how can I extract information (genes, biomarkers, etc.) from this data? Could you recommend an R package? I have searched for tutorials and workflows that use this kind of data to get an idea, but I have not found anything recent that is useful for me. Could you help me, please? Thank you in advance.
This is RNA-seq data; I suggest starting with https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
This workflow is very complete, thank you! Could you help me find a workflow that uses an RNA-seq dataset from ArrayExpress, please? I just need to know how to manage this kind of data, since it is so big.
There are no universal workflows. Your best bet would be to review the Methods section of the corresponding publication to see how they processed the data.
I am surprised to see that only the raw fastq data is provided for this study. Contact the authors of the study and request the raw gene counts and sample metadata.
Yeah, I understand. I'm just trying to find a workflow that uses this kind of data from ArrayExpress, to learn how to handle it, since the files are so big. It doesn't have to be this dataset, even if I have to work with it later. I just want to find a way to handle these files and get an idea of how to preprocess them. I have found a lot of workflows that use GEO datasets and others, but none that use ArrayExpress.
There's nothing "special" or different about ArrayExpress versus GEO. They are both repositories for sequencing data. The important differences are in how the samples were prepped and what kind of RNA-seq library was made.
English is not my first language, sorry if I'm not making myself clear. What I'm trying to ask concerns the raw data for this study (158 fastq files, about 700 GB in total): https://www.ebi.ac.uk/ena/browser/view/PRJEB31975?show=reads
How could I handle it if I want to preprocess it, e.g. with FastQC, kallisto, etc.? I know I would need a supercomputer or similar, but how can I run something like FastQC on this much data if I don't think I can even download it to my computer? With GEO I know how to do it, since I have seen several workflows, but with ArrayExpress/ENA I don't.
I think I will ask a new, more specific question about this.
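One way to approach the download problem raised above is to never hold all the files at once: list the FASTQ files for the study first, then process one run at a time. Below is a minimal Python sketch, assuming the ENA Portal API "filereport" endpoint with the fields `run_accession`, `fastq_ftp` and `fastq_bytes`. The function names `filereport_url` and `parse_filereport` are made up for this illustration, and it is worth checking the field names against the ENA documentation before relying on them.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the URL of a TSV report listing the FASTQ files of a study."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_bytes",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def parse_filereport(tsv_text):
    """Return one dict per run: accession, FASTQ locations, total size in bytes."""
    runs = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Paired-end runs list both files in one cell, separated by ';'.
        urls = [u for u in row["fastq_ftp"].split(";") if u]
        sizes = [int(s) for s in row["fastq_bytes"].split(";") if s]
        runs.append({"run": row["run_accession"], "urls": urls, "bytes": sum(sizes)})
    return runs


if __name__ == "__main__":
    # Print the URL to fetch; download the TSV with any HTTP client.
    print(filereport_url("PRJEB31975"))
```

With the parsed list you can loop over the runs: download the files of one run, run FastQC and the quantification (kallisto, for example) on them, keep only the small output files, delete the FASTQ files and move on to the next run. That way the disk only ever has to hold one sample, and the same loop can be split across jobs on a cluster.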
I have both files and the list of differentially expressed (DE) genes. What should I do now?
Add more information. For example, attach some lines of your input file.
The dataset is: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7830/sdrf
It's transcriptomic data. This dataset is composed of 158 patients (LTBI, active TB & healthy patients). All the data from patients are here: https://www.ebi.ac.uk/ena/browser/view/PRJEB31975
Since there is so much data, and such big files, I don't know how to handle them... where should I start?
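A possible first step, before touching any FASTQ file, is to get the class label of every run out of the SDRF table linked above, since that file is small. The following is a rough Python sketch, assuming the SDRF is a tab-delimited text file; the column names `Comment[ENA_RUN]` and `Characteristics[disease]` are assumptions and need to be checked against the actual header of this study's SDRF, and `labels_from_sdrf` is a made-up helper name.

```python
import csv
import io
from collections import Counter


def labels_from_sdrf(sdrf_text,
                     run_col="Comment[ENA_RUN]",           # assumed column name
                     label_col="Characteristics[disease]"):  # assumed column name
    """Map each run accession to its class label from tab-delimited SDRF text."""
    labels = {}
    for row in csv.DictReader(io.StringIO(sdrf_text), delimiter="\t"):
        # An SDRF may contain several rows per run (e.g. one per FASTQ file);
        # keying the dict by run accession collapses them into one entry.
        labels[row[run_col]] = row[label_col]
    return labels


def class_counts(labels):
    """Count how many runs fall into each class."""
    return Counter(labels.values())
```

The resulting mapping from run accession to label is what a classifier would later use as its target, and the class counts show early on how unbalanced the groups are.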
It's rather bizarre that you want to create a classifier without knowing the source of your data. I guess you could start by learning what an RNA-seq experiment is: things like library size, batch effects and normalization. Your data will never be "ready to use" like a Kaggle dataset; you will need to apply multiple preprocessing steps before you even start to think about a classifier.
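To illustrate why library size matters: two samples with the same expression profile but different sequencing depth have very different raw counts, so counts are usually rescaled before any modelling. Here is a minimal pure-Python sketch of one simple option, log2 counts per million (CPM). The function name `log_cpm` and the `prior` pseudocount are illustrative choices, and this is not the normalization that DESeq2 or edgeR apply, nor does it address batch effects.

```python
import math


def log_cpm(counts, prior=1.0):
    """Convert raw counts to log2(CPM + prior).

    counts: dict mapping sample name -> dict mapping gene name -> raw count.
    prior:  pseudocount added to the CPM so that zero counts can be logged.
    """
    result = {}
    for sample, gene_counts in counts.items():
        lib_size = sum(gene_counts.values())  # total reads counted in this sample
        if lib_size == 0:
            raise ValueError(f"sample {sample!r} has no counts")
        result[sample] = {
            gene: math.log2(count / lib_size * 1e6 + prior)
            for gene, count in gene_counts.items()
        }
    return result


if __name__ == "__main__":
    # Toy data: s2 was sequenced 10x deeper than s1 with the same proportions.
    toy = {"s1": {"geneA": 10, "geneB": 30}, "s2": {"geneA": 100, "geneB": 300}}
    print(log_cpm(toy))
```

In the toy data the two samples end up with the same values after rescaling, even though their raw counts differ tenfold, which is the point of correcting for library size.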
Did you read the question? That's exactly what I'm asking: how can I handle this kind of big data so that I can use it later to build an ML model? I'm not asking for help building the model, I'm asking for help with handling this kind of data. Thank you.