Question

Data preparation for a ML model

0

16 months ago

sil_bioinfo ▴ 50

Hello,

How can I extract useful information from RNA-seq data on the BioStudies/ArrayExpress website? https://www.ebi.ac.uk/biostudies/arrayexpress/studies

I want to create a machine learning model to correctly classify LTBI and active TB patients using this data: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7830

The raw data is: https://www.ebi.ac.uk/ena/browser/view/PRJEB31975

I mean, using R or Python, how can I extract information (genes, biomarkers, etc.) from this data? Could you recommend an R package? I have searched for tutorials and workflows that use this kind of data to get an idea, but I have not found something recent and useful for me now. Could you help me, please? Thank you in advance.

machine-learning RNA-Seq R python biomarkers • 1.8k views

ADD COMMENT • link 15 months ago by sil_bioinfo ▴ 50

1

"but I have not found something recent and useful for me now"

This is RNA-seq data; I suggest starting with https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

ADD REPLY • link 16 months ago by jv ★ 1.8k

0

This workflow is very complete, thank you! Could you help me to find any workflow where RNA-Seq dataset from ArrayExpress is used, please? I just need to know how I could manage this kind of data, since it's too big.

ADD REPLY • link 16 months ago by sil_bioinfo ▴ 50

0

"Could you help me to find any workflow where RNA-Seq dataset from ArrayExpress is used,"

There are no universal workflows. Your best bet would be to review the Methods section of the corresponding publication to see what they did to process the data.

I am surprised to see that only the raw fastq data is provided for this study. Contact the authors of the study and request the raw gene counts and sample meta data.

ADD REPLY • link 16 months ago by jv ★ 1.8k

0

Yeah, I understand. I'm just trying to find a workflow that uses this kind of data from ArrayExpress so I know how to handle it, since the files are so big. It doesn't have to be this dataset, even if I have to work with it later. I'm just trying to find a way to handle these files and get an idea of how to preprocess them. I have found a lot of workflows that use GEO datasets and others, but none that use ArrayExpress.

ADD REPLY • link 16 months ago by sil_bioinfo ▴ 50

0

There's nothing "special" or different about ArrayExpress versus GEO. They are both repositories for sequencing data. The important differences are in how the samples were prepped and what kind of RNA-seq library was made.

ADD REPLY • link 16 months ago by jv ★ 1.8k

0

English is not my first language, sorry if I don't make myself clear. What I'm trying to ask is about the raw data for this study (158 fastq files, about 700 GB in total): https://www.ebi.ac.uk/ena/browser/view/PRJEB31975?show=reads

How could I handle it if I want to preprocess it, e.g. with FastQC, kallisto, etc.? I know I need to use a supercomputer or similar, but how can I run FastQC, for example, on this huge amount of data if I don't think I can even download it to my computer? With GEO I know how to do it, since I have seen several workflows, but with ArrayExpress/ENA I don't.

I think I will ask a new more specific question about this...

ADD REPLY • link 16 months ago by sil_bioinfo ▴ 50
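
On the question above about handling the 158 fastq files without downloading everything blindly: a minimal Python sketch that first lists the files per run and their sizes. It assumes the ENA Portal API `filereport` endpoint and its `fastq_ftp` and `fastq_bytes` fields; the function names are illustrative and not from the thread.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumption: the ENA Portal API "filereport" endpoint.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(project, fields=("run_accession", "fastq_ftp", "fastq_bytes")):
    """Build the file-report URL for a study/project accession."""
    query = urllib.parse.urlencode({
        "accession": project,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def fetch_filereport(project):
    """Download the report as text (needs network access)."""
    with urllib.request.urlopen(filereport_url(project)) as response:
        return response.read().decode("utf-8")


def parse_filereport(tsv_text):
    """Turn the TSV report into one record per run.

    fastq_ftp and fastq_bytes hold ';'-separated values, one per file
    (two files for a paired-end run).
    """
    runs = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        paths = [p for p in row["fastq_ftp"].split(";") if p]
        sizes = [int(s) for s in row["fastq_bytes"].split(";") if s]
        runs.append({
            "run": row["run_accession"],
            # The report lists paths without a scheme, so one is added here.
            "urls": ["ftp://" + p for p in paths],
            "bytes": sum(sizes),
        })
    return runs


def total_gigabytes(runs):
    """Total download size across all runs, in GB (10**9 bytes)."""
    return sum(r["bytes"] for r in runs) / 1e9
```

Calling `parse_filereport(fetch_filereport("PRJEB31975"))` should give one record per run. That makes it possible to check the total size, or to pick two or three runs for testing a pipeline before committing storage and compute to the whole study.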

0

"Contact the authors of the study and request the raw gene counts and sample meta data"

I have both files and the list of differentially expressed (DE) genes. What should I do now?

ADD REPLY • link 15 months ago by sil_bioinfo ▴ 50
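
With a count matrix, sample metadata, and a DE gene list in hand, one common next step is to assemble a samples-by-genes table restricted to the DE genes, with one class label per sample. The sketch below assumes the files have already been read into plain dictionaries; the label strings `"LTBI"` and `"active TB"` and all function and variable names are placeholders, since the actual column names and values in the authors' files are not shown in the thread.

```python
def build_dataset(counts, labels, genes, keep=("LTBI", "active TB")):
    """Assemble a feature matrix X and label vector y for a classifier.

    counts: {gene_id: {sample_id: raw_count}}
    labels: {sample_id: condition}   (taken from the sample metadata)
    genes:  gene IDs to use as features, e.g. the DE gene list
    keep:   conditions to keep (hypothetical label strings)
    """
    # Use only genes that are actually present in the count matrix.
    features = [g for g in genes if g in counts]
    if not features:
        raise ValueError("none of the requested genes are in the count matrix")

    # Keep samples that have a wanted label and a count for every feature.
    samples = sorted(
        s for s, condition in labels.items()
        if condition in keep and all(s in counts[g] for g in features)
    )

    X = [[counts[g][s] for g in features] for s in samples]
    y = [labels[s] for s in samples]
    return samples, features, X, y
```

The rows of `X` and the entries of `y` are in the same sample order, which is what most machine-learning libraries expect. Healthy controls are dropped here only because the stated goal is LTBI versus active TB.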

0

Add more information. For example, attach some lines of your input file.

ADD REPLY • link 16 months ago by Shred ★ 1.6k

0

The dataset is: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7830/sdrf

It's transcriptomic data. This dataset is composed of 158 patients (LTBI, active TB & healthy patients). All the data from patients are here: https://www.ebi.ac.uk/ena/browser/view/PRJEB31975

Since there is so much data, and such big files, I don't know how to handle them... where should I start?

ADD REPLY • link 16 months ago by sil_bioinfo ▴ 50
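
One possible starting point for data of this size is to process one run at a time: download it, quality-check it, quantify it, then delete the fastq files so that only the small outputs accumulate. The sketch below generates the shell commands for that loop. It assumes paired-end runs (not confirmed for this study), that `curl`, FastQC, and kallisto are installed, and that a kallisto index has already been built; `transcripts.idx`, `work`, and the function names are hypothetical.

```python
import shlex


def commands_for_run(run, urls, index="transcripts.idx", workdir="work", threads=4):
    """Return the commands (as argument lists) to process one paired-end run."""
    if len(urls) != 2:
        # Single-end runs would need `kallisto quant --single -l ... -s ...`.
        raise ValueError(f"{run}: expected 2 fastq files (paired-end), got {len(urls)}")

    fastqs = [f"{workdir}/{url.rsplit('/', 1)[-1]}" for url in urls]
    qc_dir = f"{workdir}/qc"
    quant_dir = f"{workdir}/quant/{run}"

    commands = [["mkdir", "-p", qc_dir, quant_dir]]
    for url, path in zip(urls, fastqs):
        commands.append(["curl", "-L", "-o", path, url])      # download
    commands.append(["fastqc", "-o", qc_dir, *fastqs])          # quality report
    commands.append(["kallisto", "quant", "-i", index, "-o", quant_dir,
                     "-t", str(threads), *fastqs])              # quantification
    commands.append(["rm", "-f", *fastqs])                      # free disk space
    return commands


def render_script(commands):
    """Render the commands as a bash script that stops at the first error."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
    lines += [shlex.join(command) for command in commands]
    return "\n".join(lines) + "\n"
```

With this layout, peak disk usage is roughly the size of a single run rather than the whole study, and each run's script can be submitted as a separate job on a cluster. The per-run `abundance.tsv` files that kallisto writes can then be combined into a count matrix.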

1

It's rather bizarre that you want to create a classifier without knowing the source of your data. I guess you could start by... knowing what an RNA-seq experiment is? Things like library size, batch effects, normalization. Your data will never be "ready to use" like a Kaggle dataset; you'll need to apply multiple preprocessing steps before you even start to think about a classifier.

ADD REPLY • link 16 months ago by Shred ★ 1.6k
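
To make the library-size point concrete: raw counts depend on how deeply each sample was sequenced, so they are usually rescaled before modelling. The sketch below shows a simple log2 counts-per-million transform. It is an illustration only, not the exact method of any particular package; the function name and the `prior` pseudocount are assumptions, and it does nothing about batch effects.

```python
import math


def log_cpm(X, prior=1.0):
    """Convert raw counts to log2 counts-per-million.

    X:     list of samples, each a list of raw counts per gene
    prior: pseudocount added to the CPM value to avoid log2(0)
    """
    normalized = []
    for sample_counts in X:
        library_size = sum(sample_counts)
        if library_size == 0:
            raise ValueError("sample has zero total counts")
        normalized.append([
            math.log2(count / library_size * 1e6 + prior)
            for count in sample_counts
        ])
    return normalized


# Two samples with the same expression profile, one sequenced 10x deeper.
shallow = [10, 90]
deep = [100, 900]
a, b = log_cpm([shallow, deep])
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # prints True
```

The raw counts of the two toy samples differ tenfold, but after the transform they are the same, which is the property a classifier needs in order to compare samples. For a real analysis, the normalization functions in established packages such as DESeq2 or edgeR are a safer choice.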

0

Did you read the question? That's exactly what I'm asking: how can I handle this kind of big data so that I can use it later to build an ML model? I'm not asking for help building the model, I'm asking for help on how to handle this kind of data. Thank you.

ADD REPLY • link 16 months ago by sil_bioinfo ▴ 50