Question

RNAseq Data and Pipeline

0

2.2 years ago

turcoa1 • 0

Hello,

I am trying to run a differential gene expression analysis on TCGA data imported into R with GDCquery. I want to use DESeq2 because it is widely used and I want publication-quality results. I am struggling to work out where the "raw" counts are in the STAR - Counts output. Does anyone know whether the "unstranded" column holds true raw counts, or is it RSEM data? DESeq2 needs raw counts, so I want to be sure I am using the right column. Any help with these questions would be appreciated.

RNA-Seq DESeq2 Differential-Gene-Expression • 1.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 2.2 years ago by turcoa1 • 0

0

What data are you downloading? Can you show us your GDCquery commands please?

ADD REPLY • link 2.2 years ago by Ram 45k

0

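```r
library(TCGAbiolinks)

# listSamples: character vector of TCGA barcodes (male or female samples)
query <- GDCquery(
  project = "TCGA-THCA",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts",
  barcode = listSamples
)
```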

This is my query. I am trying to download gene expression data in raw form for differential gene expression analysis. I keep males and females separate, so listSamples here holds the barcodes for either all the male samples or all the female samples.

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 2.2 years ago by turcoa1 • 0

0

Please provide your entire code so those of us wishing to do a quick test can download and check for ourselves. Also, please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.

ADD REPLY • link 2.2 years ago by Ram 45k

0

My entire code is not done yet. All I am trying to understand is whether the STAR - Counts data downloaded with this query contain raw counts. There are columns such as "unstranded" and "stranded", plus the normalized ones, and I am wondering whether the "unstranded" column holds raw counts or has been normalized in some way (RSEM). I cannot find this information online.

ADD REPLY • link 2.2 years ago by turcoa1 • 0

0

There is at least one more step that you haven't shared, isn't there? This line alone does not get you the data. You don't need to give us all values of listSamples, for instance, but 2-3 sample values should help. Look up reprex - that's the kind of code you need to give us.

ADD REPLY • link 2.2 years ago by Ram 45k

0

STAR has a GeneCounts output that looks as you describe: three count columns, one for each assumption about the prep, namely unstranded, stranded so that R1 goes towards the 3' end, or stranded so that R1 goes towards the 5' end. Illumina stranded TruSeq should be the last option. You should be able to tell which column to use by eyeballing them, and it should be clear that they are counts because they will be whole numbers.

RSEM can be run on STAR output, if STAR is asked to create transcriptome-based output. But if you have STAR counts, then they are just counts.
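
The eyeballing step can be sketched in base R. This is only an illustration on a made-up table laid out like STAR's GeneCounts file (four N_* summary rows, then one row per gene with three count columns). The gene IDs and numbers are invented, the column names are borrowed from the headings mentioned in this thread (STAR's own file has no header), and `pick_count_column` with its ratio cutoff of 5 is a hypothetical helper, not part of STAR or any package.

```r
# Toy table in STAR's GeneCounts layout; all values are invented for illustration.
star_txt <- "N_unmapped\t100\t100\t100
N_multimapping\t50\t50\t50
N_noFeature\t30\t900\t40
N_ambiguous\t5\t1\t4
geneA\t120\t3\t117
geneB\t80\t2\t78
geneC\t45\t0\t45"

counts <- read.delim(
  text = star_txt, header = FALSE,
  col.names = c("gene_id", "unstranded", "stranded_first", "stranded_second")
)

# Drop the four N_* summary rows so only genes remain.
genes <- counts[!grepl("^N_", counts$gene_id), ]

# Hypothetical helper: compare the two stranded totals to guess the library type.
# The cutoff (ratio = 5) is an arbitrary choice for this sketch.
pick_count_column <- function(df, ratio = 5) {
  first <- sum(df$stranded_first)
  second <- sum(df$stranded_second)
  if (first > ratio * second) return("stranded_first")
  if (second > ratio * first) return("stranded_second")
  "unstranded"
}

chosen <- pick_count_column(genes)

# Raw counts are whole numbers; normalized values (TPM, FPKM) generally are not.
is_whole <- all(genes[[chosen]] == round(genes[[chosen]]))
```

In this toy table nearly all reads land in the last column, which is the pattern you would expect from a reverse-stranded prep; with real data the split may be less clean, so treat the helper as a starting point rather than a rule.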

ADD REPLY • link 2.2 years ago by swbarnes2 15k

score 1 · Answer 1 · 2023-05-30

1

2.2 years ago

ATpoint 89k

STAR counts are the raw counts, and those are appropriate for DESeq2.
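
A rough sketch of how that could look for a TCGAbiolinks query like the one in this thread, assuming the data are loaded with `GDCdownload` and `GDCprepare` and that the resulting SummarizedExperiment has an assay named "unstranded". The helper name `make_dds` and the `sample_type` design column are assumptions for illustration only; check `assayNames()` and `colData()` on your own object, and pick a design variable that actually has more than one level in your samples.

```r
# Sketch only: needs the SummarizedExperiment and DESeq2 packages when called.
# `se` is assumed to be the object returned by TCGAbiolinks::GDCprepare(query).
make_dds <- function(se, design = ~ sample_type, assay_name = "unstranded") {
  stopifnot(assay_name %in% SummarizedExperiment::assayNames(se))
  counts <- SummarizedExperiment::assay(se, assay_name)
  # DESeq2 expects non-negative integers, i.e. raw counts.
  stopifnot(all(counts == round(counts)), all(counts >= 0))
  DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = SummarizedExperiment::colData(se),
    design = design
  )
}

# Assumed usage, after GDCdownload(query):
# se  <- TCGAbiolinks::GDCprepare(query)
# dds <- make_dds(se)
```

The whole-number check is there so that a normalized assay (TPM or FPKM) passed in by mistake fails early instead of reaching DESeq2.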

ADD COMMENT • link 2.2 years ago by ATpoint 89k