18 months ago
turcoa1
Hello,
I am trying to run a differential gene expression analysis on TCGA data imported into R with `GDCquery`. I want to use DESeq2 because it is widely used and I want publication-quality results. I am struggling to work out where the "raw" counts are in the STAR - Counts output. Does anyone know whether the "unstranded" column is truly raw counts, or is it RSEM data? I need raw counts to work with for DESeq2. Any help with these questions would be appreciated.
What data are you downloading? Can you show us your GDCquery commands please?
This is my query. I am trying to download gene expression data in raw form for differential gene expression analysis. I kept males and females separate, so `listSample` in this case is the barcodes for all the male samples or all the female samples.
Please provide your entire code so those of us wishing to do a quick test can download and check for ourselves. Also, please use the formatting bar (especially the `code` option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
My entire code is not done. All I am trying to understand is whether the STAR - Counts data downloaded from this query contain raw reads. There are headings like "unstranded" and "stranded", and then the normalization ones as well, but I am wondering whether the "unstranded" column holds the raw reads or whether they have been normalized in some way (RSEM). I cannot find this information online.
There is at least one more step that you haven't shared, isn't there? This line alone does not get you the data. You don't need to give us all values of `listSamples`, for instance, but 2-3 sample values should help. Look up reprex - that's the kind of code you need to give us.
STAR has a GeneCounts output (`--quantMode GeneCounts`) that looks as you describe: three count columns, depending on whether one thinks the prep is unstranded, or stranded so that R1 goes towards the 3' end, or stranded so that R1 goes towards the 5' end. Illumina stranded TruSeq should be the last option. You should be able to tell by eyeballing which column to use, and it should be clear that they are counts because they will be whole numbers.
RSEM can be run on STAR output, if it is asked to create a transcriptome-based output. But if you have STAR counts, then they are just counts.
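If eyeballing is not convincing enough, the check can be scripted. The following is only a minimal sketch in Python, and several things in it are assumptions rather than facts from this thread: the column names `stranded_first`, `stranded_second` and `tpm_unstranded`, the `N_` prefix on the summary rows, and all of the numbers, which are made up for illustration. It reports, per column, whether every value is a whole number and what the column total is.

```python
import csv
import io

# Hypothetical excerpt shaped like a per-sample STAR gene-count table.
# Column names and all values are illustrative assumptions, not real TCGA data.
EXAMPLE_TSV = "\n".join([
    "gene_id\tunstranded\tstranded_first\tstranded_second\ttpm_unstranded",
    "N_noFeature\t1000\t90000\t2000\t",
    "GENE_A\t120\t3\t118\t4.52",
    "GENE_B\t45\t1\t44\t1.70",
    "GENE_C\t0\t0\t0\t0.00",
]) + "\n"


def summarize_columns(tsv_text, columns):
    """Report, per column, whether all values are whole numbers and their total."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = {name: {"all_whole": True, "total": 0.0} for name in columns}
    for row in reader:
        # Skip summary rows such as N_noFeature; they are not genes.
        if row["gene_id"].startswith("N_"):
            continue
        for name in columns:
            value = float(row[name])
            if not value.is_integer():
                summary[name]["all_whole"] = False
            summary[name]["total"] += value
    return summary


columns_to_check = ["unstranded", "stranded_first", "stranded_second", "tpm_unstranded"]
summary = summarize_columns(EXAMPLE_TSV, columns_to_check)
for name in columns_to_check:
    print(name, summary[name])
```

A column that contains only whole numbers is consistent with raw counts, while a normalized column will almost always contain fractional values. Comparing the totals can also hint at strandedness: in a stranded prep, one of the stranded columns should come close to the unstranded total and the other should be far smaller.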