4.6 years ago
mi_kappa
Hi,
In an R dataset (in assays), both counts and normalized counts are stored. Is it safe to assume that counts are the raw counts? If not, is there a way to check whether one has the raw counts?
Have you checked to see if you have integers only or real numbers? Raw counts should generally be whole integers.
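A minimal sketch of that check, written in plain Python so it is easy to adapt. The function name `is_integer_valued` and both example matrices are hypothetical and not taken from the dataset discussed here. It only tells you whether every value is a non-negative whole number; passing the check is consistent with raw counts but does not prove it.

```python
def is_integer_valued(matrix, tol=0.0):
    """Return True if every entry is a non-negative whole number.

    `matrix` is a list of rows (genes), each a list of per-sample values.
    Values stored as floats (e.g. 12.0) still pass if they are whole.
    """
    return all(
        value >= 0 and abs(value - round(value)) <= tol
        for row in matrix
        for value in row
    )


# Hypothetical example matrices (genes x samples), for illustration only.
raw_like = [[1200, 950, 1010], [0, 3, 1]]
normalized_like = [[1105.4, 1018.9, 977.2], [0.0, 3.2, 0.9]]

print(is_integer_valued(raw_like))         # True
print(is_integer_valued(normalized_like))  # False
```

Fractional values point to normalized or estimated counts; whole numbers are necessary for raw counts but not sufficient on their own.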
These would be integers, right?
ENSG00000000419 29534 23742 24648 18752 16204
These are indeed integers, but there are tools like tximport which aggregate transcript-level abundances to raw gene-level counts, and those are floats. Is there any background on these data and how they were produced?
I did check whether the medians align in the counts matrix, and indeed they don't.
The quality of the raw reads was checked using FastQC, the adaptors were clipped using cutadapt, and Sickle was used to trim low-quality ends from the reads. Read alignment was performed using STAR and mapping statistics from the BAM files were acquired through SAMtools flagstat. The expression on the gene, exon, exon ratio and poly(A) ratio levels used Ensembl v.71 annotation. The reason I am asking is that after running DESeq on what I believed to be raw counts, I get some very weird results. I know that DESeq requires raw counts, and I wanted to make sure whether there is an issue with the counts matrix. I am using DESeqDataSetFromMatrix. From what I described, do you think I should be using something else?
Note: there are no outliers in the dataset and I have pre-filtered low count genes.
Please forget what I previously said about the medians. I did not properly test this suggestion, so it might not be correct. Do you have access to the raw fastq files? If so, why not simply rerun everything to be sure?
Thank you for your help! I do not have access to the FASTQ files. I have been given access to what is called counts and norm.counts in the assays because I am performing differential gene expression analysis. For that I am using the counts (assuming they are raw counts). Do you by any chance know if DESeq would give me an error or a warning if I was not using raw counts?
It is always best to ask the source of data for clarification in cases such as this because a wrong assumption can ruin your work.
Based on your description it does sound like counts are raw. You can test by normalizing them in DESeq2. Why are you filtering low counts? Use the data as is (which is what the original person may have done).
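One way to picture the "normalize them and see" test is a simplified median-of-ratios calculation, which is the idea behind DESeq2's size factor estimation. The sketch below is plain Python and is not DESeq2's own code; the function name `size_factors` and the example matrix are hypothetical. The working assumption, which is only a heuristic, is that raw counts from libraries of different depth give size factors that differ between samples, while a matrix that was already normalized this way gives factors close to 1. Raw libraries of very similar depth would also give factors near 1, so this cannot confirm anything on its own.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples matrix (simplified)."""
    n_samples = len(counts[0])
    # Use only genes with no zero counts, so the log geometric mean is finite.
    usable = [row for row in counts if all(v > 0 for v in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(v) for v in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical raw-like matrix: sample 2 was sequenced twice as deeply as sample 1.
raw_like = [[10, 20], [100, 200], [50, 100]]
sf = size_factors(raw_like)
print([round(f, 3) for f in sf])  # [0.707, 1.414]

# Dividing by the size factors and re-estimating gives factors of about 1.
normalized = [[v / f for v, f in zip(row, sf)] for row in raw_like]
print([round(f, 3) for f in size_factors(normalized)])  # [1.0, 1.0]
```

If the factors estimated from counts are clearly different from 1, and dividing counts by them roughly reproduces norm.counts, that would support counts being the raw matrix.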
I am waiting to get clarifications; I was just wondering whether there was a way to check myself. Pre-filtering of low count genes is an optional step before you run DESeq. I am only removing rows with no reads or nearly no reads (0 or 1) to reduce the memory size of the dds object and increase the speed of the transformation and testing functions within DESeq.
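For readers following along, the pre-filtering described above amounts to dropping rows whose total count across samples is 0 or 1. A minimal sketch in plain Python, where the function name `prefilter`, the `min_total` parameter, and the gene IDs are all hypothetical:

```python
def prefilter(counts, gene_ids, min_total=2):
    """Keep genes whose summed count across all samples is at least min_total."""
    kept = [(g, row) for g, row in zip(gene_ids, counts) if sum(row) >= min_total]
    return [g for g, _ in kept], [row for _, row in kept]


# Hypothetical genes x samples matrix, for illustration only.
gene_ids = ["geneA", "geneB", "geneC", "geneD"]
counts = [
    [120, 98, 143],  # kept: clearly expressed
    [0, 0, 0],       # dropped: no reads
    [0, 1, 0],       # dropped: a single read in total
    [1, 1, 0],       # kept: total of 2
]

kept_ids, kept_counts = prefilter(counts, gene_ids)
print(kept_ids)  # ['geneA', 'geneD']
```

A filter this light only removes rows that carry essentially no information, so it is unlikely to be the cause of unexpected results.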