Entering edit mode
6.1 years ago
maryak
▴
20
Is there any way to determine that have i calculated the right CPM values from htseq -counts data? i calculated CPM using R
Is there any way to determine that have i calculated the right CPM values from htseq -counts data? i calculated CPM using R
Hey, please paste the commands that you have used, right from the word 'go' (i.e. from the beginning of your script) and also what are your experiment goals?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
My goal is to identify cancer related genes provided htseq-counts data
Okay, cool! The only thing that I would say is if you are aiming to derive data for downstream analyses, like network analysis, machine learning, et cetera, then you may want to add
log=TRUE
to yourcpm()
function in order to produce log [base 2] CPM.Also, you should then plot the CPM data via the hist() function. It should roughly follow a normal distribution but may have 2 'humps'.
The histogram that i obtained has a long bar at the start and rest all of the bars are very small .The frequency of values ranging between 0 and 1 is higher??? is that what you mean by normal distribution??
No, does not sound normal at all. Sounds like a negative binomial. Did you specify
log=TRUE
?Yes.i did added log=true in cpm function
Without seeing your histogram, I cannot really comment further. You may increase the value of
breaks
that you pass to the histogram.Edit: I also assume that you filtered out low count transcripts prior to normalisation. Is there any reason why your data would have very high low-count transcripts?
why are you so much emphasizing on normal distribution ?? why it is not ok if my normalized data have negative values?? i must have uploaded histogram if there is some option to upload image.May be negative values show down regulated and positive values show unregulated.
You have asked people to check your data distribution but you have not even shown it...
Take a look here: How to add images to a Biostars post
Some downstream programs require data to be normally-distributed; others do not. What is your goal?, i.e., what is your downstream analysis plan? Stating 'identify cancer related genes' is not sufficient information, unfortunately.
histogram
I have obtained the histogram. and i am going to classify (identify) cancer related genes using machine learning algorithm(PSO +SVM)
Cool - thank you. It really looks like you have a lot of low count transcripts, which is distorting the data distribution. The current distribution that you have is not suited for machine learning algorithms (assuming that the algorithms that you are aiming to use expect data to be normally-distributed - you, as the analyst, need to check this).
With your raw counts, did you do any pre-filtering for low count transcripts? For example, remove transcripts whose mean raw count < 10?
Going back a bit, you should also use
normalized.lib.sizes = TRUE
and could scale up yourprior.count
to be1
:can u please tell me how to pre -filter?
He is trying, but you aren't answering his questions.
Knowing the right distribution for your data is absolutely critical for any statistical analyses. If you data looks oddly distributed (which seemingly yours does), that might be indicative of bad data, or bad upstream data processing.
You cannot feed data in to a machine learning algorithm and magically expect results you want.
Thanks @jrj.healey!
To filter, you could try this:
Out of curiosity, what is the result of
table(meanvals < 10)
when you run the above code?Table(meanvals<10 ) returns me
This is how my code looks like now
Thank you. There is the problem: You have ~40,000 genes that have low values.
I do not know what is your dataset, but it would make sense if these ~40,000 genes were non-coding RNAs or pseudogenes. The ~20,000 figure of genes with raw count >= 10 corresponds roughly to the number of protein coding genes in the human transcriptome.
You could reduce the count threshold to 5, if you want, or leave it at 10. Either way, these low count genes should be removed.
thanks a lot for your generous help
@jrj.healey please only comment if you have answer to my question..don,t state that, has already been stated.
maryam.akram001 as a moderator, it is part of my job to intervene in threads. I can, will, and do, comment wherever I please.
Your attitude is not conducive to receiving help from Kevin, or anyone else for that matter. Everyone on this site is volunteering their time, and if you hadn't noticed, its also currently the weekend. Your part, as a poster of a question, is to provide all the details needed to help resolve the issue, and make the lives of the people contributing answers as simple as possible.
Think again about how you interact on the forum, if you expect to keep receiving help...