9.5 years ago
Deepak Tanwar
I read some presentations and papers regarding smoothing techniques.
- Smoothing N-gram Language models
- An Empirical Study of Smoothing Techniques for Language Modeling
- N-gram models
- Improved Smoothing for N-gram Language Models Based on Ordinary Counts
- Smoothing Language Models
- NLP Lunch Tutorial: Smoothing
- Language models
I want to apply smoothing to data that contains zero values. Which technique would be best?
This is just an example:
              Pathway1  Pathway2  Pathway3  Pathway4
Calcium ions  0         3         1         0
ATP           2         1         0         7
Sorry Deepak, I don't really understand - smoothing in my mind is something you do to continuous data, like a time series or genomic data. Your example of pathways is categorical data, in that Pathway2 doesn't really come before Pathway3 or after Pathway1, they are just categories.
So what do you want this data to look like?
Ultimately, the best smoothing algorithm is the one that is well described/understood to anyone who has to look at the result :)
Although it would never stand up in any other aspect of science, too often when it comes to smoothing of data or intersection of genomic coordinates, you see "then we did [stuff we're not even going to detail in the appendix] - and the result was [bold claims of novel biology]".
John, it was just an example; I am not going to do anything with pathways. I can't disclose what I want to do. I have already used the Good-Turing estimate and Witten-Bell smoothing. To explain further: suppose you have a list of 30 people and a list of 500 software tools. You create a table with the people's names as columns and the software tools' names as rows, and you fill each cell with the number of times that person used that software in the last 10 years:
I want to replace the 0 counts. One way is Laplace smoothing by adding 1 to each value.
I hope I made it clear.
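For reference, the Laplace (add-one) idea mentioned above could be sketched roughly as follows. This is only a minimal illustration using the toy table from the question. The function name `laplace_smooth` and the `alpha` parameter are made up for the example, and normalising each row into probabilities is an assumption; if only the raw counts matter, the final division can be dropped.

```python
# Toy counts copied from the example table in the question
# (rows = items, columns = Pathway1..Pathway4).
counts = {
    "Calcium ions": [0, 3, 1, 0],
    "ATP": [2, 1, 0, 7],
}


def laplace_smooth(row, alpha=1.0):
    """Add-alpha smoothing: add alpha to every count, then renormalise.

    alpha=1.0 gives classic Laplace (add-one) smoothing.
    Normalising per row is an assumption made for this sketch.
    """
    total = sum(row) + alpha * len(row)
    return [(c + alpha) / total for c in row]


smoothed = {name: laplace_smooth(row) for name, row in counts.items()}

for name, probs in smoothed.items():
    print(name, [round(p, 3) for p in probs])
```

After smoothing, no cell is zero and each row sums to 1. Note that add-one smoothing shifts a lot of mass towards unseen cells when rows are sparse (as in a 500 x 30 table), so a smaller `alpha` may be worth trying.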