|
|
Archive@NYU >
Stern School of Business >
IOMS: Information Systems Working Papers >
Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/14105
|
| Title: | Significance Testing Against the Random Model for Scoring Models on Top k Predictions |
| Authors: | Macskassy, Sofus |
| Issue Date: | 2005 |
| Publisher: | Stern School of Business, New York University |
| Series/Report no.: | CeDER-05-09 |
| Abstract: | Performance at top k predictions, where instances are ranked by a (learned) scoring model, has
been used as an evaluation metric in machine learning for various reasons such as where the entire
corpus is unknown (e.g., the web) or where the results are to be used by a person with limited time or
resources (e.g., ranking financial news stories where the investor only has time to look at relatively
few stories per day). This evaluation metric is primarily used to report whether the performance
of a given method is significantly better than other (baseline) methods. It has not, however, been
used to show whether the result is significant when compared to the simplest of baselines â the
random model. If no models outperform the random model at a given confidence interval, then the
results may not be worth reporting. This paper introduces a technique to perform an analysis of the
expected performance of the top k predictions from the random model given k and a p-value on an
evaluation dataset D. The technique is based on the realization that the distribution of the number
of positives seen in the top k predictions follows a hypergeometric distribution, which has welldefined
statistical density functions. As this distribution is discrete, we show that using parametric
estimations based on a binomial distribution are almost always in complete agreement with the
discrete distribution and that, if they differ, an interpolation of the discrete bounds gets very close
to the parametric estimations. The technique is demonstrated on results from three prior published
works, in which it clearly shows that even though performance is greatly increased (sometimes over
100%) with respect to the expected performance of the random model (at p = 0.5), these results,
although qualitatively impressive, are not always as significant (p = 0.1) as might be suggested
by the impressive qualitative improvements. The technique is used to show, given k, both how
many positive instances are needed to achieve a specific significance threshold is as well as how
significant a given top k performance is. The technique when used in a more global setting is able
to identify the crossover points, with respect to k, when a method becomes significant for a given
p. Lastly, the technique is used to generate a complete confidence curve, which shows a general
trend over all k and visually shows where a method is significantly better than the random model
over all values of k. |
| URI: | http://hdl.handle.net/2451/14105 |
| Appears in Collections: | CeDER Working Papers IOMS: Information Systems Working Papers
|
Items in Faculty Digital Archive are protected by copyright, with all rights reserved, unless otherwise indicated.
|