Significance Testing Against the Random Model for Scoring Models on Top k Predictions
|Publisher:||Stern School of Business, New York University|
|Abstract:||Performance at top k predictions, where instances are ranked by a (learned) scoring model, has been used as an evaluation metric in machine learning for various reasons, such as when the entire corpus is unknown (e.g., the web) or when the results are to be used by a person with limited time or resources (e.g., ranking financial news stories where the investor only has time to look at relatively few stories per day). This evaluation metric is primarily used to report whether the performance of a given method is significantly better than that of other (baseline) methods. It has not, however, been used to show whether the result is significant when compared to the simplest of baselines: the random model. If no model outperforms the random model at a given significance level, then the results may not be worth reporting. This paper introduces a technique to analyze the expected performance of the top k predictions of the random model, given k and a p-value, on an evaluation dataset D. The technique is based on the realization that the number of positives seen in the top k predictions follows a hypergeometric distribution, which has well-defined statistical density functions. As this distribution is discrete, we show that parametric estimates based on a binomial distribution are almost always in complete agreement with the discrete distribution and that, if they differ, an interpolation of the discrete bounds gets very close to the parametric estimates. The technique is demonstrated on results from three prior published works, where it clearly shows that even though performance is greatly increased (sometimes by over 100%) with respect to the expected performance of the random model (at p = 0.5), these results, although qualitatively impressive, are not always as significant (p = 0.1) as the improvements might suggest.
The technique is used to show, given k, both how many positive instances are needed to achieve a specific significance threshold and how significant a given top k performance is. Used in a more global setting, the technique can identify the crossover points, with respect to k, at which a method becomes significant for a given p. Lastly, the technique is used to generate a complete confidence curve, which shows a general trend over all k and visually indicates where a method is significantly better than the random model over all values of k.|
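The hypergeometric reasoning described in the abstract can be sketched in a few lines. Assuming an evaluation set of N instances containing P positives, the number of positives a random ranker places in its top k follows Hypergeometric(N, P, k). The following Python sketch (the function names, parameters, and example numbers are illustrative, not taken from the paper) computes the upper-tail probability of that distribution and the smallest top-k positive count that clears a significance threshold p:

```python
from math import comb

def hypergeom_tail(N, P, k, x):
    """P(X >= x), where X ~ Hypergeometric(N, P, k): the number of
    positives among k instances drawn without replacement (i.e., ranked
    at random) from N instances of which P are positive."""
    total = comb(N, k)
    return sum(comb(P, i) * comb(N - P, k - i)
               for i in range(x, min(k, P) + 1)) / total

def min_positives_for_significance(N, P, k, p):
    """Smallest number of positives in the top k that a random ranker
    would reach with probability at most p (i.e., the count needed for
    a method's top-k result to be significant at level p)."""
    for x in range(0, min(k, P) + 1):
        if hypergeom_tail(N, P, k, x) <= p:
            return x
    return None  # even a perfect top k is not significant at this p

# Illustrative numbers: 1000 instances, 100 positives, top 20, p = 0.05.
# The random model is expected to find about 2 positives in the top 20;
# min_positives_for_significance reports how many are needed to beat it.
print(min_positives_for_significance(1000, 100, 20, 0.05))
```

Because the distribution is discrete, the threshold jumps in whole instances, which is why the paper also considers a binomial (sampling-with-replacement) approximation and interpolation between the discrete bounds.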
|Appears in Collections:||CeDER Working Papers|
IOMS: Information Systems Working Papers