<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>FDA Collection:</title>
    <link>http://hdl.handle.net/2451/14090</link>
    <description />
    <pubDate>Fri, 10 Apr 2026 02:32:43 GMT</pubDate>
    <dc:date>2026-04-10T02:32:43Z</dc:date>
    <item>
      <title>Hierarchical Latent Context Representation for CARS</title>
      <link>http://hdl.handle.net/2451/61218</link>
      <description>Title: Hierarchical Latent Context Representation for CARS
Authors: Unger, Moshe; Tuzhilin, Alexander
Abstract: In this paper, we propose a hierarchical representation of latent contextual information that captures contextual situations in which users are recommended particular items. We also introduce an algorithm that converts unstructured latent contextual information into structured hierarchical representations. In addition, we present two general context-aware recommendation algorithms that extend collaborative filtering (CF) approaches and utilize structured and unstructured latent contextual information. In particular, the first algorithm utilizes structured latent contexts and the second one combines the structured and the unstructured latent contextual representations. By using latent contextual information in a recommendation model, we capture and represent both the structure of the latent context in the form of a hierarchy and the values of contextual variables in the form of an unstructured vector. We tested the two proposed methods with two CF-based methods on several context-rich datasets under different experimental settings. We show that using hierarchical latent contextual representations leads to significantly better recommendations than the baselines for the datasets having high- and medium-dimensional contexts. Although this is not the case for the low-dimensional contextual data, the hybrid approach, combining structured and unstructured latent contextual information, significantly outperforms other baselines across all the experimental settings and dimensions of contextual data.</description>
      <pubDate>Wed, 01 May 2019 00:00:00 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2451/61218</guid>
      <dc:date>2019-05-01T00:00:00Z</dc:date>
    </item>
    <item>
      <title>A Quality-Aware Optimizer for Information Extraction</title>
      <link>http://hdl.handle.net/2451/25886</link>
      <description>Title: A Quality-Aware Optimizer for Information Extraction
Authors: Jain, Alpa; Ipeirotis, Panagiotis G.
Abstract: Large amounts of structured information are buried in unstructured text. Information extraction systems can extract structured relations from documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect, however, and their output has imperfect precision and recall (i.e., it contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that serve as "knobs" for tuning the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad-hoc procedure, based mainly on heuristics. In this paper, we show how to use receiver operating characteristic (ROC) curves to estimate the extraction quality in a statistically robust way, and how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the run time and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach accurately predicts the output quality and selects the fastest execution plan that satisfies the output quality restrictions.</description>
      <pubDate>Sat, 08 Mar 2008 01:24:05 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2451/25886</guid>
      <dc:date>2008-03-08T01:24:05Z</dc:date>
    </item>
    <item>
      <title>Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers</title>
      <link>http://hdl.handle.net/2451/25882</link>
      <description>Title: Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
Authors: Sheng, Victor; Provost, Foster; Ipeirotis, Panagiotis G.
Abstract: This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.</description>
      <pubDate>Thu, 06 Mar 2008 02:14:03 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2451/25882</guid>
      <dc:date>2008-03-06T02:14:03Z</dc:date>
    </item>
    <item>
      <title>Does Chatter Matter?  The Impact of User-Generated Content on Music Sales</title>
      <link>http://hdl.handle.net/2451/23783</link>
      <description>Title: Does Chatter Matter?  The Impact of User-Generated Content on Music Sales
Authors: Dhar, Vasant; Chang, Elaine
Abstract: The Internet has enabled the era of user-generated content, potentially breaking the hegemony of traditional content generators as the primary sources of “legitimate” information. Prime examples of user-generated content are blogs and social networking sites, which allow easy publishing of and access to information. In this study, we examine the usefulness of such content, consisting of data from blogs and social networking sites, in predicting sales in the music industry. We track the changes in online chatter for a sample of 108 albums for four weeks before and after their release dates. We use linear and nonlinear regression to identify the relative significance of online variables on their observation date in predicting future album unit sales two weeks ahead. Our findings are as follows: (a) the volume of blog posts about an album is positively correlated with future sales; (b) greater week-over-week increases in an artist’s Myspace friends have a weaker correlation with higher future sales; (c) traditional factors are still relevant: albums released by major labels and albums with a number of reviews from mainstream sources like Rolling Stone also tended to have higher future sales. More generally, the study provides some preliminary answers for marketing managers interested in assessing the relative importance of the burgeoning number of “Web 2.0” information metrics that are becoming available on the Internet, and shows how looking at interactions among them could provide predictive value beyond viewing them in isolation. The study also provides a framework for thinking about when user-generated content influences decision making.</description>
      <pubDate>Wed, 24 Oct 2007 14:36:47 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2451/23783</guid>
      <dc:date>2007-10-24T14:36:47Z</dc:date>
    </item>
  </channel>
</rss>