Explaining Documents' Classifications

Martens, David; Provost, Foster

Full metadata record

DC Field	Value	Language
dc.contributor.author	Martens, David	-
dc.contributor.author	Provost, Foster	-
dc.date.accessioned	2011-03-15T15:16:34Z	-
dc.date.available	2011-03-15T15:16:34Z	-
dc.date.issued	2011-03-15T15:16:34Z	-
dc.identifier.uri	http://hdl.handle.net/2451/29918	-
dc.description.abstract	This is a design-science paper about methods for explaining data-driven classifications of text documents. Document classification has widespread applications, such as with web pages for advertising, emails for legal discovery, blog entries for sentiment analysis, and many more. Document data are characterized by very high dimensionality, often with tens of thousands to millions of variables (words). Many applications require human understanding of the reasons for classification decisions: by managers, client-facing employees, and the technical team. Unfortunately, due to the high dimensionality, understanding the decisions made by the document classifiers is very difficult. Previous approaches to gain insight into black-box models do not deal well with high-dimensional data. Our main theoretical contribution is to define a new sort of explanation, tailored to the business needs of document classification and able to cope with the associated technical constraints. Specifically, an explanation is defined as a set of words (terms, more generally) such that removing all words within this set from the document changes the predicted class from the class of interest. We present an algorithm to find such explanations, as well as a framework to assess such an algorithm's performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing adult content, with the goal of allowing advertisers to choose not to have their ads appear there. We present a further empirical demonstration on news-story topic classification using the 20 Newsgroups benchmark dataset. The results show the explanations to be concise and document-specific, and to provide insight into the exact reasons for the classification decisions, into the workings of the classification models, and into the business application itself. We also illustrate how explaining documents' classifications can help to improve data quality and model performance.	en
dc.description.sponsorship	NYU Stern School of Business, University of Antwerp	en
dc.language.iso	en_US	en
dc.relation.ispartofseries	CeDER-11-01	-
dc.title	Explaining Documents' Classifications	en
dc.type	Working Paper	en
dc.authorid-ssrn	691208	en
Appears in Collections:	CeDER Working Papers
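
The abstract defines an explanation as a set of words whose removal from a document changes the predicted class away from the class of interest. The sketch below illustrates that definition only. It is not the authors' algorithm, which this record does not describe. It assumes a toy linear classifier with made-up word weights (WEIGHTS, BIAS) and uses a simple greedy search; the names score, predict, and explain are hypothetical.

```python
from typing import Callable, List, Optional, Set

# Hypothetical word weights for a toy linear classifier (not from the paper).
# A positive score means the document is assigned to the class of interest.
WEIGHTS = {"casino": 2.0, "poker": 1.5, "bonus": 0.8, "news": -1.0, "weather": -0.5}
BIAS = -1.0


def score(words: Set[str]) -> float:
    """Linear score over the set of words present in the document."""
    return BIAS + sum(WEIGHTS.get(w, 0.0) for w in words)


def predict(words: Set[str]) -> int:
    """Return 1 for the class of interest, 0 otherwise."""
    return 1 if score(words) > 0 else 0


def explain(words: Set[str],
            score_fn: Callable[[Set[str]], float],
            max_size: Optional[int] = None) -> Optional[List[str]]:
    """Greedily remove words until the predicted class changes.

    Returns the removed words (an explanation in the abstract's sense), or
    None if the document is not in the class of interest or no explanation
    is found within max_size removals. Greedy search is an assumption made
    here for brevity; it does not guarantee the smallest possible set.
    """
    remaining = set(words)
    if score_fn(remaining) <= 0:
        return None
    limit = len(remaining) if max_size is None else max_size
    removed: List[str] = []
    while remaining and len(removed) < limit:
        # Pick the word whose removal lowers the score the most
        # (ties broken alphabetically so the result is deterministic).
        best = min(remaining, key=lambda w: (score_fn(remaining - {w}), w))
        remaining.remove(best)
        removed.append(best)
        if score_fn(remaining) <= 0:
            return removed
    return None


doc = {"casino", "poker", "bonus", "news"}
explanation = explain(doc, score)
print(explanation)
```

With these assumed weights, removing the returned words from doc moves the toy classifier's prediction from 1 to 0, which is the property the abstract's definition requires. A real document classifier would replace score with its own scoring function.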
