Explaining Data-Driven Document Classifications

Martens, David; Provost, Foster

Full metadata record

DC Field	Value	Language
dc.contributor.author	Martens, David	-
dc.contributor.author	Provost, Foster	-
dc.date.accessioned	2013-06-19T14:34:42Z	-
dc.date.available	2013-06-19T14:34:42Z	-
dc.date.issued	2013-06-19	-
dc.identifier.uri	http://hdl.handle.net/2451/31831	-
dc.description.abstract	Many document classification applications require human understanding of the reasons for data-driven classification decisions: by managers, client-facing employees, and the technical team. Predictive models treat documents as data to be classified, and document data are characterized by very high dimensionality, often with tens of thousands to millions of variables (words). Unfortunately, due to the high dimensionality, understanding the decisions made by document classifiers is very difficult. This paper begins by extending the most relevant prior theoretical model of explanations for intelligent systems to account for some missing elements. The main theoretical contribution of the work is the definition of a new sort of explanation as a minimal set of words (terms, more generally), such that removing all words within this set from the document changes the predicted class from the class of interest. We present an algorithm to find such explanations, as well as a framework to assess such an algorithm’s performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing objectionable content, with the goal of allowing advertisers to choose not to have their ads appear there. A second empirical demonstration on news-story topic classification uses the 20 Newsgroups benchmark dataset. The results show the explanations to be concise and document-specific, and to be capable of providing better understanding of the exact reasons for the classification decisions, of the workings of the classification models, and of the business application itself. We also illustrate how explaining documents’ classifications can help to improve data quality and model performance.	en_US
dc.relation.ispartofseries	CBA-13-02;	-
dc.subject	Document Classification, Instance Level Explanation, Text mining, Comprehensibility	en_US
dc.title	Explaining Data-Driven Document Classifications	en_US
Appears in Collections:	Center for Business Analytics Working Papers
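
The abstract above defines an explanation as a minimal set of words whose removal from the document changes the predicted class. The following is a minimal sketch of that idea, not the authors' algorithm: it assumes a toy linear bag-of-words classifier and searches word subsets exhaustively by increasing size. The names WEIGHTS, BIAS, predict, and find_explanation, and all weight values, are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical word weights for a toy linear classifier (not taken from the paper).
WEIGHTS = {"explicit": 2.0, "adult": 1.5, "casino": 0.4, "news": -0.6, "weather": -0.5}
BIAS = -1.0


def predict(words):
    """Return 1 (the class of interest) if the linear score is positive, else 0."""
    score = BIAS + sum(WEIGHTS.get(w, 0.0) for w in set(words))
    return 1 if score > 0 else 0


def find_explanation(words, max_size=5):
    """Try word subsets in order of increasing size and return the first
    subset whose removal changes the predicted class, or None if no subset
    of at most max_size words does so."""
    vocab = sorted(set(words))
    original = predict(vocab)
    for size in range(1, min(max_size, len(vocab)) + 1):
        for subset in combinations(vocab, size):
            remaining = [w for w in vocab if w not in subset]
            if predict(remaining) != original:
                return set(subset)
    return None


doc = ["explicit", "adult", "casino", "news"]
explanation = find_explanation(doc)
print(explanation)
```

Because subsets are tried smallest first, the returned set is minimal: no smaller set of words flips the prediction. Exhaustive search is only feasible for short documents; with vocabularies of tens of thousands of words, some heuristic ordering of candidate words would be needed.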

Files in This Item:

File	Description	Size	Format
Provost 2_13.02.pdf		608.53 kB	Adobe PDF
