Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/29918
|
| Title: | Explaining Documents' Classifications |
| Authors: | Martens, David; Provost, Foster |
| Issue Date: | 15-Mar-2011 |
| Series/Report no.: | CeDER-11-01 |
| Abstract: | This is a design-science paper about methods for explaining data-driven
classifications of text documents. Document classification has widespread
applications, such as with web pages for advertising, emails for legal
discovery, blog entries for sentiment analysis, and many more. Document
data are characterized by very high dimensionality, often with tens of
thousands to millions of variables (words). Many applications require
human understanding of the reasons for classification decisions: by
managers, client-facing employees, and the technical team.
Unfortunately, due to the high dimensionality, understanding the
decisions made by the document classifiers is very difficult. Previous
approaches to gain insight into black-box models do not deal well with
high-dimensional data. Our main theoretical contribution is to define a
new sort of explanation, tailored to the business needs of document
classification and able to cope with the associated technical
constraints. Specifically, an explanation is defined as a set of words
(terms, more generally) such that removing all words within this set
from the document changes the predicted class from the class of
interest. We present an algorithm to find such explanations, as well as a
framework to assess such an algorithm's performance. We demonstrate the
value of the new approach with a case study from a real-world document
classification task: classifying web pages as containing adult content,
with the goal of allowing advertisers to choose not to have their ads
appear there. We present a further empirical demonstration on news-story
topic classification using the 20 Newsgroups benchmark dataset. The
results show the explanations to be concise and document-specific, and to
provide insight into the exact reasons for the classification decisions,
into the workings of the classification models, and into the business
application itself. We also illustrate how explaining documents'
classifications can help to improve data quality and model performance. |
| URI: | http://hdl.handle.net/2451/29918 |
| Appears in Collections: | CeDER Working Papers |
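The abstract defines an explanation as a set of words whose removal from a document changes the predicted class away from the class of interest. The sketch below illustrates that definition only; it is not the algorithm presented in the paper, which this record does not describe. It assumes a hypothetical linear scorer with made-up word weights (`WEIGHTS`, `BIAS`) and uses a simple greedy search that removes the highest-weighted words first. All names and values are illustrative assumptions.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy model (assumption, not from the paper): per-word weights
# for a linear scorer whose positive class is the "class of interest".
WEIGHTS: Dict[str, float] = {
    "explicit": 2.0,
    "xxx": 1.5,
    "welcome": -0.2,
    "page": 0.1,
}
BIAS = -1.0


def score(words: Set[str]) -> float:
    """Linear score over the words present; unknown words contribute 0."""
    return BIAS + sum(WEIGHTS.get(w, 0.0) for w in words)


def predict(words: Set[str]) -> bool:
    """True when the document is assigned to the class of interest."""
    return score(words) > 0.0


def explain(words: Set[str]) -> Optional[List[str]]:
    """Greedy sketch: remove the most positively weighted words one at a
    time until the predicted class changes. Returns the removed words, or
    None if the document is not in the class of interest or no set of
    positively weighted words flips the prediction."""
    if not predict(words):
        return None
    remaining = set(words)
    removed: List[str] = []
    ranked = sorted(words, key=lambda w: WEIGHTS.get(w, 0.0), reverse=True)
    for w in ranked:
        if WEIGHTS.get(w, 0.0) <= 0.0:
            break  # removing non-positive words cannot lower the score
        remaining.discard(w)
        removed.append(w)
        if not predict(remaining):
            return removed
    return None


document = {"welcome", "page", "explicit", "xxx"}
explanation = explain(document)
```

For this toy document, removing "explicit" alone leaves the score positive, so the greedy search also removes "xxx" before the prediction changes. With a linear model and a highest-weight-first ordering, the returned set is small, but this sketch makes no claim about minimality for other kinds of classifiers.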
All items in Faculty Digital Archive are protected by copyright, with all rights reserved.
|