Full metadata record
DC Field | Value | Language
dc.contributor.author | Elmagarmid, Ahmed | -
dc.contributor.author | Ipeirotis, Panagiotis G. | -
dc.contributor.author | Verykios, Vassilios | -
dc.date.accessioned | 2006-06-21T17:23:58Z | -
dc.date.available | 2006-06-21T17:23:58Z | -
dc.date.issued | 2006-09-06 | -
dc.identifier.uri | http://hdl.handle.net/2451/14760 | -
dc.description.abstract | Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area. | en
dc.format.extent | 252066 bytes | -
dc.format.mimetype | application/pdf | -
dc.language | English | EN
dc.publisher | Stern School of Business, New York University | en
dc.relation.ispartofseries | CeDER-06-05 | en
dc.subject | entity resolution | en
dc.subject | duplicate detection | en
dc.subject | record matching | en
dc.subject | record linkage | en
dc.subject | instance identification | en
dc.subject | deduplication | en
dc.subject | merge-purge | en
dc.subject | coreference resolution | en
dc.subject | database hardening | en
dc.title | Duplicate Record Detection: A Survey | en
dc.type | Working Paper | en
dc.description.series | Information Systems Working Papers Series | EN
Appears in Collections: CeDER Working Papers
                        IOMS: Information Systems Working Papers

Files in This Item:
File | Description | Size | Format
tkde2007.pdf |  | 350.32 kB | Adobe PDF


Items in FDA are protected by copyright, with all rights reserved, unless otherwise indicated.