Duplicate Record Detection: A Survey

Elmagarmid, Ahmed; Ipeirotis, Panagiotis; Verykios, Vassilios

Full metadata record

DC Field	Value	Language
dc.contributor.author	Elmagarmid, Ahmed	-
dc.contributor.author	Ipeirotis, Panagiotis	-
dc.contributor.author	Verykios, Vassilios	-
dc.date.accessioned	2008-12-09T16:00:13Z	-
dc.date.available	2008-12-09T16:00:13Z	-
dc.date.issued	2007-01	-
dc.identifier.citation	IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 19, no. 1, January 2007	en
dc.identifier.uri	http://hdl.handle.net/2451/27823	-
dc.description.abstract	Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.	en
dc.description.sponsorship	NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research	en
dc.format.extent	358724 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en_US	en
dc.publisher	IEEE	en
dc.relation.ispartofseries	CeDER-PP-2007-15	en
dc.subject	duplicate detection	en
dc.subject	data cleaning	en
dc.subject	data integration	en
dc.subject	record linkage	en
dc.subject	instance identification	en
dc.subject	database hardening	en
dc.subject	name matching	en
dc.subject	identity uncertainty	en
dc.subject	entity resolution	en
dc.subject	fuzzy duplicate detection	en
dc.subject	entity matching	en
dc.title	Duplicate Record Detection: A Survey	en
dc.type	Article	en
Appears in Collections:	CeDER Published Papers

Files in This Item:

File	Description	Size	Format
CeDER-PP-2007-15.pdf		350.32 kB	Adobe PDF
