Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf

Kapan, Almazhan; Kirmizialtin, Suphan; Kukreja, Rhythm; Wrisley, David Joseph

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kapan, Almazhan	-
dc.contributor.author	Kirmizialtin, Suphan	-
dc.contributor.author	Kukreja, Rhythm	-
dc.contributor.author	Wrisley, David Joseph	-
dc.date.accessioned	2022-03-11T13:10:16Z	-
dc.date.available	2022-03-11T13:10:16Z	-
dc.date.issued	2022-03	-
dc.identifier.citation	Kapan, Almazhan, Kirmizialtin, Suphan, Kukreja, Rhythm and Wrisley, David Joseph. (2022). Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf in DHNB 2022 Conference Book of Abstracts (Uppsala, 15-18 March 2022). 69-70.	en
dc.identifier.uri	https://dhnb.eu/conferences/dhnb2022/conference-program/	-
dc.identifier.uri	http://hdl.handle.net/2451/63845	-
dc.description.abstract	Searchable, transcribed cultural heritage text collections have become an important part of digital GLAM. With the democratization of handwritten text recognition (HTR) platforms, this trend of studying and reusing more texts from archives will no doubt continue. The situation presents an ethical dilemma for computational study, however, since archival materials, particularly those of an intercultural nature or those written in metropolitan languages about the colonized world, are poorly served by text processing methods. Processes like named entity recognition (NER), which allow for the semi-automated annotation of transcribed archives, hold much promise for network and spatial analysis of the sources of the historical humanities. Yet state-of-the-art NER systems (e.g. NLTK) are generally trained on English-language corpora with a metropolitan, media focus and thus do not accurately identify non-English entities and labels. They offer little tag customization to move beyond western cultural notions of an entity, and NER is further complicated by the inconsistent transliteration practices prevalent in Orientalist scholarship. Initiatives do exist that begin to tackle the inequities within the field of digital textual analysis. For example, the ongoing #NewNLP workshop (https://newnlp.princeton.edu/about/) is fostering the development of language resources for various world languages. We would like to underscore, however, that the inequities are still present for NLP approaches to historical English corpora originating from non-metropolitan environments. Our paper reports on collaborative work attempting to close this gap for collections in English containing transliterated names coming from Arabic-speaking or adjacent Muslim cultures.
We present an approach to extracting a variety of named entities (NE) from unstructured historical datasets from open digital collections dealing with a space of the informal British empire: the nineteenth- and early twentieth-century Persian Gulf region. The sources are largely concerned with people, places, tribes and transactions in the region, yet models in state-of-the-art NER systems work with limited sets of tags, miss many of the entities of interest to the historian, and perform poorly on entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets: the correspondence ledgers of the British Colonial Residency in Bushire and the encyclopedic Gazetteer of the Persian Gulf, Central Arabia and Oman attributed to John Gordon Lorimer. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare the performance of blank, pre-trained and merged spaCy-based models and suggest further improvements. Our study makes an intervention into thinking beyond Western cultural notions of the entity in digital historical research.	en
dc.language.iso	en_US	en
dc.publisher	CDHU/Department of ALM, Uppsala University, Sweden	en
dc.rights	CC BY NC SA 4.0 International	en
dc.subject	Named Entity Recognition	en
dc.subject	Gulf Studies	en
dc.subject	Colonial Archives	en
dc.subject	Transliterated Names	en
dc.subject	custom models	en
dc.title	Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf	en
dc.type	Article	en
Appears in Collections:	David Wrisley's Collection

Files in This Item:

File	Description	Size	Format
Kapanetal_DHNB2022abstract.pdf	DHNB2022abstract	517.92 kB	Adobe PDF
