Title: | Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf |
Authors: | Kapan, Almazhan; Kirmizialtin, Suphan; Kukreja, Rhythm; Wrisley, David Joseph |
Keywords: | Named Entity Recognition;Gulf Studies;Colonial Archives;Transliterated Names;custom models |
Issue Date: | Mar-2022 |
Publisher: | CDHU/Department of ALM, Uppsala University, Sweden |
Citation: | Kapan, Almazhan, Kirmizialtin, Suphan, Kukreja, Rhythm and Wrisley, David Joseph. (2022). Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf in DHNB 2022 Conference Book of Abstracts (Uppsala, 15-18 March 2022). 69-70. |
Abstract: | Searchable, transcribed cultural heritage text collections have become an important part of digital GLAM. With the democratization of handwritten text recognition (HTR) platforms, this trend of studying and reusing more texts from archives will no doubt continue. The situation presents an ethical dilemma for computational study, however, since archival materials, particularly those of an intercultural nature or those written in metropolitan languages about the colonized world, are poorly served by text processing methods. Processes like named entity recognition (NER), which allow for the semi-automated annotation of transcribed archives, hold much promise for network and spatial analysis of the sources of the historical humanities. Yet state-of-the-art NER systems (e.g. NLTK) are generally trained on English-language corpora with a metropolitan, media focus and thus do not accurately identify non-English entities and labels. They offer little tag customization for moving beyond western cultural notions of an entity, and NER is further complicated by the inconsistent transliteration practices prevalent in Orientalist scholarship. Initiatives do exist that begin to tackle the inequities within the field of digital textual analysis. For example, the ongoing #NewNLP workshop (https://newnlp.princeton.edu/about/) is fostering the development of language resources for various world languages. We would like to underscore, however, that the inequities are still present for NLP approaches to English corpora that are historical in nature and originate from non-metropolitan environments. Our paper reports on collaborative work attempting to close this gap for collections in English containing transliterated names coming from Arabic-speaking or adjacent Muslim cultures.
We present an approach to extracting a variety of named entities (NE) from unstructured historical datasets from open digital collections dealing with a space of the informal British empire: the nineteenth- and early twentieth-century Persian Gulf region. The sources are largely concerned with people, places, tribes and transactions in the region, yet the models in state-of-the-art NER systems work with limited sets of tags, do not capture many of the entities of interest to the historian, and do not perform well with entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets: the correspondence ledgers of the British Colonial Residency in Bushire and the encyclopedic Gazetteer of the Persian Gulf, Central Arabia and Oman attributed to John Gordon Lorimer. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare the performance of blank, pre-trained and merged spaCy-based models and suggest further improvements. Our study makes an intervention into thinking beyond Western cultural notions of the entity in digital historical research. |
URI: | https://dhnb.eu/conferences/dhnb2022/conference-program/ http://hdl.handle.net/2451/63845 |
Rights: | CC BY-NC-SA 4.0 International |
Appears in Collections: | David Wrisley's Collection |
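The abstract mentions testing and comparing several NER models but does not say which metric is used. As an illustration only, the sketch below assumes exact-match, span-level precision, recall and F1, a common convention for NER evaluation. The `TRIBE` label, the example sentence, and both sets of "predictions" are invented for this sketch and are not results or label names from the paper.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    label: str  # entity label, e.g. "PERSON" or the hypothetical "TRIBE"


def span_of(text: str, surface: str, label: str) -> Span:
    """Locate the first occurrence of `surface` in `text` and label it."""
    start = text.index(surface)
    return Span(start, start + len(surface), label)


def score_entities(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Exact-match precision, recall and F1 over labelled entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented sentence and annotations, for illustration only.
text = "Shaikh Isa met the Bani Yas near Bushire."
gold = {
    span_of(text, "Shaikh Isa", "PERSON"),
    span_of(text, "Bani Yas", "TRIBE"),  # hypothetical custom label
    span_of(text, "Bushire", "GPE"),
}

# Hypothetical output of a generic model: truncated name, no TRIBE label.
generic_prediction = {
    span_of(text, "Isa", "PERSON"),
    span_of(text, "Bushire", "GPE"),
}

# Hypothetical output of a custom model that matches every gold span.
custom_prediction = set(gold)

generic_scores = score_entities(gold, generic_prediction)
custom_scores = score_entities(gold, custom_prediction)
```

Under exact matching, the truncated span "Isa" counts as an error even though it overlaps the gold entity, which is one way that inconsistent handling of transliterated names would show up in such scores.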
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Kapanetal_DHNB2022abstract.pdf | DHNB2022abstract | 517.92 kB | Adobe PDF |