
Fine-Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Persian Gulf

Authors: Kapan, Almazhan
Kirmizialtin, Suphan
Kukreja, Rhythm
Wrisley, David Joseph
Keywords: Named Entity Recognition, Gulf Studies, Colonial Archives, Persian Gulf, spaCy, Transliterated Names
Issue Date: 6-Oct-2022
Publisher: CEUR Workshop Proceedings
Citation: Kapan, Almazhan, Suphan Kirmizialtin, Rhythm Kukreja and David Joseph Wrisley. (2022). Fine-Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Persian Gulf. Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022) Uppsala, Sweden, March 15-18, 2022. Eds. Karl Berglund, Matti La Mela and Inge Zwart. 288-296.
Abstract: Text recognition technologies increase access to global archives and make possible their computational study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach to extracting a variety of named entities (NE) in unstructured historical datasets from open digital collections dealing with a space of informal British empire: the Persian Gulf region. The sources are largely concerned with people, places and tribes as well as economic and diplomatic transactions in the region. Since models in state-of-the-art NER systems function with limited tag sets and are generally trained on English-language media, they struggle to capture entities of interest to the historian and do not perform well with entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare performance of the blank, pre-trained and merged spaCy-based models, suggesting further improvements. Our study makes an intervention into thinking beyond Western notions of the entity in digital historical research by creating more inclusive models using non-metropolitan corpora in English.
ISSN: 1613-0073
Rights: Creative Commons License Attribution 4.0 International (CC BY 4.0)
Appears in Collections: David Wrisley's Collection
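
The abstract mentions comparing the blank, pre-trained and merged spaCy-based models. One common way to make such a comparison is entity-level precision, recall and F1 over exact-match spans. The sketch below shows that calculation in plain Python. It is not the authors' evaluation code: the function name `entity_prf`, the label names `PERSON`, `TRIBE` and `GPE`, and the example spans are all hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# A span is (start_char, end_char, label). Labels here are hypothetical.
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall and F1.

    A predicted span counts as correct only if its start offset,
    end offset and label all match a gold span.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations for one sentence (not data from the paper).
gold_spans = {(0, 13, "PERSON"), (21, 30, "TRIBE"), (39, 46, "GPE")}

# Invented output of a model that finds the person, mislabels the
# tribe as a place, and misses the third entity entirely.
predicted_spans = {(0, 13, "PERSON"), (21, 30, "GPE")}

precision, recall, f1 = entity_prf(gold_spans, predicted_spans)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Running the same function over the output of each model on a shared held-out set gives directly comparable scores. Exact matching is strict by design: a tribe name tagged with the wrong label counts as both a false positive and a false negative.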

Files in This Item:
File                          Description  Size       Format
Kapanetal_FineTuningNER.pdf   DHNB_2022    342.52 kB  Adobe PDF

Items in FDA are protected by copyright, with all rights reserved, unless otherwise indicated.