OEDUc: EDH and Pelagios NER working group

Monday, June 19th, 2017

Participants:  Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta


The EDH and Pelagios NER working group was part of the Open Epigraphic Data Unconference held on 15 May 2017. Our aim was to use Named Entity Recognition (NER) on the text of inscriptions from the Epigraphic Database Heidelberg (EDH) to identify placenames, which could then be linked to their equivalent terms in the Pleiades gazetteer and thereby integrated with Pelagios Commons.

Data about each inscription, along with the inscription text itself, is stored in one XML file per inscription. In order to perform NER, we therefore first had to extract the inscription text, which is contained within <ab></ab> tags, from each XML file, then strip out any markup from the inscription to leave plain text. There are various Python libraries for processing XML, but most of these turned out to be too complex for what we were trying to do, or simply returned the identifier of the <ab> element rather than the text it contained.

Eventually, we found the Python library Beautiful Soup, which parses an XML document into a navigable tree, from which you can select your desired element, then strip out the markup to convert the contents of this element to plain text. It is a very simple and elegant solution: only eight lines of code are needed to extract and convert the inscription text from one specific file. The next step is to create a script that will automatically iterate through all files in a particular folder, producing a directory of new files that contain only the plain text of the inscriptions.
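As an illustration of the extraction and folder-iteration steps described above, here is a minimal sketch. It is not the script written at the unconference: it uses Python's standard-library xml.etree.ElementTree rather than Beautiful Soup, so that it runs without extra installs. The function names, the sample XML and the folder arguments are all hypothetical, and it assumes each file has one <ab> element holding the text.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical, heavily simplified EpiDoc-style sample (not a real EDH record)
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="edition"><ab>'
    '<expan><abbr>D</abbr><ex>is</ex></expan> '
    '<expan><abbr>M</abbr><ex>anibus</ex></expan><lb n="2"/> Iulia'
    '</ab></div></body></text></TEI>'
)


def inscription_text(xml_source):
    """Return the plain text of the first <ab> element, or None if absent."""
    root = ET.fromstring(xml_source)
    # "{*}" matches any namespace (Python 3.8+), so TEI-namespaced files work
    ab = root.find(".//{*}ab")
    if ab is None:
        return None
    # itertext() yields only text nodes, which drops all markup;
    # split/join collapses line breaks and repeated spaces
    return " ".join("".join(ab.itertext()).split())


def convert_folder(source_dir, target_dir):
    """Write one .txt file per .xml file; return the names written."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_path in sorted(Path(source_dir).glob("*.xml")):
        text = inscription_text(xml_path.read_bytes())
        if text is not None:
            out_path = target / (xml_path.stem + ".txt")
            out_path.write_text(text, encoding="utf-8")
            written.append(out_path.name)
    return written


print(inscription_text(SAMPLE))  # Dis Manibus Iulia
```

Calling convert_folder with a folder of XML files and an output folder would produce the directory of plain text files mentioned above. Files without an <ab> element are skipped rather than written out empty.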

Once we have a plain text file for each inscription, we can begin the process of named entity extraction. We decided to follow the methods and instructions shown in the two Sunoikisis DC classes on Named Entity Extraction.

Here is a short outline of the steps this might involve when it is done in the future.

  1. Extraction
    1. Split text into tokens, make a Python list
    2. Create a baseline
      1. cycle through each token of the text
      2. if the token starts with a capital letter it’s a named entity (only one type, i.e. Entity)
    3. Classical Language Toolkit (CLTK)
      1. for each token in a text, the tagger checks whether that token is contained within a predefined list of possible named entities
      2. Compare to baseline
    4. Natural Language Toolkit (NLTK)
      1. Stanford NER Tagger for Italian works well with Latin
      2. Differentiates between different kinds of entities: place, person, organization or none of the above, more granular than CLTK
      3. Compare to both baseline and CLTK lists
  2. Classification
    1. Part-Of-Speech (POS) tagging – precondition before you can perform any other advanced operation on a text, information on the word class (noun, verb etc.); TreeTagger
    2. Chunking – sub-dividing a section of text into phrases and/or meaningful constituents (which may include 1 or more text tokens); export to IOB notation
    3. Computing entity frequency
  3. Disambiguation
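
To make the outline more concrete, the sketch below works through the tokenising, baseline, comparison, IOB export and frequency steps in plain Python. It deliberately does not call CLTK, NLTK or TreeTagger: the small KNOWN_NAMES set is a hypothetical stand-in for a tagger's predefined list of names, and the function names and sample text are invented for illustration. Treating every run of adjacent matches as one entity is also a simplification.

```python
import re
from collections import Counter

# Hypothetical stand-in for a tagger's predefined list of named entities
KNOWN_NAMES = {"Iulia", "Secunda", "Romae"}


def tokenize(text):
    """Split text into a Python list of word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def baseline_entities(tokens):
    """Baseline: every token starting with a capital letter is an Entity."""
    return [t for t in tokens if t[:1].isupper()]


def lookup_entities(tokens, known_names):
    """List lookup: keep only tokens found in the predefined name list."""
    return [t for t in tokens if t in known_names]


def to_iob(tokens, known_names):
    """Tag tokens in IOB notation with a single type, Entity.

    B- marks the first token of an entity, I- a continuation, O anything else.
    """
    tagged = []
    inside = False
    for token in tokens:
        if token in known_names:
            tagged.append((token, "I-Entity" if inside else "B-Entity"))
            inside = True
        else:
            tagged.append((token, "O"))
            inside = False
    return tagged


def entity_frequency(entities):
    """Count how often each entity string occurs."""
    return Counter(entities)


tokens = tokenize("Dis Manibus Iulia Secunda vixit annos XX Romae")
baseline = baseline_entities(tokens)
looked_up = lookup_entities(tokens, KNOWN_NAMES)

# Comparing to the baseline: capitalised tokens the name list did not confirm
only_in_baseline = [t for t in baseline if t not in looked_up]
print(only_in_baseline)  # ['Dis', 'Manibus', 'XX']

for token, tag in to_iob(tokens, KNOWN_NAMES):
    print(token, tag)
```

The comparison shows why the baseline is only a starting point: in an inscription, formulae and numerals are often capitalised, so the capital-letter rule overgenerates.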

Although we didn’t make as much progress as we would have liked, we have achieved our aim of creating a script to prepare individual files for NER processing, and have therefore laid the groundwork for future developments in this area. We hope to build on this work to successfully apply NER to the inscription texts in the EDH in order to make them more widely accessible to researchers and to facilitate their connection to other, similar resources, like Pelagios.

Summer School in Digital Humanities (Sep 2016, Hissar, Bulgaria)

Thursday, March 3rd, 2016

The Centre for Excellence in the Humanities at the University of Sofia, Bulgaria, is organizing a Summer School in Digital Humanities jointly with an international team of lecturers and researchers in the field. The Summer School will take place from 5 to 10 September 2016 and is aimed at historians, archaeologists, classical scholars, philologists, museum and conservation workers, linguists, researchers in translation and reception studies, specialists in cultural heritage and cultural management, textual critics and other humanities scholars with little to moderate IT skills who would like to enhance their competences. The Summer School will provide four introductory modules on the following topics:

  • Text encoding and interchange by Gabriel Bodard, University of London, and Simona Stoyanova, King’s College London: TEI, EpiDoc XML, marking up of epigraphic monuments, authority lists, linked open data for toponymy and prosopography: SNAP:DRGN, Pelagios, Pleiades.
  • Text and image annotation and alignment by Simona Stoyanova, King’s College London, and Polina Yordanova, University of Sofia: SoSOL Perseids tools, Arethusa grammatical annotation and treebanking of texts, Alpheios text and translation alignment, text/image alignment tools.
  • Geographical Information Systems and Neogeography by Maria Baramova, University of Sofia, and Valeria Vitale, King’s College London: Historical GIS, interactive map layers with historical information, using GeoNames and geospatial data, Recogito tool for Pelagios.
  • 3D Imaging and Modelling for Cultural Heritage by Valeria Vitale, King’s College London: photogrammetry, digital modelling of indoor and outdoor objects of cultural heritage, Meshmixer, Sketchup and others.

The school is open to applications from MA and PhD students, postdoctoral and early-career researchers from all humanities disciplines, and employees in the field of cultural heritage. Applicants should send a CV and a motivation statement clarifying their specific needs and expressing interest in one or more of the modules no later than 15 May 2016. Places are limited, and applicants will be notified of the outcome within 10 working days after the application deadline. Transfer from Sofia to Hissar and back, accommodation and meal expenses during the Summer School are covered by the organizers. Five scholarships of 250 euro will be awarded by the organizing committee to the participants whose work and motivation are deemed the most relevant and important.

The participation fee is 40 euro. It covers coffee breaks, social programme and materials for the participants.

Please submit your applications to

Assoc. Prof. Dimitar Birov (Department of Informatics, University of Sofia)
Dr. Maria Baramova (Department of Balkan History, University of Sofia)
Dr. Dimitar Iliev (Department of Classics, University of Sofia)
Mirela Hadjieva (Centre for Excellence in the Humanities, University of Sofia)
Dobromir Dobrev (Centre for Excellence in the Humanities, University of Sofia)
Kristina Ferdinandova (Centre for Excellence in the Humanities, University of Sofia)