Archive for June, 2017

Head of Digital Research at the National Archives

Wednesday, June 21st, 2017

Posted for Olivia Pinto (National Archives, Kew, UK):

Job Opportunity at The National Archives

Head of Digital Research

About the role

The National Archives has set itself the ambition of becoming a digital archive by instinct and design. The digital strategy takes this forward through the notion of a disruptive archive, which positively reimagines established archival practice and develops new ways of solving core digital challenges. You will develop a research programme to progress this vision and to answer key questions for TNA and the archives sector around digital archival practice and delivery. You will understand and navigate the funding landscape, identifying key funders (RCUK and others) and building relationships at a senior level to articulate priorities around digital archiving, whilst taking a key role in coordinating digitally focused research bids. You will also build key collaborative relationships with academic partners and undertake horizon scanning of the research landscape, tracking and engaging with relevant research projects nationally and internationally. Finally, you will recognise the importance of developing an evidence base for our research into digital archiving and will lead on the development of methods for measuring impact.

About you

As someone who will be mentoring and managing a team of researchers, as well as leading on digital programming across the organisation, you’ll need to be a natural at inspiring and engaging the people you work with. You will also have the confidence to engage broadly with external stakeholders and partners. Your background and knowledge of digital research, relevant in the context of a memory institution such as The National Archives, will gain you the respect you need to deliver an inspiring digital research programme. You will combine strategic leadership with a solid understanding of the digital research landscape, as well as of the tools and technologies that will underpin the development of a digital research programme. You will come with a strong track record in digital research, a doctorate in a discipline relevant to our digital research agenda, and demonstrable experience of relationship development at a senior level with the academic and research sectors.

Join us here in beautiful Kew, just a 10-minute walk from the Overground and Underground stations, and you can expect an excellent range of benefits. They include a pension, flexible working and childcare vouchers, as well as discounts with local businesses. We also offer well-being resources (e.g. onsite therapists) and have an on-site gym, restaurant, shop and staff bar.

To apply please follow the link: https://www.civilservicejobs.service.gov.uk/csr/jobs.cgi?jcode=1543657

Salary: £41,970

Closing date: Wednesday 28th June 2017

Pleiades sprint on Pompeian buildings

Tuesday, June 20th, 2017

Casa della Statuetta Indiana, Pompei.

On Monday the 26th of June, from 15:00 to 17:00 BST, Pleiades is organising an editing sprint to create additional URIs for Pompeian buildings, looking in particular at those located in Regio I, Insula 8.

Participants will meet remotely on the Pleiades IRC chat. Providing monument-specific IDs will enable a more efficient and granular use and organisation of Linked Open Data related to Pompeii, and will support the work of digital projects such as the Ancient Graffiti Project.

Everyone is welcome to join, but a Pleiades account is required to edit the online gazetteer.

OEDUc: EDH and Pelagios NER working group

Monday, June 19th, 2017

Participants: Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta

Report: https://github.com/EpiDoc/OEDUc/wiki/EDH-and-Pelagios-NER

The EDH and Pelagios NER working group was part of the Open Epigraphic Data Unconference held on 15 May 2017. Our aim was to use Named Entity Recognition (NER) on the text of inscriptions from the Epigraphic Database Heidelberg (EDH) to identify placenames, which could then be linked to their equivalent terms in the Pleiades gazetteer and thereby integrated with Pelagios Commons.

Data about each inscription, along with the inscription text itself, is stored in a separate XML file. In order to perform NER, we therefore first had to extract the inscription text from each XML file (contained within <ab></ab> tags), then strip out any markup to leave plain text. There are various Python libraries for processing XML, but most of these turned out to be too complex for what we were trying to do, or simply returned the identifier of the <ab> element rather than the text it contained.

Eventually, we found the Python library Beautiful Soup, which converts an XML document to structured text, from which you can identify your desired element, then strip out the markup to convert the contents of this element to plain text. It is a very simple and elegant solution with only eight lines of code to extract and convert the inscription text from one specific file. The next step is to create a script that will automatically iterate through all files in a particular folder, producing a directory of new files that contain only the plain text of the inscriptions.
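A minimal sketch of what that combined script might look like follows (this is not our exact code; the folder and file names are hypothetical, and it assumes beautifulsoup4 with the lxml XML parser installed):

import glob
import os

from bs4 import BeautifulSoup

os.makedirs("plaintext", exist_ok=True)    # hypothetical output folder

for path in glob.glob("xml/*.xml"):        # hypothetical folder of EDH XML files
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")
    ab = soup.find("ab")                   # the element containing the inscription text
    if ab is None:
        continue                           # skip files without an <ab> element
    text = ab.get_text(" ", strip=True)    # strip out all markup, keep the words
    name = os.path.basename(path).replace(".xml", ".txt")
    with open(os.path.join("plaintext", name), "w", encoding="utf-8") as out:
        out.write(text)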

Once we have a plain text file for each inscription, we can begin the process of named entity extraction. We decided to follow the methods and instructions shown in the two Sunoikisis DC classes on Named Entity Extraction:

https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-I

https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-II

Here is a short outline of the steps this process might involve when it is carried out in the future (a minimal sketch of the baseline step follows the outline):

  1. Extraction
    1. Split text into tokens, make a python list
    2. Create a baseline
      1. cycle through each token of the text
      2. if the token starts with a capital letter it’s a named entity (only one type, i.e. Entity)
    3. Classical Language Toolkit (CLTK)
      1. for each token in a text, the tagger checks whether that token is contained within a predefined list of possible named entities
      2. Compare to baseline
    4. Natural Language Toolkit (NLTK)
      1. Stanford NER Tagger for Italian works well with Latin
      2. Differentiates between kinds of entities (place, person, organisation or none of the above), making it more granular than CLTK
      3. Compare to both baseline and CLTK lists
  2. Classification
    1. Part-Of-Speech (POS) tagging – a precondition for any other advanced operation on a text; provides information on the word class (noun, verb, etc.); e.g. with TreeTagger
    2. Chunking – sub-dividing a section of text into phrases and/or meaningful constituents (which may include 1 or more text tokens); export to IOB notation
    3. Computing entity frequency
  3. Disambiguation
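As promised above, here is a minimal sketch of the capital-letter baseline (step 1.2), using a hypothetical inscription text; the CLTK and NLTK taggers would replace this naive heuristic:

text = "Dis Manibus Claudiae Piae coniugi carissimae Titus fecit"   # hypothetical example

tokens = text.split()                                # 1.1: tokenise into a Python list
entities = [t for t in tokens if t[0].isupper()]     # 1.2: capitalised token => named entity
print(entities)   # ['Dis', 'Manibus', 'Claudiae', 'Piae', 'Titus']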

Although we didn’t make as much progress as we would have liked, we have achieved our aim of creating a script to prepare individual files for NER processing, and have therefore laid the groundwork for future developments in this area. We hope to build on this work to successfully apply NER to the inscription texts in the EDH in order to make them more widely accessible to researchers and to facilitate their connection to other, similar resources, like Pelagios.

OEDUc: Images and Image metadata working group

Tuesday, June 13th, 2017

Participants: Sarah Middle, Angie Lumezeanu, Simona Stoyanova
Report: https://github.com/EpiDoc/OEDUc/wiki/Images-and-image-metadata

The Images and Image Metadata working group met at the London meeting of the Open Epigraphic Data Unconference on 15 May 2017, and discussed the issues of copyright, metadata formats, image extraction and licence transparency in the Epigraphik Fotothek Heidelberg, the database which contains images and metadata relating to nearly forty thousand Roman inscriptions from collections around the world. Were the EDH to lose its funding and the website its support, one of the biggest and most useful digital epigraphy projects would start to disintegrate. While its data is available for download, its usability would be greatly compromised. Thus, this working group focused on issues pertaining to the EDH image collection. The materials we worked with are the JPG images as seen on the website, and the image metadata files which are available as XML and JSON data dumps on the EDH data download page.

The EDH Photographic Database index page states: “The digital image material of the Photographic Database is with a few exceptions directly accessible. Hitherto it had been the policy that pictures with unclear utilization rights were presented only as thumbnail images. In 2012 as a result of ever increasing requests from the scientific community and with the support of the Heidelberg Academy of the Sciences this policy has been changed. The approval of the institutions which house the monuments and their inscriptions is assumed for the non commercial use for research purposes (otherwise permission should be sought). Rights beyond those just mentioned may not be assumed and require special permission of the photographer and the museum.”

During a discussion with Frank Grieshaber we found out that the information in this paragraph is available only on this webpage, with no individual licence details in the metadata records of the images, either in the XML or the JSON data dumps. It would be useful for this information to be included in the records, though it is not clear how to accomplish this efficiently for each photograph, since all photographers would need to be contacted first. Currently, the rights information in the XML records reads “Rights Reserved – Free Access on Epigraphischen Fotothek Heidelberg”, which presumably points to the “research purposes” part of the statement on the EDH website.

All other components of the EDH – inscriptions, bibliography, geography and people RDF – have been released under a Creative Commons Attribution-ShareAlike 3.0 Unported licence, which allows for their reuse and repurposing, thus ensuring their sustainability. The images, however, will be the first thing to disappear once the project ends. With unclear licensing and the impossibility of contacting every single photographer, some of whom are no longer alive and others of whom might not wish to waive their rights, data reuse becomes particularly problematic.

One possible way of establishing the copyright status of individual images is to check the reciprocal links to the photographic archives of the partner institutions which provided the images, and then read through their licence information. However, these links are only visible in the HTML records and are not present in the XML.

Given that the image metadata in the XML files is relatively detailed and already in place, we decided to focus on the task of image extraction for research purposes, which is covered by the general licensing of the EDH image databank. We prepared a Python script for the batch download of the entire image databank, available on the OEDUc GitHub repo. Each image has a unique identifier, which is the same as its filename and the final string of its URL. This means that when an inscription has more than one photograph, each one has its own record and URI, which allows for complete coverage and efficient harvesting. The images are numbered sequentially, and in the case of a missing image the process skips that entry and continues on to the next one. Since the databank includes some 37,530 images, the script pauses for 30 seconds after every 200 files to avoid a timeout. We don’t have access to the high-resolution TIFF images, so the script downloads the JPGs referenced in the HTML records.
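For illustration, here is a minimal sketch of such a downloader (the actual script is in the OEDUc GitHub repo; the URL pattern and zero-padding below are assumptions made for the example):

import time

import requests

URL = "https://edh-www.adw.uni-heidelberg.de/fotos/F{:06d}.jpg"   # hypothetical pattern

downloaded = 0
for n in range(1, 37531):                  # sequential image identifiers
    r = requests.get(URL.format(n))
    if r.status_code != 200:
        continue                           # missing image: skip and move on
    with open("F{:06d}.jpg".format(n), "wb") as f:
        f.write(r.content)
    downloaded += 1
    if downloaded % 200 == 0:
        time.sleep(30)                     # pause after every 200 files to avoid a timeout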

The EDH images included in the EAGLE MediaWiki are all under an open licence and link back to the EDH databank. A task for the future will be to compare the two lists, to get a sense of the EAGLE coverage of EDH images and to feed their licensing information back into the EDH image records. One issue is the lack of file-naming conventions in EAGLE, where some photographs carry a publication citation (CIL_III_14216,_8.JPG, AE_1957,_266_1.JPG), others a random name (DR_11.jpg), and others a descriptive filename which may contain an EDH reference (Roman_Inscription_in_Aleppo,_Museum,_Syria_(EDH_-_F009848).jpeg). Matching these to the EDH databank will have to be done by cross-referencing the publication citations either in the filename or in the image record.
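For the descriptive filenames that do embed an EDH reference, a simple pattern match could recover the F-number directly; a rough sketch, using the example filename quoted above:

import re

# Files without an EDH reference would need matching via the publication citation instead.
fname = "Roman_Inscription_in_Aleppo,_Museum,_Syria_(EDH_-_F009848).jpeg"
match = re.search(r"EDH_-_(F\d+)", fname)
if match:
    print(match.group(1))                  # -> F009848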

A further future task could be to embed the image metadata into the image itself. The EAGLE MediaWiki images already have Exif data (added automatically by the camera), but it might be useful to add descriptive and copyright information internally, following the IPTC data set standard (e.g. title, subject, photographer, rights). This would help bring the inscription file, image record and image itself back together in the event of data scattering after the end of the project. Currently, links exist between the inscription files and the image records. Embedding at least the HD number of the inscription directly into the image metadata would allow us to gradually bring the resources back together, following changes in copyright and licensing.
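As a sketch of what such embedding could look like, the widely available ExifTool (which must be installed separately) can write IPTC fields; all field values below are hypothetical placeholders, not our finalised metadata plan:

import subprocess

subprocess.run([
    "exiftool",
    "-IPTC:ObjectName=HD000001",                       # e.g. the EDH HD number
    "-IPTC:Caption-Abstract=Roman inscription (EDH F000001)",
    "-IPTC:By-line=Photographer name",
    "-IPTC:CopyrightNotice=Rights reserved; see the EDH licence statement",
    "F000001.jpg",                                     # hypothetical filename
], check=True)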

Out of the three tasks we set out to discuss, one turned out to be impractical, one we accomplished and published the code for, and one remains to be worked on in the future. Ascertaining the copyright status of all images is practically impossible, so all future experiments will be done on the EDH images in the EAGLE MediaWiki. The script for extracting JPGs from the HTML is available on the OEDUc GitHub repo. We have drafted a plan for embedding metadata into the images, following the IPTC standard.

Open Epigraphic Data Unconference report

Wednesday, June 7th, 2017

Last month, a dozen or so scholars met in London (and were joined by a similar number via remote video-conference) to discuss and work on the open data produced by the Epigraphic Database Heidelberg. (See call and description.)

Over the course of the day seven working groups were formed; two completed their briefs within the day, while the other five will lead to ongoing work and discussion. Fuller reports from the individual groups will follow shortly; in the meantime, here is a short summary of the activities, along with links to the pages in the wiki of the OEDUc GitHub repository.

Useful links:

  1. All interested colleagues are welcome to join the discussion group: https://groups.google.com/forum/#!forum/oeduc
  2. Code, documentation, and other notes are collected in the GitHub repository: https://github.com/EpiDoc/OEDUc

1. Disambiguating EDH person RDF
(Gabriel Bodard, Núria García Casacuberta, Tom Gheldof, Rada Varga)
We discussed and broadly specced out a couple of steps in the process for disambiguating PIR references for inscriptions in EDH that contain multiple personal names, for linking together person references that cite the same PIR entry, and for using Trismegistos data to further disambiguate EDH persons. We haven’t written any actual code to implement this yet, but we expect a few Python scripts would do the trick.

2. Epigraphic ontology
(Hugh Cayless, Paula Granados, Tim Hill, Thomas Kollatz, Franco Luciani, Emilia Mataix, Orla Murphy, Charlotte Tupman, Valeria Vitale, Franziska Weise)
This group discussed the various ontologies available for encoding epigraphic information (LAWDI, Nomisma, EAGLE Vocabularies) and ideas for filling the gaps between them. This is a long-standing desideratum of the EpiDoc community, and will be an ongoing discussion (perhaps the most important of the workshop).

3. Images and image metadata
(Angie Lumezeanu, Sarah Middle, Simona Stoyanova)
This group attempted to track down copyright information on images in EDH (too complicated, but EAGLE may have more of this), wrote scripts to download images and metadata (scripts in GitHub), and explored the possibility of embedding metadata in the images in IPTC format (in progress).

4. EDH and SNAP:DRGN mapping
(Rada Varga, Scott Vanderbilt, Gabriel Bodard, Tim Hill, Hugh Cayless, Elli Mylonas, Franziska Weise, Frank Grieshaber)
In this group we reviewed the status of the SNAP:DRGN recommendations for person-data in RDF, and then looked in detail at the person list exported from the EDH data. A list of suggestions for improving this data was produced for EDH to consider. This task was considered complete. (Although Frank may have feedback or questions for us later.)

5. EDH and Pelagios NER
(Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta, Thomas Kollatz)
This group explored the possibility of running machine named entity extraction on the Latin texts of the EDH inscriptions, in two stages: extracting plain text from the XML (code in Github); applying CLTK/NLTK scripts to identify entities (in progress).

6. EDH and Pelagios location disambiguation
(Paula Granados, Valeria Vitale, Franco Luciani, Angie Lumezeanu, Thomas Kollatz, Hugh Cayless, Tim Hill)
This group aimed to work on disambiguating location information in the EDH data export, for example by making links between Geonames place identifiers, TMGeo places, Wikidata and Pleiades identifiers, via the Pelagios gazetteer or other linking mechanisms. A pathway for resolving these was identified, but work is still ongoing.

7. Exist-db mashup application
(Pietro Liuzzo)
This task, which Dr Liuzzo carried out alone, since his network connection didn’t allow him to join any of the discussion groups on the day, was to create an implementation of existing code for displaying and editing epigraphic editions (using Exist-db, Leiden+, etc.) and to offer a demonstration interface by which the EDH data could be served up to the public, with contributions and improvements invited. (A preview of “epigraphy.info”, perhaps?)