Archive for the ‘Open Source’ Category

OEDUc: EDH and Pelagios NER working group

Monday, June 19th, 2017

Participants:  Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta

Report: https://github.com/EpiDoc/OEDUc/wiki/EDH-and-Pelagios-NER

The EDH and Pelagios NER working group was part of the Open Epigraphic Data Unconference held on 15 May 2017. Our aim was to use Named Entity Recognition (NER) on the text of inscriptions from the Epigraphic Database Heidelberg (EDH) to identify placenames, which could then be linked to their equivalent terms in the Pleiades gazetteer and thereby integrated with Pelagios Commons.

Data about each inscription, along with the inscription text itself, is stored in one XML file per inscription. In order to perform NER, we therefore first had to extract the inscription text from each XML file (contained within <ab></ab> tags), then strip out any markup from the inscription to leave plain text. There are various Python libraries for processing XML, but most of these turned out to be a bit too complex for what we were trying to do, or simply returned the identifier of the <ab> element rather than the text it contained.

Eventually, we found the Python library Beautiful Soup, which converts an XML document to structured text, from which you can identify your desired element, then strip out the markup to convert the contents of this element to plain text. It is a very simple and elegant solution with only eight lines of code to extract and convert the inscription text from one specific file. The next step is to create a script that will automatically iterate through all files in a particular folder, producing a directory of new files that contain only the plain text of the inscriptions.
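
By way of illustration, a minimal sketch of such an extraction script might look like the following; the folder names are assumptions, and this is not the exact code written at the unconference.

```python
# A minimal sketch (not the working group's published script) of extracting
# the inscription text from a folder of EDH EpiDoc XML files with Beautiful
# Soup 4. Folder names are illustrative assumptions; the lxml parser is assumed.
from pathlib import Path
from bs4 import BeautifulSoup

source_dir = Path("edh_xml")        # hypothetical folder of EDH XML files
target_dir = Path("edh_plaintext")  # hypothetical output folder
target_dir.mkdir(exist_ok=True)

for xml_file in source_dir.glob("*.xml"):
    with xml_file.open(encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")
    ab = soup.find("ab")            # the element containing the inscription text
    if ab is None:
        continue
    # get_text() strips all markup, leaving only the plain inscription text
    plain = ab.get_text(" ", strip=True)
    (target_dir / (xml_file.stem + ".txt")).write_text(plain, encoding="utf-8")
```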

Once we have a plain text file for each inscription, we can begin the process of named entity extraction. We decided to follow the methods and instructions shown in the two Sunoikisis DC classes on Named Entity Extraction:

https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-I

https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-II

Here is a short outline of the steps this process might involve when it is taken forward in the future (a minimal sketch of the baseline step follows the outline).

  1. Extraction
    1. Split text into tokens, make a python list
    2. Create a baseline
      1. cycle through each token of the text
      2. if the token starts with a capital letter it’s a named entity (only one type, i.e. Entity)
    3. Classical Language Toolkit (CLTK)
      1. for each token in a text, the tagger checks whether that token is contained within a predefined list of possible named entities
      2. Compare to baseline
    4. Natural Language Toolkit (NLTK)
      1. Stanford NER Tagger for Italian works well with Latin
      2. Differentiates between different kinds of entities: place, person, organization or none of the above, more granular than CLTK
      3. Compare to both baseline and CLTK lists
  2. Classification
    1. Part-Of-Speech (POS) tagging – a precondition before you can perform any other advanced operation on a text; provides information on the word class (noun, verb, etc.); TreeTagger
    2. Chunking – sub-dividing a section of text into phrases and/or meaningful constituents (which may include 1 or more text tokens); export to IOB notation
    3. Computing entity frequency
  3. Disambiguation
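
By way of illustration, here is a minimal sketch of the baseline step from the outline above, assuming NLTK for tokenisation; the sample text is a placeholder inscription, not code or data produced at the unconference. The lists produced by the CLTK and Stanford NER taggers in steps 1.3 and 1.4 would then be compared against this baseline.

```python
# A minimal sketch of the baseline described in the outline: every token that
# starts with a capital letter is flagged as a named entity of one generic type.
# Assumes NLTK is installed; the sample text is a placeholder.
from nltk.tokenize import wordpunct_tokenize

def baseline_entities(text):
    """Return the tokens a naive capitalisation baseline flags as entities."""
    return [tok for tok in wordpunct_tokenize(text) if tok[:1].isupper()]

sample = "Dis Manibus Tito Flavio Augustae liberto"   # hypothetical plain-text inscription
print(baseline_entities(sample))    # ['Dis', 'Manibus', 'Tito', 'Flavio', 'Augustae']
```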

Although we didn’t make as much progress as we would have liked, we have achieved our aim of creating a script to prepare individual files for NER processing, and have therefore laid the groundwork for future developments in this area. We hope to build on this work to successfully apply NER to the inscription texts in the EDH in order to make them more widely accessible to researchers and to facilitate their connection to other, similar resources, like Pelagios.

OEDUc: Images and Image metadata working group

Tuesday, June 13th, 2017

Participants: Sarah Middle, Angie Lumezeanu, Simona Stoyanova
Report: https://github.com/EpiDoc/OEDUc/wiki/Images-and-image-metadata

 

The Images and Image Metadata working group met at the London meeting of the Open Epigraphic Data Unconference on 15 May 2017, and discussed the issues of copyright, metadata formats, image extraction and licence transparency in the Epigraphik Fotothek Heidelberg, the database which contains images and metadata relating to nearly forty thousand Roman inscriptions from collections around the world. Were the EDH to lose its funding and the website its support, one of the biggest and most useful digital epigraphy projects would start to disintegrate. While its data is available for download, its usability would be greatly compromised. Thus, this working group focused on issues pertaining to the EDH image collection. The materials we worked with are the JPG images as seen on the website, and the image metadata files which are available as XML and JSON data dumps on the EDH data download page.

The EDH Photographic Database index page states: “The digital image material of the Photographic Database is with a few exceptions directly accessible. Hitherto it had been the policy that pictures with unclear utilization rights were presented only as thumbnail images. In 2012 as a result of ever increasing requests from the scientific community and with the support of the Heidelberg Academy of the Sciences this policy has been changed. The approval of the institutions which house the monuments and their inscriptions is assumed for the non commercial use for research purposes (otherwise permission should be sought). Rights beyond those just mentioned may not be assumed and require special permission of the photographer and the museum.”

During a discussion with Frank Grieshaber we found out that the information in this paragraph is only available on this webpage, with no individual licence details in the metadata records of the images, either in the XML or the JSON data dumps. It would be useful for this information to be included in the records, though it is not clear how to accomplish this efficiently for each photograph, since all photographers would need to be contacted first. Currently, the rights information in the XML records says “Rights Reserved – Free Access on Epigraphischen Fotothek Heidelberg”, which presumably points to the “research purposes” part of the statement on the EDH website.

All other components of EDH – inscriptions, bibliography, geography and people RDF – have been released under the Creative Commons Attribution-ShareAlike 3.0 Unported licence, which allows for their reuse and repurposing, thus ensuring their sustainability. The images, however, will be the first thing to disappear once the project ends. With unclear licensing and the impossibility of contacting every single photographer, some of whom are no longer alive and others of whom might not wish to waive their rights, data reuse becomes particularly problematic.

One possible way of figuring out the copyright of individual images is to check the reciprocal links to the photographic archive of the partner institutions who provided the images, and then read through their own licence information. However, these links are only visible from the HTML and not present in the XML records.

Given that the image metadata in the XML files is relatively detailed and already in place, we decided to focus on the task of image extraction for research purposes, which is covered by the general licensing of the EDH image databank. We prepared a Python script for batch download of the entire image databank, available on the OEDUc GitHub repo. Each image has a unique identifier which is the same as its filename and the final string of its URL. This means that when an inscription has more than one photograph, each one has its own record and URI, which allows for complete coverage and efficient harvesting. The images are numbered sequentially, and in the case of a missing image, the script skips that entry and continues on to the next one. Since the databank includes more than 37,500 images, the script pauses for 30 seconds after every 200 files to avoid a timeout. We don’t have access to the high-resolution TIFF images, so this script downloads the JPGs from the HTML records.
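
By way of illustration, here is a minimal sketch of the batch-download logic just described; it is not the script published in the OEDUc repo, and the URL pattern, filename scheme and ID range are assumptions for illustration only.

```python
# A minimal sketch of the batch-download logic described above, not the
# script in the OEDUc GitHub repo. The URL pattern, filename scheme and
# ID range are illustrative assumptions.
import time
import requests

BASE_URL = "https://edh-www.adw.uni-heidelberg.de/fotos/F{:06d}.JPG"  # hypothetical pattern

downloaded = 0
for n in range(1, 37531):                 # sequential image identifiers
    resp = requests.get(BASE_URL.format(n))
    if resp.status_code != 200:           # missing image: skip and continue
        continue
    with open("F{:06d}.JPG".format(n), "wb") as out:
        out.write(resp.content)
    downloaded += 1
    if downloaded % 200 == 0:             # pause after every 200 files
        time.sleep(30)                    # to avoid a timeout
```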

The EDH images included in the EAGLE MediaWiki are all under an open licence and link back to the EDH databank. A task for the future will be to compare the two lists to get a sense of the EAGLE coverage of EDH images and feed their licensing information back into the EDH image records. One issue is the lack of file-naming conventions in EAGLE, where some photographs carry a publication citation (CIL_III_14216,_8.JPG, AE_1957,_266_1.JPG), others a random name (DR_11.jpg), and others a descriptive filename which may contain an EDH reference (Roman_Inscription_in_Aleppo,_Museum,_Syria_(EDH_-_F009848).jpeg). Matching these to the EDH databank will have to be done by cross-referencing the publication citations either in the filename or in the image record.

A further future task could be to embed the image metadata into the image itself. The EAGLE MediaWiki images already have Exif data (added automatically by the camera), but it might be useful to add descriptive and copyright information internally, following the IPTC data set standard (e.g. title, subject, photographer, rights, etc.). This will help bring the inscription file, image record and image itself back together in the event of data scattering after the end of the project. Currently, links exist between the inscription files and the image records. Embedding at least the HD number of the inscription directly into the image metadata will allow us to gradually bring the resources back together, following changes in copyright and licensing.
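
As a rough illustration of that embedding step, the following sketch shells out to ExifTool (which would need to be installed); the filename, HD number and field values are placeholders rather than part of the drafted plan.

```python
# A minimal sketch of embedding IPTC fields into one EDH image, assuming
# ExifTool is installed. Filename, HD number and values are placeholders.
import subprocess

subprocess.run(
    [
        "exiftool",
        "-IPTC:Headline=EDH HD000001",                    # hypothetical HD number
        "-IPTC:Caption-Abstract=Roman inscription (EDH photographic database)",
        "-IPTC:CopyrightNotice=Rights reserved - free access for research purposes",
        "-IPTC:By-line=Photographer name",                # placeholder photographer credit
        "F000001.JPG",                                    # hypothetical image file
    ],
    check=True,
)
```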

Out of the three tasks we set out to discuss, one turned out to be impractical and unfeasible, one we accomplished and published the code for, and one remains to be worked on in the future. Ascertaining the copyright status of all images is physically impossible, so all future experiments will be done on the EDH images in EAGLE MediaWiki. The script for extracting JPGs from the HTML is available on the OEDUc GitHub repo. We have drafted a plan for embedding metadata into the images, following the IPTC standard.

Open Epigraphic Data Unconference report

Wednesday, June 7th, 2017

Last month, a dozen or so scholars met in London (and were joined by a similar number via remote video-conference) to discuss and work on the open data produced by the Epigraphic Database Heidelberg. (See call and description.)

Over the course of the day seven working groups were formed, two of which completed their briefs within the day, but the other five will lead to ongoing work and discussion. Fuller reports from the individual groups will follow here shortly, but here is a short summary of the activities, along with links to the pages in the Wiki of the OEDUc Github repository.

Useful links:

  1. All interested colleagues are welcome to join the discussion group: https://groups.google.com/forum/#!forum/oeduc
  2. Code, documentation, and other notes are collected in the Github repository: https://github.com/EpiDoc/OEDUc

1. Disambiguating EDH person RDF
(Gabriel Bodard, Núria García Casacuberta, Tom Gheldof, Rada Varga)
We discussed and broadly specced out a couple of steps in the process for disambiguating PIR references for inscriptions in EDH that contain multiple personal names, for linking together person references that cite the same PIR entry, and for using Trismegistos data to further disambiguate EDH persons. We haven’t written any actual code to implement this yet, but we expect a few Python scripts would do the trick.

2. Epigraphic ontology
(Hugh Cayless, Paula Granados, Tim Hill, Thomas Kollatz, Franco Luciani, Emilia Mataix, Orla Murphy, Charlotte Tupman, Valeria Vitale, Franziska Weise)
This group discussed the various ontologies available for encoding epigraphic information (LAWDI, Nomisma, EAGLE Vocabularies) and ideas for filling the gaps between them. This is a long-standing desideratum of the EpiDoc community, and will be an ongoing discussion (perhaps the most important of the workshop).

3. Images and image metadata
(Angie Lumezeanu, Sarah Middle, Simona Stoyanova)
This group attempted to write scripts to track down copyright information on images in EDH (too complicated, but EAGLE may have more of this), download images and metadata (scripts in Github), and explored the possibility of embedding metadata in the images in IPTC format (in progress).

4. EDH and SNAP:DRGN mapping
(Rada Varga, Scott Vanderbilt, Gabriel Bodard, Tim Hill, Hugh Cayless, Elli Mylonas, Franziska Weise, Frank Grieshaber)
In this group we reviewed the status of the SNAP:DRGN recommendations for person-data in RDF, and then looked in detail at the person list exported from the EDH data. A list of suggestions for improving this data was produced for EDH to consider. This task was considered to be complete. (Although Frank may have feedback or questions for us later.)

5. EDH and Pelagios NER
(Orla Murphy, Sarah Middle, Simona Stoyanova, Núria Garcia Casacuberta, Thomas Kollatz)
This group explored the possibility of running machine named entity extraction on the Latin texts of the EDH inscriptions, in two stages: extracting plain text from the XML (code in Github); applying CLTK/NLTK scripts to identify entities (in progress).

6. EDH and Pelagios location disambiguation
(Paula Granados, Valeria Vitale, Franco Luciani, Angie Lumezeanu, Thomas Kollatz, Hugh Cayless, Tim Hill)
This group aimed to work on disambiguating location information in the EDH data export, for example making links between Geonames place identifiers, TMGeo places, Wikidata and Pleiades identifiers, via the Pelagios gazetteer or other linking mechanisms. A pathway for resolving was identified, but work is still ongoing.

7. Exist-db mashup application
(Pietro Liuzzo)
This task, which Dr Liuzzo carried out alone, since his network connection didn’t allow him to join any of the discussion groups on the day, was to create an implementation of existing code for displaying and editing epigraphic editions (using Exist-db, Leiden+, etc.) and offer a demonstration interface by which the EDH data could be served up to the public and contributions and improvements invited. (A preview “epigraphy.info” perhaps?)

Open Epigraphic Data Unconference, London, May 15, 2017

Tuesday, May 2nd, 2017

Open Epigraphic Data Unconference
10:00–17:00, May 15, 2017, Institute of Classical Studies

This one-day workshop, or “unconference,” brings together scholars, historians and data scientists with a shared interest in classical epigraphic data. The event involves no speakers or set programme of presentations, but rather a loose agenda, to be further refined in advance or on the day, which is to use, exploit, transform and “mash-up” with other sources the Open Data recently made available by the Epigraphic Database Heidelberg under a Creative Commons license. Both present and remote participants with programming and data-processing experience, and those with an interest in discussing and planning data manipulation and aggregation at a higher level, are welcomed.

Places at the event in London are limited; please contact <gabriel.bodard@sas.ac.uk> if you would like to register to attend.

There will also be a Google Hangout opened on the day, for participants who are not able to attend in person. We hope this event will only be the beginning of a longer conversation and project to exploit and disseminate this invaluable epigraphic dataset.

Cataloguing Open Access Classics Serials

Friday, March 17th, 2017

The Institute of Classical Studies is pleased to announce the appointment of Simona Stoyanova for one year as a new Research Fellow in Library and Information Science on the Cataloguing Open Access Classics Serials (COACS) project, funded by a development grant from the School of Advanced Study.

COACS will leverage various sites that list or index open access (OA) publications, especially journals and serials, in classics and ancient history, so as to produce a resource that subject libraries may use to automatically catalogue the publications and articles therein. The project is based in the ICS, supervised by the Reader in Digital Classics, Gabriel Bodard, and the Combined Library, with the support of Paul Jackson and Sue Willetts. Other digital librarians and scholars including Richard Gartner and Raphaële Mouren in the Warburg Institute; Patrick Burns and Tom Elliott from the Institute for the Study of the Ancient World (NYU); Charles Jones from Penn State; and Matteo Romanello from the German Archaeological Institute are providing further advice.

Major stages of work will include:

  1. Survey of AWOL: We shall assess the regularity of metadata in the open access journals listed at AWOL (which currently lists 1521 OA periodicals, containing a little over 50,000 articles), and estimate what proportion of these titles expose metadata in standard formats that would enable harvesting in a form amenable to import into library catalogues (a minimal harvesting sketch follows this list). A certain amount of iteration and even manual curation of data is likely to be necessary. The intermediate dataset will need to be updated and incremented over time, rather than overwritten entirely on each import.
  2. Intermediate data format: We will also decide on the intermediate format (containing MARC data), which in addition to being ingested by the Combined Library will be made available for use by other libraries (e.g. NYU Library and the German Archaeological Institute’s Zenon catalogue). The addition of catalogued OA serials and articles to the library catalogue will significantly contribute to the research practice of scholars and other library users, enabling new research outputs from the Institute and enhancing the open access policy of the School.
  3. Further OA indexes: Once the proof-of-concept is in place, and data is being harvested from AWOL (and tested that they update rather than overwriting or duplicating pre-existing titles), we shall experiment with harvesting similar data from other indexes of OA content, such as DOAJ, OLH, Persée, DialNet, TOCS-IN, and perhaps even institutional repositories.
  4. Publish open access software: All code for harvesting OA serials and articles, and for ingest by library catalogues will be made available through Github. This code will then be available for updating the intermediate data to take advantage of new titles that are added to AWOL and other resources, and new issues of serials that are already listed. This will enable reuse of our scripts and data by other libraries and similar institutions.
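
By way of illustration only, here is a minimal sketch of testing whether a single OA journal exposes harvestable metadata over OAI-PMH, assuming the Python Sickle library; the endpoint URL is a placeholder, not a title surveyed by COACS, and OAI-PMH is just one of the standard formats such a survey might check for.

```python
# A minimal sketch (an assumption, not COACS code): harvest Dublin Core
# records from one hypothetical OAI-PMH endpoint using the Sickle library.
from sickle import Sickle

ENDPOINT = "https://example-oa-journal.org/oai"   # placeholder endpoint

sickle = Sickle(ENDPOINT)
for record in sickle.ListRecords(metadataPrefix="oai_dc"):
    md = record.metadata                          # Dublin Core fields as a dict of lists
    title = md.get("title", ["<no title>"])[0]
    creators = "; ".join(md.get("creator", []))
    print(title, "|", creators)
```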

By the end of the pilot project, we will have: made available and documented the intermediate dataset and harvesting and ingest code; performed a test ingest of the data into the ICS library catalogue; engaged known (NYU, Zenon, BL) and newly discovered colleagues in potentially adding to and using this data; explored the possibility of seeking external funding to take this project further.

We consider this project to be a pilot for further work, for which we intend to seek external funding once a proof of concept is in place. We hope to be able to build on this first phase of work by: extending the methods to other disciplines, especially those covered by the other institute libraries in SAS; enabling the harvest of full text from serials whose licences permit it, for search and other textual research such as text-mining and natural language processing; and disambiguating and enhancing the internal and external bibliographical references to enable hyperlinks to primary and secondary sources where available.

BL Labs roadshow at King’s Digital Laboratory

Friday, February 12th, 2016

BL Labs Roadshow Event: Experimenting with British Library’s Digital Content and Data for your research

King’s Digital Laboratory (KDL) is excited to announce we will be hosting a British Library Labs (BL Labs) roadshow event at King’s College London on 14 March 2016. The roadshow is an opportunity for King’s staff and students to gain an overview of the British Library’s digital resources from the BL Labs team, and brainstorm ideas for research outputs and digital products. The workshop will showcase the British Library’s digital content and data, addressing some of the challenges and issues of working with it, and showing how interesting and exciting projects from researchers, artists, and entrepreneurs have been developed via the annual British Library Labs Competition and Awards.

No technical ability is required and staff and students from all disciplines are warmly encouraged to attend. Guest speakers and both KDL and BL Labs staff will be present to help you explore your ideas, and develop them into project ideas and funding proposals.

When: Monday 14th March 2016, 1000-1630
Where: River Room (King’s College London, Strand)
Info: https://kdl-bl-labs-roadshow2016.eventbrite.co.uk

Programme

10:00  Registration and Coffee
10:30  Introduction and Overview of King’s Digital Lab (Dr. James Smithies, Director, King’s Digital Lab)
11:00  Getting Medieval, Getting Palaeography: Using DigiPal to Study Medieval Script and Image (Dr. Stewart Brookes, Research Associate, DDH)
11:30  Digital Research and Digitisation at the British Library (Rossitza Atanassova, Digital Curator at the British Library)
12:00  British Library Labs (Mahendra Mahey, Project Manager of British Library Labs)
12:20  Overview of projects that have used the British Library’s Digital Content and data (Ben O’Steen, Technical Lead of British Library Labs)
13:00  Lunch
14:00  News data at the British Library (Luke McKernan, Lead Curator News & Moving Image Collections, British Library)
14:30  Examination of British Library data and previous Labs ideas
14:45  Ideas Lab
16:00  Pitching ideas to the panel
16:30  Finish

Please note that capacity is limited. For further information and registration please follow this link: https://kdl-bl-labs-roadshow2016.eventbrite.co.uk

Harpokration Online

Thursday, February 11th, 2016

Posted for Joshua Sosin:

About eight months ago we announced a lightweight tool to support collaborative translation of Harpokration—we called it ‘Harpokration On Line.’ See: https://blogs.library.duke.edu/dcthree/2015/05/26/harpokration-on-line. Well, we took our time (Mack finished a dissertation, John made serious progress on his, Josh did his first 24+ hour bike ride), and as of this afternoon there is at least one rough translation (in some cases more than one) for every entry. http://dcthree.github.io/harpokration.
We had help from others; I mention especially Chris de Lisle, whom we have never met, but who invested considerable effort, for which all should be grateful! And many special thanks to Matthew Farmer (U Missouri) who signed on at the moment when our to-do pile contained mainly entries that we had back-burnered, while we chewed through the easier ones!
So, we are done, but far from done. Now begins the process of correcting errors and infelicities, of which there will be many; adding new features to the tool (e.g. commentary, easy linking out to related digital resources such as Jacoby Online or Pleiades, enhanced encoding in the Greek and features built atop that, perhaps eventual reconciliation of text with Keaney as warranted). This is just a start really.
For next year we (Sosin & Duke Collaboratory for Classics Computing) plan a course at Duke in which the students will (1) start translating their way through Photios’ Lexicon in similar fashion and (2) work with Ryan Baumann and Hugh Cayless of the DC3 to help design and implement expanded features for the translation tool. We will welcome collaborators on that effort as well!
So, here again, please feel free to log in, fix, add, correct, disagree and so on. Please note that we do handle login via Google; so, if that is a deal-breaker for you, we apologize. We have a rough workaround for that and would be happy to test it out with a few folks, if any should wish.
Matthew C. Farmer (farmermc@missouri.edu)
John P. Aldrup-MacDonald (john.smith.macdonald@duke.edu)
Mackenzie Zalin (mack.zalin@duke.edu)

Reflecting on our (first ever) Digital Classicist Wiki Sprint

Wednesday, July 16th, 2014

From (Print) Encyclopedia to (Digital) Wiki

According to Denis Diderot and Jean le Rond d’Alembert the purpose of an encyclopedia in the 18th century was ‘to collect knowledge disseminated around the globe; to set forth its general system to the people with whom we live, and transmit it to those who will come after us, so that the work of preceding centuries will not become useless to the centuries to come’.  Encyclopedias have existed for around 2,000 years; the oldest is in fact a classical text, Naturalis Historia, written ca 77 CE by Pliny the Elder.

Following the (recent) digitisation of raw data, new, digital forms of the encyclopedia have emerged. In our very own, digital era, a wiki is a wider, electronic encyclopedia that is open to contributions and edits by interested parties. It contains concept analyses, images, media, and so on, and it is freely available, thus making the creation, recording, and dissemination of knowledge a democratised process, open to everyone who wishes to contribute.

 

A Sprint for Digital Classicists

For us, Digital Classicists, scholars and students interested in the application of humanities computing to research in the ancient and Byzantine worlds, the Digital Classicist Wiki is composed and edited by a hub of scholars and students. This wiki collects guidelines and suggestions on major technical issues, and catalogues digital projects and tools of relevance to classicists. The wiki also lists events, bibliographies and publications (print and electronic), and other developments in the field. A discussion group serves as grist for a list of FAQs. As members of the community provide answers and other suggestions, some of these may evolve into independent wiki articles providing work-in-progress guidelines and reports. The scope of the wiki follows the interests and expertise of collaborators in general, and of the editors in particular. The Digital Classicist is hosted by the Department of Digital Humanities at King’s College London, and the Stoa Consortium at the University of Kentucky.

So how did we end up editing this massive piece of work? On Tuesday July 1, 2014, at around 16:00 GMT (or 17:00 CET), a group of interested parties gathered on several digital platforms. The idea was that most of the action would take place in the DigiClass chatroom on IRC, our very own channel called #digiclass. Alongside the traditional chat window, there was also a Skype voice call to get us started and discuss approaches before editing. On the side, we had a GoogleDoc where people simultaneously added what they thought should be improved or created. I was very excited to interact with members old and new. It was a fun break during my mini trip to the Netherlands and, as it proved, very much in keeping with the general attitude of the Digital Classicist team: knowledge is open to everyone who wishes to learn, and can be the outcome of a joyful collaborative process.

 

The Technology Factor

As a researcher of digital history, and I suppose most information systems scholars would agree, technology is never neutral in the process of ‘making’. The magic of the wiki consists in the fact that it is a rather simple platform that can be easily tweaked. All users were invited to edit any page or to create new pages within the wiki website, using only a regular web browser without any extra add-ons. A wiki makes page link creation easy by showing whether an intended target page exists or not. A wiki enables communities to write documents collaboratively, using a simple markup language and a web browser. A single page in a wiki website is referred to as a wiki page, while the entire collection of pages, which are usually well interconnected by hyperlinks, is ‘the wiki’. A wiki is essentially a database for creating, browsing, and searching through information. A wiki allows non-linear, evolving, complex and networked text, argument and interaction. Edits can be made in real time and appear almost instantly online. This can facilitate abuse of the system. Private wiki servers (such as the Digital Classicist one) require user identification to edit pages, thus making the process somewhat mildly controlled. Most importantly, as researchers of the digital we understood in practice that a wiki is not a carefully crafted site for casual visitors. Instead, it seeks to involve the visitor in an ongoing process of creation and collaboration that constantly changes the website landscape.

 

Where Technology Shapes the Future of Humanities

In terms of human resources, some people with little previous involvement in the Digital Classicist community got themselves involved in several tasks, including correcting pages, suggesting new projects, adding pages to the wiki, helping others with information and background, and approaching project owners and leaders to suggest adding or improving information. Collaboration, a practice usually reserved for science scholars, made the process easier and intellectually stimulating. Moreover, within these overt cyber-spaces of ubiquitous interaction one could identify a strong sense of productive diversity within our own scholarly community, visible both in the IRC chat channel and over Skype. Different accents and spellings, British and American English, and several continental scholars all came together to push forward this incredibly fast-paced process. There was a need to address research projects, categories, and tools found in non-English-speaking academic cultures. As a consequence of this multivocal procedure, more interesting questions arose, not least methodological ones: ‘What projects are defined as digital, really?’, ‘Isn’t everything a database?’, ‘What is a prototype?’, ‘Shouldn’t there be a special category for dissertations, or visualisations?’. The beauty of collaboration in all its glory, plus expanding our horizons with technology! And so much fun!

MediaWiki recorded almost 250 changes made on 1 July 2014!

The best news, however, is that this first ever wiki sprint was not the last. In the words of the organisers, Gabriel Bodard and Simon Mahony,

‘We have recently started a programme of short intensive work-sprints to
improve the content of the Digital Classicist Wiki
(http://wiki.digitalclassicist.org/). A small group of us this week made
about 250 edits in a couple of hours in the afternoon, and added dozens
of new projects, tools, and other information pages.

We would like to invite other members of the Digital Classicist community to
join us for future “sprints” of this kind, which will be held on the
first Tuesday of every month, at 16h00 London time (usually =17:00
Central Europe; =11:00 Eastern US).

To take part in a sprint:

1. Join us in the DigiClass chatroom (instructions at
<http://wiki.digitalclassicist.org/DigiClass_IRC_Channel>) during the
scheduled slot, and we’ll decide what to do there;

2. You will need an account on the Wiki–if you don’t already have one,
please email one of the admins to be invited;

3. You do not need to have taken part before, or to come along every
month; occasional contributors are most welcome!’

The next few sprints are scheduled for:
* August 5th
* September 2nd
* October 7th
* November 4th
* December 2nd

Please, do join us, whenever you can!

 

 

TEI Hackathon workshop at DH2014 (July 7)

Tuesday, March 25th, 2014

Call for Participation

We are inviting applications to participate in the TEI Hackathon full day workshop that will be held on July 7, 2014, as a pre-conference session at DH2014 (http://dh2014.org/).

Digital humanists, librarians, publishers, and many others use the Text Encoding Initiative (TEI) Guidelines to mark up electronic texts, and over time have created a critical mass of XML — some conforming to known subsets of the TEI Guidelines, some to individual customizations; in some cases intricate and dense, in others lean and expedient; some enriched with extensive external  metadata, others with details marked explicitly in the text. The fruits of this labor are most often destined for display online or on paper (!), indexing, and more rarely, visualisation. Techniques of processing this markup beyond display and indexing are less well-understood and not accessible to the broad community of users, however, and programmers sometimes regard TEI XML as over-complex and hard to process.

What We’ll Do

The goal of the hackathon is to make significant progress on a few projects during one day of work (from 9am to roughly 5.30pm).

Leipzig Open Fragmentary Texts Series (LOFTS)

Monday, December 16th, 2013

The Humboldt Chair of Digital Humanities at the University of Leipzig is pleased to announce a new effort within the Open Philology Project: the Leipzig Open Fragmentary Texts Series (LOFTS).

The Leipzig Open Fragmentary Texts Series is a new effort to establish open editions of ancient works that survive only through quotations and text re-uses in later texts (i.e., those pieces of information that humanists call “fragments”).

As a first step in this process, the Humboldt Chair announces the Digital Fragmenta Historicorum Graecorum (DFHG) Project, whose goal is to produce a digital edition of the five volumes of Karl Müller’s Fragmenta Historicorum Graecorum (FHG) (1841-1870), which is the first big collection of fragments of Greek historians ever realized.

For further information, please visit the project website at: http://www.dh.uni-leipzig.de/wo/open-philology-project/the-leipzig-open-fragmentary-texts-series-lofts/

Publishing Text for a Digital Age

Friday, December 6th, 2013

March 27-30, 2014
Tufts University
Medford MA
perseus_neh (at) tufts.edu
http://sites.tufts.edu/digitalagetext/2014-workshop/

Call for contributions!

As a follow-on to Working with Text in a Digital Age, an NEH-funded Institute for Advanced Technologies in the Digital Humanities and in collaboration with the Open Philology Project at the University of Leipzig, Tufts University announces a two-day workshop on publishing textual data that is available under an open license, that is structured for machine analysis as well as human inspection, and that is in a format that can be preserved over time. The purpose of this workshop is to establish specific guidelines for digital publications that publish and/or annotate textual sources from the human record. The registration for the workshop will be free but space will be limited. Some support for travel and expenses will be available. We particularly encourage contributions from students and early-career researchers.

Textual data can include digital versions of traditional critical editions and translations but such data also includes annotations that make traditional tasks (such as looking up or quoting a primary source) machine-actionable, annotations that may build upon print antecedents (e.g., dynamic indexes of places that can be used to generate maps and geospatial visualizations), and annotations that are only feasible in a digital space (such as alignments between source text and translation or exhaustive markup of morphology, syntax, and other linguistic features).

Contributions can be of two kinds:

  1. Collections of textual data that conform to existing guidelines listed below. These collections must include a narrative description of their contents, how they were produced and what audiences and purposes they were designed to serve.
  2. Contributions about formats for publication. These contributions must contain sufficient data to illustrate their advantages and to allow third parties to develop new materials.

All textual data must be submitted under a Creative Commons license. Where documents reflect a particular point of view by a particular author and where the original expression should for that reason not be changed, they may be distributed under a CC-BY-ND license. All other contributions must be distributed under a CC-BY-SA license. Most publications may contain data represented under both categories: the introduction to an edition or a data set, reflecting the reasons why one or more authors made a particular set of decisions, can be distributed under a CC-BY-ND license. All data sets (such as geospatial annotation, morphosyntactic analyses, reconstructed texts with textual notes, diplomatic editions, translations) should be published under a CC-BY-SA license.

Contributors should submit abstracts of up to 500 words to EasyChair. We particularly welcome abstracts that describe data already available under a Creative Commons license. 

Dates:

January 1, 2014:  Submissions are due. Please submit via EasyChair.

January 20, 2014:  Notification.

Duke Collaboratory for Classics Computing (DC3)

Wednesday, May 8th, 2013

Colleagues:

We are very pleased to announce the creation of the Duke Collaboratory for Classics Computing (DC3), a new Digital Classics R&D unit embedded in the Duke University Libraries, whose start-up has been generously funded by the Andrew W. Mellon Foundation and Duke University’s Dean of Arts & Sciences and Office of the Provost.

The DC3 goes live 1 July 2013, continuing a long tradition of collaboration between the Duke University Libraries and papyrologists in Duke’s Department of Classical Studies. The late Professors William H. Willis and John F. Oates began the Duke Databank of Documentary Papyri (DDbDP) more than 30 years ago, and in 1996 Duke was among the founding members of the Advanced Papyrological Information System (APIS). In recent years, Duke led the Mellon-funded Integrating Digital Papyrology effort, which brought together the DDbDP, Heidelberger Gesamtverzeichnis der Griechischen Papyrusurkunden Ägyptens (HGV), and APIS in a common search and collaborative curation environment (papyri.info), and which collaborates with other partners, including Trismegistos, Bibliographie Papyrologique, Brussels Coptic Database, and the Arabic Papyrology Database.

The DC3 team will see to the maintenance and enhancement of papyri.info data and tooling, cultivate new partnerships in the papyrological domain, experiment in the development of new complementary resources, and engage in teaching and outreach at Duke and beyond.

The team’s first push will be in the area of Greek and Latin Epigraphy, where it plans to leverage its papyrological experience to serve a much larger community. The team brings a wealth of experience in fields like image processing, text engineering, scholarly data modeling, and building scalable web services. It aims to help create a system in which the many worldwide digital epigraphy projects can interoperate by linking into the graph of scholarly relationships while maintaining the full force of their individuality.

The DC3 team is:

Ryan BAUMANN: Has worked on a wide range of Digital Humanities projects, from applying advanced imaging and visualization techniques to ancient artifacts, to developing systems for scholarly editing and collaboration.

Hugh CAYLESS: Has over a decade of software engineering expertise in both academic and industrial settings. He also holds a Ph.D. in Classics and a Master’s in Information Science. He is one of the founders of the EpiDoc collaborative and currently serves on the Technical Council of the Text Encoding Initiative.

Josh SOSIN: Associate Professor of Classical Studies and History, Co-Director of the DDbDP, Associate editor of Greek, Roman, and Byzantine Studies; an epigraphist and papyrologist interested in the intersection of ancient law, religion, and the economy.

 

Official Release of the Virtual Research Environment TextGrid

Friday, April 27th, 2012

TextGrid (http://www.textgrid.de) is a platform for scholars in the humanities which makes possible the collaborative analysis, evaluation and publication of cultural remains (literary sources, images and codices) in a standardized way. The central idea was to bring together instruments for dealing with texts under a common user interface. The workbench offers a range of tools and services for scholarly editing and linguistic research, which are extensible via open interfaces, such as editors for linking texts to each other or text sequences to images, tools for musical score edition, for gloss editing, for automatic collation, etc.

On the occasion of the official release of TextGrid 2.0, a summit will take place from the 14th to the 15th of May 2012. The summit will start on the 14th with a workshop day on which the participants can get an insight into some of the new tools. Lectures and a discussion group are planned for the following day.

For more information and registration see this German website:

http://www.textgrid.de/summit2012

With kind regards

Celia Krause


Celia Krause
Technische Universität Darmstadt
Institut für Sprach- und Literaturwissenschaft
Hochschulstrasse 1
64289 Darmstadt
Tel.: 06151-165555

TILE 1.0 released

Friday, July 22nd, 2011

Those who have been waiting impatiently for the first stable release of the Text Image Linking Environment (TILE) toolkit need wait no longer: the full program can be downloaded from: <http://mith.umd.edu/tile/>. From that site:

The Text-Image Linking Environment (TILE) is a web-based tool for creating and editing image-based electronic editions and digital archives of humanities texts.

TILE features tools for importing and exporting transcript lines and images of text, an image markup tool, a semi-automated line recognizer that tags regions of text within an image, and a plugin architecture to extend the functionality of the software.

I haven’t tried TILE out for myself yet, but I’m looking forward to doing so.

Open Access and Citation Impact

Wednesday, November 17th, 2010

A recent study published in the Public Library of Science has tested the relationship between Open Access self-archiving of peer-reviewed articles and improved citation impact.

See: Gargouri Y et al. ‘Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research’ PLoS ONE 5(10)

The correlation between publications that are freely available online and high citation metrics has been established many times before and is unarguable, but some have questioned (in what strikes me as a stretch of reasoning) whether this correlation can be taken to imply causation. (In other words, they argue, “Yeah but, maybe those open access papers are cited more because people would only upload their really good papers to the Web that would be cited a lot anyway!”) Harnad and co. demonstrate pretty conclusively using controlled and tested methods that both voluntarily self-archived papers, and those that are required by funding bodies or institutions to be openly archived, have the same beneficial impact on citation, *and* that this benefit is proportionally even greater for the most high-impact publications.

Like I say, we kind of knew this, but we now have a scientific publication we can cite to demonstrate it even to the skeptics.

GRBS Free Online

Wednesday, July 22nd, 2009

Recently circulated by Joshua Sosin:

Volume 49 (2009) will be the last volume of GRBS printed on paper. Beginning with volume 50, issues will be published quarterly on-line on the GRBS website, on terms of free access. We undertake this transformation in the hope of affording our authors a wider readership; out of concern for the financial state of our libraries; and in the belief that the dissemination of knowledge should be free.

The current process of submission and peer-review of papers will continue unchanged. The on-line format will be identical with our pages as now printed, and so articles will continue to be cited by volume, year, and page numbers.

Our hope is that both authors and readers will judge this new medium to be to their advantage, and that such open access will be of benefit to continuing scholarship on Greece.

– The editors

http://www.duke.edu/web/classics/grbs

(I for one think this is great news: we know that online publications are read and cited some orders of magnitude more widely than dead tree volumes; we also know that many academic journals are largely edited, administered, peer-reviewed and proof-read by a volunteer staff of academics who see none of the profit for expensive volumes–so why not cut out the middleman and publish these high-quality products directly to the audience?)

The Digital Archimedes Palimpsest Released

Wednesday, October 29th, 2008

Very exciting news – the complete dataset of the Archimedes Palimpsest project (ten years in the making) has been released today. The official announcement is copied below, but I’d like to point out what I think it is that makes this project so special. It isn’t the object – the manuscript – or the content – although I’m sure the previously unknown texts are quite exciting for scholars. It isn’t even the technology, which includes multispectral imaging used to separate out the palimpsest from the overlying text and the XML transcriptions mapped to those images (although that’s a subject close to my heart).

What’s special about this project is its total dedication to open access principles, and an implied trust in the way it is being released that open access will work. There is no user interface. Instead, all project data is being released under a Creative Commons 3.0 attribution license. Under this license, anyone can take this data and do whatever they want to with it (even sell it), as long as they attribute it to the Archimedes Palimpsest project. The thinking behind this is that, by making the complete project data available, others will step up and build interfaces… create searches… make visualizations… do all kinds of cool stuff with the data that the developers might not even consider.

To be fair, this isn’t the only project I know of that is operating like this; the complete high-resolution photographs and accompanying metadata for manuscripts digitized through the Homer Multitext project are available freely, as the other project data will be when it’s completed, although the HMT as far as I know will also have its own user interface. There may be others as well. But I’m impressed that the project developers are releasing just the data, and trusting that scholars and others will create user environments of their own.

The Stoa was founded on principles of open access. It’s validating to see a high-visibility project such as the Archimedes Palimpsest take those principles seriously.

Ten years ago today, a private American collector purchased the Archimedes Palimpsest. Since that time he has guided and funded the project to conserve, image, and study the manuscript. After ten years of work, involving the expertise and goodwill of an extraordinary number of people working around the world, the Archimedes Palimpsest Project has released its data. It is a historic dataset, revealing new texts from the ancient world. It is an integrated product, weaving registered images in many wavebands of light with XML transcriptions of the Archimedes and Hyperides texts that are spatially mapped to those images. It has pushed boundaries for the imaging of documents, and relied almost exclusively on current international standards. We hope that this dataset will be a persistent digital resource for the decades to come. We also hope it will be helpful as an example for others who are conducting similar work. It is published under a Creative Commons 3.0 attribution license, to ensure ease of access and the potential for widespread use. A complete facsimile of the revealed palimpsested texts is available on Googlebooks as “The Archimedes Palimpsest”. It is hoped that this is the first of many uses to which the data will be put.

For information on the Archimedes Palimpsest Project, please visit: www.archimedespalimpsest.org

For the dataset, please visit: www.archimedespalimpsest.net

We have set up a discussion forum on the Archimedes Palimpsest Project. Any member can invite anybody else to join. If you want to become a member, please email:

wnoel@thewalters.org

I would be grateful if you would circulate this to your friends and colleagues.

Thank you very much

Will Noel
The Walters Art Museum
October 29th, 2008.

UMich libraries goes creative-commons

Monday, October 20th, 2008

Via Open-Access News we learn:

The University of Michigan Library has decided to adopt Creative Commons Attribution-Non-Commercial licenses for all works created by the Library for which the Regents of the University of Michigan hold the copyrights. These works include bibliographies, research guides, lesson plans, and technology tutorials.

Legal guide to GPL compliance

Sunday, September 14th, 2008

I posted a few weeks ago on a guide to citing Creative Commons works, and just a short while later I saw this not directly related story about a Practical Guide to GPL Compliance, from the Software Freedom Law Center. Where the CC guide is primarily about citation, and therefore of interest to many Digital Humanists/Classicists who work with these licenses, the GPL guide is a subtly different animal. Free and Open Source Software licensing is a more fraught area, since in most cases software is re-used (if at all) and embedded in a new product that includes new code as well as the re-used FOSS parts. In some cases this new software may be sold or licensed for financial gain, or attached to services that are charged for, or otherwise part of a commercial product. It is therefore extremely useful to have this practical guide to issues of legality (including documentation and availability of license information) available to programmers and to companies that make use of FOSS code. One worth bookmarking.

Open Access Day Announced: 14 October 2008

Friday, August 29th, 2008

By way of Open Access News we learn of the announcement of Open Access Day 2008:

SPARC (the Scholarly Publishing and Academic Resources Coalition), the Public Library of Science (PLoS), and Students for Free Culture have jointly announced the first international Open Access Day. Building on the worldwide momentum toward Open Access to publicly funded research, Open Access Day will create a key opportunity for the higher education community and the general public to understand more clearly the opportunities of wider access and use of content.

Open Access Day will invite researchers, educators, librarians, students, and the public to participate in live, worldwide broadcasts of events.

How to cite Creative Commons works

Saturday, August 16th, 2008

A very useful guide is being compiled by Molly Kleinman in her Multi-Purpose Librarian blog. As someone who licenses a lot of work using CC-BY, and who both re-uses and sometimes re-mixes a lot of CC work (especially photographs) for both academic and creative ends, I recognise that it isn’t always clear exactly what “attribution” means, for example. Kleinman gives examples of ideal and realistic usage (the real name of a copyright-holder and/or title of a work may not always be known, say), and makes suggestions for good practice and compromises. This is a very welcome service, and I hope that more examples and comments follow.

Self-archiving

Tuesday, July 8th, 2008

Michael E. Smith has just blogged an opinion piece on self-archiving.

Microsoft Ends Book and Article Scanning

Saturday, May 24th, 2008

Miguel Helf, writing in the New York Times, reports:

Microsoft said Friday that it was ending a project to scan millions of books and scholarly articles and make them available on the Web … Microsoft’s decision also leaves the Internet Archive, the nonprofit digital archive that was paid by Microsoft to scan books, looking for new sources of support.

The blog post in question (by Satya Nadella, Senior vice president search, portal and advertising) indicates that both Live Search Books and Live Search Academic (the latter being Microsoft’s competitor with Google Scholar) will be shut down next week:

Books and scholarly publications will continue to be integrated into our Search results, but not through separate indexes. This also means that we are winding down our digitization initiatives, including our library scanning and our in-copyright book programs.

For its part, the Internet Archive has posted a short response addressing the situation, and focusing on the status of the out-of-copyright works Microsoft scanned and the scanning equipment they purchased (both have been donated to IA restriction-free), and on the need for eventual public funding of the IA’s work.

This story is being widely covered and discussed elsewhere; a Google News Search rounds up most sources.

new CC Journal: Glossator

Friday, May 16th, 2008

By way of the Humanist.

Glossator: Practice and Theory of the Commentary
http://ojs.gc.cuny.edu/index.php/glossator/

Glossator publishes original commentaries, editions and translations of commentaries, and essays and articles relating to the theory and history of commentary, glossing, and marginalia. The journal aims to encourage the practice of commentary as a creative form of intellectual work and to provide a forum for dialogue and reflection on the past, present, and future of this ancient genre of writing. By aligning itself, not with any particular discipline, but with a particular mode of production, Glossator gives expression to the fact that praxis founds theory.

Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

Call for Submissions online for the first volume, to be published in 2009:
http://ojs.gc.cuny.edu/index.php/glossator/

New Open-Access Humanities Press Makes Its Debut

Thursday, May 8th, 2008

Article in The Chronicle of Higher Education

Scholars in the sciences have been light-years ahead of their peers in the humanities in exploring the possibilities of open-access publishing. But a new venture with prominent academic backers, the Open Humanities Press, wants to help humanists close the gap.

“Scholars in all disciplines tend to confuse online publication with the bypassing of peer review,” [Peter] Suber observed. “That’s simply mistaken.” In the humanities in particular, he said, “we’re fighting the prestige of print.”

CHE, Today’s News, May 7, 2008:

http://chronicle.com/temp/email2.php?id=WqvC6RkTkxgjB9pb92RywcgrsJVtXz9K