Archive for the ‘Projects’ Category

Suda Online entry URLs and index

Friday, January 23rd, 2009

Excellent news of upgrades to the SOL site in a recent report from Raphael Finkel:

At the suggestion of one of our translators, Nick Nicholas, I have added a link to the SOL front page called “Entire list of entries”. If you go there, you will find a list of all the Suda entries, whether translated or not. Each is a link that gets you to the current translation; if there is none (as with phi,849 for instance), you get to the source text.

There are two things to note. First, the links are in a new, more memorable form. I have introduced a URL-rewrite rule in the web server that converts this sort of URL into the less memorable form that the server uses internally.

You can use this new form of URL if you wish to embed pointers to the SOL in other web pages.

Second, the real purpose of this list of entries is so that web crawlers like Google will find it and index the contents of the entire SOL. Within a short time, we should be able to use a search engine with a search like “aaron biography omicron kurion” and find this same entry. We’ll see if that works.
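For readers curious about the mechanics, a rewrite rule of this kind simply maps a short, human-readable path onto the longer query-string URL that the server application actually handles. The sketch below illustrates the idea in Python; the paths and parameter names are invented for the example and are not the real SOL URLs or rewrite rule.

    import re

    # A minimal sketch of the idea behind a URL-rewrite rule: a short,
    # memorable path is mapped onto the longer query-string URL that the
    # application actually handles. All patterns and parameter names here
    # are hypothetical; they are not the real SOL URLs.
    FRIENDLY = re.compile(r"^/sol-entries/(?P<adler>[a-z]+),(?P<num>\d+)$")

    def rewrite(path: str) -> str:
        """Map a friendly entry path to a hypothetical internal search URL."""
        m = FRIENDLY.match(path)
        if m:
            return f"/cgi-bin/sol-search?entry={m.group('adler')},{m.group('num')}"
        return path  # anything else passes through unchanged

    print(rewrite("/sol-entries/phi,849"))  # -> /cgi-bin/sol-search?entry=phi,849
    print(rewrite("/index.html"))           # -> /index.html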

The Digital Archimedes Palimpsest Released

Wednesday, October 29th, 2008

Very exciting news – the complete dataset of the Archimedes Palimpsest project (ten years in the making) has been released today. The official announcement is copied below, but I’d like to point out what I think it is that makes this project so special. It isn’t the object – the manuscript – or the content – although I’m sure the previously unknown texts are quite exciting for scholars. It isn’t even the technology, which includes the multispectral imaging used to separate the palimpsested undertext from the overlying text and the XML transcriptions mapped to those images (although that’s a subject close to my heart).

What’s special about this project is its total dedication to open access principles, and an implied trust in the way it is being released that open access will work. There is no user interface. Instead, all project data is being released under a Creative Commons 3.0 attribution license. Under this license, anyone can take this data and do whatever they want to with it (even sell it), as long as they attribute it to the Archimedes Palimpsest project. The thinking behind this is that, by making the complete project data available, others will step up and build interfaces… create searches… make visualizations… do all kinds of cool stuff with the data that the developers might not even consider.

To be fair, this isn’t the only project I know of that is operating like this; the complete high-resolution photographs and accompanying metadata for manuscripts digitized through the Homer Multitext project are freely available, as the rest of that project’s data will be once it is complete, although the HMT, as far as I know, will also have its own user interface. There may be others as well. But I’m impressed that the project developers are releasing just the data, and trusting that scholars and others will create user environments of their own.

The Stoa was founded on principles of open access. It’s validating to see a high-visibility project such as the Archimedes Palimpsest take those principles seriously.

Ten years ago today, a private American collector purchased the Archimedes Palimpsest. Since that time he has guided and funded the project to conserve, image, and study the manuscript. After ten years of work, involving the expertise and goodwill of an extraordinary number of people working around the world, the Archimedes Palimpsest Project has released its data. It is a historic dataset, revealing new texts from the ancient world. It is an integrated product, weaving registered images in many wavebands of light with XML transcriptions of the Archimedes and Hyperides texts that are spatially mapped to those images. It has pushed boundaries for the imaging of documents, and relied almost exclusively on current international standards. We hope that this dataset will be a persistent digital resource for the decades to come. We also hope it will be helpful as an example for others who are conducting similar work. It is published under a Creative Commons 3.0 attribution license, to ensure ease of access and the potential for widespread use. A complete facsimile of the revealed palimpsested texts is available on Google Books as “The Archimedes Palimpsest”. It is hoped that this is the first of many uses to which the data will be put.

For information on the Archimedes Palimpsest Project, please visit:

For the dataset, please visit:

We have set up a discussion forum on the Archimedes Palimpsest Project. Any member can invite anybody else to join. If you want to become a member, please email:

I would be grateful if you would circulate this to your friends and colleagues.

Thank you very much

Will Noel
The Walters Art Museum
October 29th, 2008.

Contribute to the Greek and Latin Treebanks at Perseus!

Thursday, August 28th, 2008

Posted on behalf of Greg Crane. Link to the Treebank, which provides more information, at the very end of the post.

We are currently looking for advanced students of Greek and Latin to contribute syntactic analyses (via a web-based system) to our existing Latin Treebank (described below) and our emerging Greek Treebank as well (for which we have just received funding). We particularly encourage students at various levels to design research projects around this new tool. We are looking in particular for the following:

  • Get paid to read Greek! We can offer a limited number of research assistantships for advanced students of the languages who can work for the project from their home institutions. We particularly encourage students who can use the analyses that they produce to support research projects of their own.
  • We encourage Greek and Latin classes to contribute as well. Creating the syntactic analyses provides a new way to address the traditional task of parsing Greek and Latin. Your class work can then contribute to a foundational new resource for the study of Greek and Latin – both courses as a whole and individual contributors are acknowledged in the published data.
  • Students and faculty interested in conducting their own original research based on treebank data will have the option to submit their work for editorial review to have it published as part of the emerging Scaife Digital Library.

To contribute, please contact David Bamman or Gregory Crane.

Office of Digital Humanities: Search for funded projects

Wednesday, August 20th, 2008

Playing around on the website of the National Endowment for the Humanities’ Office of Digital Humanities this afternoon, I came across the Library of Funded Projects, a database of projects funded through the ODH. Visitors can search by Categories (the technical focus of the projects, not their subject), Grant Programs, or keyword. Project records include most of the information one would want, including PI, award dates, funding, abstract, a link to the project website (when one exists), and a space to link project white papers (which are required at the conclusion of all ODH-funded projects).

The LFP is not up-to-date; searches for several of the grant programs come up empty (including programs that currently have funded projects). Even so, this could be an immensely valuable resource to help scholars keep abreast of new work being done in the field, especially the smaller projects supported through the Start-Up program.

(The keyword search, like most keyword searches, takes some getting used to: “Classics” turns up nothing, while “classical” and “ancient” pull up two different but slightly overlapping lists.)

UPDATE: Are there similar libraries/databases for other national funding agencies (DFG, JISC, etc.)? If so, please cite them in the comments. Thanks!

Problems and outcomes in digital philology (session 3: methodologies)

Thursday, March 27th, 2008

The Marriage of Mercury and Philology: Problems and outcomes in digital philology

e-Science Institute, Edinburgh, March 25-27 2008.

(Event website; programme wiki; original call)

I was asked to summarize the third session of papers in the round table discussion this afternoon. My notes (which I hope do not misrepresent anybody’s presentation too brutally) are transcribed below.

Session 3: Methodologies

1. Federico Meschini (De Montfort University) ‘Mercury ain’t what he used to be, but was he ever? Or, do electronic scholarly editions have a mercurial attitude?’ (Tuesday, 1400)

Meschini gave a very useful summary of the issues facing editors or designers of digital critical editions. The issues he raised included:

  • the need for good metadata standards to address the problems of (inevitable and to some extent desirable) incompatibility between different digital editions;
  • the need for a modularized approach that can include many very specialist tools (the “lego bricks” model);
  • the desirability of planning a flexible structure in advance so that the model can grow organically, along with the recognition that no markup language is complete, so all models need to be extensible.

After a brief discussion of the reference models available to the digital library world, he explained that digital critical editions are different from digital libraries, and therefore need different models. A digital edition is not merely a delivery of information, it is an environment with which a “reader” or “user” interacts. We need, therefore, to engage with the question: what are the functional requirements for text editions?

A final summary of some exciting recent movements, technologies, and discussions in online editions served as a useful reminder that, far from taking it for granted that we know what a digital critical edition should look like, we need to think very carefully about the issues Meschini raises and about other discussions of this question.

2. Edward Vanhoutte (Royal Academy of Dutch Language and Literature, Belgium) ‘Electronic editions of two cultures –with apologies to C.P. Snow’ (Tuesday, 1500)

Vanhoutte began with the rhetorical observation that our approach to textual editions is inadequate because the editions are not as intuitive to users, flexible in what they can contain, and extensible in use and function as a household amenity such as the refrigerator. If the edition is an act of communication, an object that mediates between a text and an audience, then it fails if we do not address the “problem of two audiences” (citing Lavagnino). We serve the audience of our peers fairly well – although we should be aware that even this is a more heterogeneous and varied group than we sometimes recognise – but the “common audience”, the readership who are not text editors themselves, are poorly served by current practice.

After some comments on different types of editions (a maximal edition containing all possible information would be too rich and complex for any one reader, so minimal editions of different kinds can be abstracted from this master, for example), and a summary of Robinson’s “fluid, cooperative, and distributed editions”, Vanhoutte made his own recommendation. We need, in summary, to teach our audience, preferably by example, how to use our editions and tools; how to replicate our work, the textual scholarship and the processes performed on it; how to interact with our editions; and how to contribute to them.

Lively discussion after this paper revolved around the question of what it means to educate your audience: writing a “how to” manual is not the best way to encourage engagement with one’s work, but providing multiple interfaces, entry-points, and cross-references that illustrate the richness of the content might be more accessible.

3. Peter Robinson (ITSEE, Birmingham) ‘What we have been doing wrong in making digital editions, and how we could do better?’ (Tuesday, 1630)

Robinson began his provocative and speculative paper by considering a few projects that typify things we do and do not do well: we do not always distribute project output successfully; we do not always achieve the right level of scholarly research value. Most importantly, it is still near-impossible for a good critical scholar to create an online critical edition without technical support, funding for the costs of digitization, and a dedicated centre for the maintenance of a website. All of this means that grant funding is still needed for all digital critical work.

Robinson has a series of recommendations that, he hopes, will help to empower the individual scholar to work without the collaboration of a humanities computing centre to act as advisor, creator, librarian, and publisher:

  1. Make available high-quality images of all our manuscripts (this may need to be funded by a combination of government money, grant funding, and individual users paying for access to the results).
  2. Funding bodies should require the base data for all projects they fund to be released under a Creative Commons Attribution-ShareAlike license.
  3. Libraries and not specialist centres should hold the data of published projects.
  4. Commercial projects should be involved in the production of digital editions, bringing their experience of marketing and money-making to help make projects sustainable and self-funding.
  5. Most importantly, he proposes the adoption of a common infrastructure, a set of agreed descriptors and protocols for labelling, pointing to, and sharing digital texts. An existing protocol such as the Canonical Text Services might do the job nicely (a sketch of such identifiers follows this list).
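To give a concrete sense of what an agreed identifier scheme buys you, the sketch below pulls apart a CTS URN of the familiar form urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1 (namespace, text group, work, edition, passage). The parsing code is only an illustration written for this post, not part of any official CTS implementation.

    from dataclasses import dataclass

    @dataclass
    class CtsUrn:
        namespace: str   # e.g. "greekLit"
        textgroup: str   # e.g. "tlg0012" (Homer)
        work: str        # e.g. "tlg001" (Iliad)
        version: str     # e.g. "perseus-grc1"; may be empty for a notional work
        passage: str     # e.g. "1.1" (book 1, line 1); may be empty

    def parse_cts_urn(urn: str) -> CtsUrn:
        """Split a CTS URN into its components (illustrative sketch only)."""
        parts = urn.split(":")
        if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
            raise ValueError(f"not a CTS URN: {urn}")
        namespace, work_part = parts[2], parts[3]
        passage = parts[4] if len(parts) > 4 else ""
        pieces = work_part.split(".")
        textgroup = pieces[0]
        work = pieces[1] if len(pieces) > 1 else ""
        version = pieces[2] if len(pieces) > 2 else ""
        return CtsUrn(namespace, textgroup, work, version, passage)

    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1"))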

4. Manfred Thaller (Cologne) ‘Is it more blessed to give than to receive? On the relationship between Digital Philology, Information Technology and Computer Science’ (Wednesday, 0950)

Thaller gave the last paper, on the morning of the third day of this event, in which he asked (and answered) the over-arching question: Do computer science professionals already provide everything that we need? And underlying this: Do humanists still need to engage with computer science at all? He pointed out two classes of answer to this question:

  • The intellectual response: there are things that we as humanists need and that computer science is not providing. Therefore we need to engage with the specialists to help develop these tools for ourselves.
  • The political response: maybe we are getting what we need already, but we will experience profitable side effects from collaborating with computer scientists, so we should do it anyway.

Thaller demonstrated via several examples that we do not in fact get everything we need from computer scientists. He pointed out that two big questions were identified in his own work twelve years ago: the need for software for dynamic editions, and the need for mass digitization. Since 1996 mass digitization has come a long way in Germany, and many projects are now underway to image millions of pages of manuscripts and incunabula in that country. Dynamic editions, on the other hand, despite some valuable work on tools and publications, seem little closer than they were twelve years ago.

Most importantly, we as humanists need to recognize that any collaboration with computer scientists is a reciprocal arrangement: we offer skills as well as receive services. One of the most difficult challenges facing computer scientists today, we hear, is to engage with, organise, and add semantic value to the mass of imprecise, ambiguous, incomplete, unstructured, and out-of-control data that is the Web. Humanists have spent the last two hundred years studying imprecise, ambiguous, incomplete, unstructured, and out-of-control materials. If we do not lend our experience and expertise to help the computer scientists solve this problem, then we cannot expect free help from them to solve our problems.

Services and Infrastructure for a Million Books (round table)

Monday, March 17th, 2008

Million Books Workshop, Friday, March 14, 2008, Imperial College London.

The second of two round tables in the afternoon of the Million Books Workshop, chaired by Brian Fuchs (Imperial College London), asked a panel of experts what services and infrastructure they would like to see in order to make a Million Book corpus useful.

  1. Stuart Dunn (Arts and Humanities e-Science Support Centre): the kinds of questions that will be asked of the Million Books mean that the structure of this collection needs to be more sophisticated than just a library catalogue
  2. Alistair Dunning (Archaeological Data Service & JISC): powerful services are urgently needed to enable humanists both to find and to use the resources in this new collection
  3. Michael Popham (OULS but formerly director of e-Science Centre): large scale digitization is a way to break down the accidental constraints of time and place that limit access to resources in traditional libraries
  4. David Shotton (Image Bioinformatics Research Group): emphasis is on accessibility and the semantic web. It is clear that manual building of ontologies does not scale to millions of items; therefore data mining and topic modelling are required, possibly assisted by crowdsourcing. It is essential to be able to integrate heterogeneous sources in a single, semantic infrastructure
    1. Dunning: citability and replicability of research becomes a concern with open publication on this scale
    2. Dunn: the archaeology world has similar concerns, cf. the recent LEAP project
  5. Paul Walk (UK Office for Library and Information Networking): concerned with what happens to the all-important role of domain expertise in this world of repurposable services: where is the librarian?
    1. Charlotte Roueché (KCL): learned societies need to play a role in assuring quality and trust in open publications
    2. Dunning: institutional repositories also need to play a role in long-term archiving. Licensing is an essential component of preservation—open licenses are required for maximum distribution of archival copies
    3. Thomas Breuel (DFKI): versioning tools and infrastructure for decentralised repositories exist (e.g. Mercurial)
    4. Fuchs: we also need mechanisms for finding, searching, identifying, and enabling data in these massive collections
    5. Walk: we need to be able to inform scholars when new data in their field of interest appears via feeds of some kind

(Disclaimer: this is only one blogger’s partial summary. The workshop organisers will publish an official report on this event.)

What would you do with a million books? (round table)

Sunday, March 16th, 2008

Million Books Workshop, Friday, March 14, 2008, Imperial College London.

In the afternoon, the first of two round table discussions concerned the uses to which massive text digitisation could be put by the curators of various collections.

The panellists were:

  • Dirk Obbink, Oxyrhynchus Papyri project, Oxford
  • Peter Robinson, Institute for Textual Scholarship and Electronic Editing, Birmingham
  • Michael Popham, Oxford University Library Services
  • Charlotte Roueché, EpiDoc and Prosopography of the Byzantine World, King’s College London
  • Keith May, English Heritage

Chaired by Gregory Crane (Perseus Digital Library), who kicked off by asking the question:

If you had all of the texts relevant to your field—scanned as page images and OCRed, but nothing more—what would you want to do with them?

  1. Roueché: analyse the texts in order to compile references toward a history of citation (and therefore a history of education) in later Greek and Latin sources.
  2. Obbink: generate a queriable corpus
  3. Robinson: compare editions and manuscripts for errors, variants, etc.
    1. Crane: machine annotation might achieve results not possible with human annotation (especially at this scale), particularly if learning from a human-edited example
    2. Obbink: identification of text from lost manuscripts and witnesses toward generation of stemmata. Important question: do we also need to preserve apparatus criticus?
  4. May: perform detailed place and time investigations into a site preparatory to performing any new excavations
    1. Crane: data mining and topic modelling could lead to the machine-generation of an automatically annotated gazetteer, prosopography, dictionary, etc.
  5. Popham: metadata on digital texts scanned by Google not always accurate or complete; not to academic standards: the scanning project is for accessibility, not preservation
    1. Roueché: Are we talking about purely academic exploitation, or our duty as public servants to make our research accessible to the wider public?
    2. May: this is where topic analysis can make texts more accessible to the non-specialist audience
    3. Brian Fuchs (ICL): insurance and price comparison sites, Amazon, etc., have sophisticated algorithms for targeting web materials at particular audiences
    4. Obbink: we will also therefore need translations of all of these texts if we are reaching out to non-specialists; will machine translation be able to help with this?
    5. Roueché: and not just translations into English, we need to make these resources available to the whole world.

(Disclaimer: this summary is partial and partisan, reflecting those elements of the discussion that seemed most interesting and relevant to this blogger. The workshop organisers will publish an official report on this event presently.)

Million Books Workshop (brief report)

Saturday, March 15th, 2008

Imperial College London.
Friday, March 14, 2008.

David Smith gave the first paper of the morning on “From Text to Information: Machine Translation”. The discussion included a survey of machine translation techniques (including the automatic discovery of existing translations by language comparison), and some of the value of cross-language searching.

[Please would somebody who did not miss the beginning of the session provide a more complete summary of Smith’s paper?]

Thomas Breuel then spoke on “From Image to Text: OCR and Mass Digitisation” (this would have been the first paper in the day, kicking off the developing thread from image to text to information to meaning, but transport problems caused the sequence of presentations to be altered). Breuel discussed the status of professional OCR packages, which are usually not very trainable and have their accuracy constrained by speed requirements, and explained how the Google-sponsored but Open Source OCRopus package intends to improve on this situation. OCRopus is highly extensible and trainable, but currently geared to the needs of the Google Print project (and so while effective at scanning book pages, may be less so for more generic documents). Currently in alpha-release and incorporating the Tesseract OCR engine, this tool currently has a lower error-rate than other Open Source OCR tools (but not the professional tools, which often contain ad hoc code to deal with special cases). A beta release is set for April 2008, which will demo English, German, and Russian language versions, and release 1.0 is scheduled for Fall 2008. Breuel also briefly discussed the hOCR microformat for describing page layouts in a combination of HTML and CSS3.
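The hOCR microformat mentioned above carries OCR results as ordinary HTML: class names mark pages, lines, and words, and a title attribute of the form “bbox x0 y0 x1 y1” records their positions. The fragment and parser below are a minimal sketch of that idea using only the Python standard library; the sample markup is invented, not OCRopus output.

    from html.parser import HTMLParser

    # A hypothetical hOCR fragment: OCR word boxes carried as HTML class
    # and title attributes ("bbox x0 y0 x1 y1"), as the microformat does.
    SAMPLE = """
    <div class="ocr_page" title="bbox 0 0 2480 3508">
      <span class="ocrx_word" title="bbox 110 120 260 168">ARMA</span>
      <span class="ocrx_word" title="bbox 280 120 520 168">VIRVMQVE</span>
    </div>
    """

    class WordBoxes(HTMLParser):
        """Collect (text, bbox) pairs for ocrx_word elements."""
        def __init__(self):
            super().__init__()
            self.boxes, self._bbox = [], None
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if "ocrx_word" in a.get("class", ""):
                self._bbox = tuple(int(n) for n in a.get("title", "").split()[1:5])
        def handle_data(self, data):
            if self._bbox and data.strip():
                self.boxes.append((data.strip(), self._bbox))
                self._bbox = None

    p = WordBoxes()
    p.feed(SAMPLE)
    print(p.boxes)  # [('ARMA', (110, 120, 260, 168)), ('VIRVMQVE', (280, 120, 520, 168))]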

David Bamman gave the second in the “From Text to Information” sequence of papers, in which he discussed building a dynamic lexicon using automated syntax recognition, identifying the grammatical contexts of words in a digital text. With a training set of some thousands of words of Greek and Latin treebanked by hand, automatic syntactic parsing currently achieves an accuracy rate somewhat above 50%. The error rate is still too high for the automated process to be useful as an end in itself (to deliver syntactic tagging to language students, for example), but it is good enough for testing against a human-edited lexicon, which provides a degree of control. Usage statistics and comparisons of related words and meanings give a good sense of the likely meaning of a word or form in a given context.
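As a rough illustration of how treebank annotations feed a dynamic lexicon, the sketch below counts the syntactic relations in which each lemma occurs. The toy records and relation labels are hypothetical and much simpler than real treebank data, which also carries morphology, head pointers, and sentence identifiers.

    from collections import Counter, defaultdict

    # Hypothetical treebank records: (form, lemma, relation of the word to its head).
    records = [
        ("arma",     "arma",   "OBJ"),
        ("virumque", "vir",    "OBJ"),
        ("cano",     "cano",   "PRED"),
        ("Troiae",   "Troia",  "ATR"),
        ("qui",      "qui",    "SBJ"),
        ("primus",   "primus", "ATR"),
    ]

    # Dynamic-lexicon style aggregation: for each lemma, how often does it
    # occur in each grammatical context?
    lexicon = defaultdict(Counter)
    for form, lemma, relation in records:
        lexicon[lemma][relation] += 1

    for lemma, contexts in sorted(lexicon.items()):
        print(lemma, dict(contexts))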

David Mimno completed the thread with a presentation on “From Information to Meaning: Machine Learning and Classification Techniques”. He discussed automated classification based on typical and statistical features (usually binary indicators: is this email spam or not? Is this play a tragedy or a comedy?). Sequences of objects allow for a different kind of processing (for example spell-checking), including named entity recognition. Names need to be identified not only by their form but by their context, and machines do a surprisingly good job at identifying coreference and thus disambiguating between homonyms. A more flexible form of automatic classification is provided by topic modelling, which allows mixed classifications and does not require predefined labels. Topic modelling groups topics, keywords, components, and relationships automatically, based on the frequency of clusters of words and references. This mechanism is an effective means of organising a library collection by automated topic clusters, for example, rather than by a one-dimensional and rather arbitrary classmark system. Generating multiple connections between publications might be a more effective and more useful way to organise a citation index for Classical Studies than the outdated project that is l’Année Philologique.
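Topic modelling of the kind Mimno described is commonly done with latent Dirichlet allocation. The sketch below, run over a few invented documents with scikit-learn, shows the general shape of such a pipeline; it is an illustration of the technique, not anything used in the presentation.

    # A minimal topic-modelling sketch with scikit-learn (LDA over a toy corpus).
    # The documents are invented; a real run would use thousands of texts.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "temple altar priest sacrifice festival",
        "ship harbour cargo merchant voyage",
        "priest festival procession altar offerings",
        "merchant voyage storm harbour ship",
    ]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Recover the vocabulary in column order and print the top words per topic.
    terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[::-1][:4]]
        print(f"topic {i}: {', '.join(top)}")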

Simon Overell gave a short presentation on his doctoral research into the distribution of location references within different language versions of Wikipedia. Using the tagged location links as disambiguators and the language cross-reference tags to compare across the collections, he compiles statistics to analyse bias (in a supposedly Neutral Point-Of-View publication) and to support placename disambiguation. Overell’s work is in progress, and he is actively seeking collaborators who might have projects that could use his data.

In the afternoon there were two round-table discussions on the subjects of “Collections” and “Systems and Infrastructure” that I may report on later if my notes turn out to be usable.

Information Behaviour of the Researcher of the Future (report)

Sunday, January 20th, 2008

The British Library and JISC commissioned the Centre for Information Behaviour and the Evaluation of Research (CIBER) at UCL to produce a report on Information Behaviour of the Researcher of the Future. It’s well worth reading the full report in PDF (which I haven’t finished yet), but among the conclusions listed by the BL press release are:

  • All age groups revealed to share so-called ‘Google Generation’ traits
  • New study argues that libraries will have to adapt to the digital mindset
  • Young people seemingly lacking in information skills; strong message to the government and society at large

A new study overturns the common assumption that the ‘Google Generation’ – youngsters born or brought up in the Internet age – is the most web-literate. The first ever virtual longitudinal study carried out by the CIBER research team at University College London claims that, although young people demonstrate an apparent ease and familiarity with computers, they rely heavily on search engines, view rather than read and do not possess the critical and analytical skills to assess the information that they find on the web.

This is a very interesting combination of conclusions–although many of us have been observing for years that while our youngest students may think they know everything about computers they often don’t actually know the first thing about using the Internet for research (nor, needless to say, about opening up a computer–either physically or metaphorically–and playing with its innards). That the GoogleGen traits such as short attention span, impatience with anything not in the first page of search results, and readiness to flit from topic to topic in the wikiblogoatomosphere are not restricted to teenagers is not news to us “gray ADDers” either.

The suggestion is that libraries, the ultimate custodians of both raw data and interpreted information (and, I would argue, especially schools and universities), need to function in the spirit of this new digital world and serve the needs of our plugged-in and distracted community. Not by making information available in bite-sized, easily identified and digested pieces–that would be pandering, not serving–but by providing educational resources alongside the traditional preserved texts/media. And microformatting it (because our target-audience don’t necessarily know they’re our audience). And future-proofing it.

Web-based Research Tools for Mediterranean Archaeology

Friday, January 4th, 2008

Workshop at the 2008 annual meeting of the Archaeological Institute of America in Chicago

Sunday, 6 January 2008, 9:00 a.m. – noon, Water Tower, Bronze Level, West Tower, Hyatt Regency Hotel

Moderators: Rebecca K. Schindler and Pedar Foss, DePauw University

In recent years several powerful web-based research tools for Mediterranean archaeology have emerged; this workshop brings together researchers who are building and/or maintaining them. Having examined each other’s projects beforehand, presenters demonstrate their own projects, assess their functionality and usefulness, and discuss future needs and possibilities.

The projects range from macro-scale (country- or Mediterranean-wide metadata) to micro-scale (specific sites and artifact types). Two initiatives are on-line databases for archaeological fieldwork: Foss and Schindler demonstrate MAGIS, an inventory of survey projects across Europe and the Mediterranean; Fentress demonstrates the Fasti OnLine, which records excavations in Italy and several neighboring countries. Both projects employ web-based GIS to allow spatial and database searches. With the release of Google Earth and Google Maps, GIS functionality for tracking landscapes has become widely available to mainstream, not just specialist, users. Savage offers the Jordan Archaeological Database and Information System (JADIS) as a case-study of how Google-GIS functionality may be employed in archaeological research.

Numerous archaeological projects use the web to present and collect data (to varying degrees of detail). Watkinson and Hartzler demonstrate the Agora Excavations on-line, an example of how the web can clearly present a complex, long-excavated site through its organization of artifacts, documentary materials, and visual interfaces. Heath then gives a close-up look at the on-line study collection of ceramics from Ilion; what is the potential for Web-based reference collections to enhance the study of ceramic production and distribution?

ArchAtlas, presented by Harlan and Wilkinson, and the Pleiades Project, presented by Elliott, both seek to link geo-spatial and archaeological data through on-line collaborations. These projects raise issues of interoperability and shared datasets. ArchAtlas aims to be a hub for interpretive cartographic visualization of archaeological problems and data; Pleiades is developing an atlas of ancient sites. Finally, Chavez from the Perseus Project considers the challenges of accessibility, sustainability, and viability in the ever-changing world of technology — how do we ensure that these projects are still usable 20 years from now, and what new resources can we imagine developing?

These projects are representative of the types of on-line initiatives for Mediterranean archaeology in current development. Their tools enable the compilation and dissemination of large amounts of information that can lead to interesting new questions about the Mediterranean world. This is a critical time to step back, assess the resources, and consider future needs and desires.


  • Pedar Foss (DePauw University)
  • Elizabeth Fentress (International Association for Classical Archaeology)
  • Stephen Savage (Arizona State University)
  • Bruce Hartzler and Charles Watkinson (American School of Classical Studies at Athens)
  • Sebastian Heath (American Numismatic Society)
  • Tom Elliott (University of North Carolina at Chapel Hill)
  • Debi Harlan (Oxford University)
  • Toby Wilkinson (British Institute at Ankara)
  • Robert Chavez (Tufts University)

Technology Collaboration Awards

Saturday, December 15th, 2007

An announcement from Mellon (via the CHE):

Five universities were among the 10 winners of the Mellon Awards for Technology Collaboration, announced this week. They will share $650,000 in prize money for “leadership in the collaborative development of open-source software tools with application to scholarship in the arts and humanities.” The university winners were:

  • Duke University for the OpenCroquet open-source 3-D virtual worlds environment
  • Open Polytechnic of New Zealand for several projects, including the New Zealand Open Source Virtual Learning Environment
  • Middlebury College for the Segue interactive-learning management system
  • University of Illinois at Urbana-Champaign for two projects: the Firefox Accessibility Extension and the OpenEAI enterprise application integration project
  • University of Toronto for the ATutor learning content-management system.

Other winners included the American Museum of the Moving Image for a collections-management system, and the Participatory Culture Foundation for the Miro media player. The winners were announced at the fall task-force meeting of the Coalition for Networked Information, and awards were presented by the World Wide Web pioneer Tim Berners-Lee. –Josh Fischman

Perseus code goes Open Source!

Tuesday, November 13th, 2007

From Greg Crane comes the much-anticipated word that all of the hopper code and much of the content in Perseus is now officially open sourced:

November 9, 2007: *Install Perseus 4.0 on your computer*:

All of the source code for the Perseus Java Hopper and much of the content in Perseus is now available under an open source license. You can download the code, compile it, and run it on your own system. This requires more labor and a certain level of expertise for which we can only provide minimal support. However, since it will be running on your own machine, it can be much faster than our website, especially during peak usage times. You also have the option to install only certain collections or texts on your version, making it as specialized as you wish. Also, if you want to use a different system to make the content available, you can do so within the terms of the Creative Commons license. This is the first step in open sourcing the code: you can modify the code as much as you want, but at this time, we cannot integrate your changes back into our system. That is our ultimate goal, so keep a look out for that!

Download source code here

Download text data here

Open Library

Saturday, October 27th, 2007

Adding this grandiose Open Library system to the Internet Archive strikes me as simply brilliant. In this case “fully open” is defined as “a product of the people: letting them create and curate its catalog, contribute to its content, participate in its governance, and have full, free access to its data. In an era where library data and Internet databases are being run by money-seeking companies behind closed doors, it’s more important than ever to be open.”

But simply building a new database wasn’t enough. We needed to build a new wiki to take advantage of it. So we built Infogami. Infogami is a cleaner, simpler wiki. But unlike other wikis, it has the flexibility to handle different classes of data. Most wikis only let you store unstructured pages — big blocks of text. Infogami lets you store semistructured data…

Each infogami page (i.e. something with a URL) has an associated type. Each type contains a schema that states what fields can be used with it and what format those fields are in. Those are used to generate view and edit templates which can then be further customized as a particular type requires.

The result, as you can see on the Open Library site, is that one wiki contains pages that represent books, pages that represent authors, and pages that are simply wiki pages, each with their own distinct look and edit templates and set of data.
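The description above amounts to a small typed-record system layered on a wiki: each page type declares the fields its pages may carry, and the schema drives the view and edit templates. Purely as an illustration of that idea – this is not Infogami’s actual API or schema format – it might be modelled like this:

    # An illustrative sketch of the "typed wiki page" idea described above:
    # each type declares the fields its pages may carry. This is not
    # Infogami's real schema format, just a toy model of the concept.
    SCHEMAS = {
        "book":   {"title": str, "authors": list, "publish_year": int},
        "author": {"name": str, "birth_year": int},
        "page":   {"body": str},
    }

    def validate(page_type: str, fields: dict) -> dict:
        """Check a page's fields against its type's schema and return the record."""
        schema = SCHEMAS[page_type]
        for name, value in fields.items():
            if name not in schema:
                raise ValueError(f"{page_type!r} pages have no field {name!r}")
            if not isinstance(value, schema[name]):
                raise TypeError(f"{name!r} should be {schema[name].__name__}")
        return {"type": page_type, **fields}

    print(validate("book", {"title": "Odyssey", "authors": ["Homer"], "publish_year": 1614}))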

English-Latin-English dictionaries

Monday, October 1st, 2007

from the mailbag:

My name is Silvio and I’ve recently concluded a set of English-Latin-English dictionaries which I thought you could be interested in sharing with your site’s visitors. The dictionaries provide clear and precise translations and are absolutely free of charge.

Latin Dictionary:

If you have any feedback on them, I’d be happy to hear.


Silvio Branco

(Note: I cannot vouch for these dictionaries but simply pass along the announcement.) 

Two new blogs

Thursday, September 27th, 2007
  • Tom Elliott, Horothesia: thoughts and comments across the boundaries of computing, ancient history, epigraphy and geography.
  • Shawn Graham, Electric Archaeology: Digital Media for Learning and Research.  Agent based modeling, games, virtual worlds, and online education for archaeology and history.

Cuneiform Digital Library Initiative and Digital Library Program of UCLA

Wednesday, September 26th, 2007

The Cuneiform Digital Library Initiative and the Digital Library Program of the University of California, Los Angeles, are pleased to announce their successful proposal to the Institute for Museum and Library Services program “National Leadership Grants: Building Digital Resources” for funding of a two-year project dedicated to improving data management and archiving tools in Humanities research.

Project Title: “Cuneiform Digital Library Initiative: Second Generation”

The UCLA University Library and UCLA’s Department of Near Eastern Languages and Cultures will create the Cuneiform Digital Library Initiative: Second Generation (CDLI 2). The project will migrate 450,000 legacy archival and access images and metadata from CDLI to UCLA’s Digital Library Content System, standardizing and upgrading the metadata to improve discovery and enable content archiving within the California Digital Library’s Digital Preservation Repository. The project will add 7,000 digital artifacts with cuneiform inscriptions, including collections housed at the University of Chicago’s Oriental Institute and in Syrian national museums. This project will ensure the long-term preservation of text inscribed on endangered ancient cuneiform tablets. (see the IMLS notice of grants in this cycle)

Principal Investigators:

Stephen Davison
Robert K. Englund

Virtual London shelved as OS refuse to license data to Google

Wednesday, August 29th, 2007

Seen in last week’s New Scientist:

A 3D software model of London containing over 3 million buildings in photorealistic detail is now unlikely to reach the public because of a dispute between a UK government agency and Google.

The full article requires subscription, but the long and short of it is that Google wanted to incorporate the Ordnance Survey-derived data from the Centre for Advanced Spatial Analysis (at UCL) into Google Earth, and were negotiating for a one-off license fee to cover the rights. However, the British agency Ordnance Survey refused to license their data except under terms requiring payments based on the number of users. Some mapping and visualisation experts fear that this is more significant than a simple failure of two commercial entities to reach an agreement.

Timothy Foresman, director-general of the fifth International Symposium on Digital Earth in San Francisco in June, fears that OS’s decision could set a precedent: “The OS model is a dinosaur,” he says. “If the UK community doesn’t band together and make this a cause célèbre, then they will find the road is blocked as further uses [of the OS data] become known.”

E-Science, Imaging Technology and Ancient Documents

Wednesday, August 22nd, 2007

Seen on and forwarded from the Classicists mailing list:




Sub-Faculty of Ancient History

E-Science, Imaging Technology and
Ancient Documents

Applications are invited for two posts for which funding has been secured through the AHRC-EPSRC-JISC Arts and Humanities E-Science initiative to support research on the application of Information Technology to ancient documents. Both posts are attached to a project which will develop a networked software system that can support the imaging, documentation, and interpretation of damaged texts from the ancient world, principally Greek and Latin papyri, inscriptions and writing tablets. The work will be conducted under the supervision of Professors Alan Bowman FBA and Sir Michael Brady FRS FREng (University of Oxford) and Dr. Melissa Terras (University College London).

1. A Doctoral Studentship for a period of 4 years from 1 January, 2008. The studentship will be held in the Faculty of Classics (Sub-Faculty of Ancient History) and supported at the Centre for the Study of Ancient Documents and the Oxford E-Research Centre. The Studentship award covers both the cost of tuition fees at Home/EU rates and a maintenance grant. To be eligible for a full award, the student must have been ordinarily resident in the UK for a period of 3 years before the start of the award.

2. A postdoctoral Research Assistantship for a period of 3 years from 1 January, 2008. The post will be held in the Faculty of Classics (Sub-Faculty of Ancient History) and supported at the Centre for the Study of Ancient Documents and the Oxford E-Research Centre. The salary will be in the range of £26,666 – £31,840 p.a. Applicants must have expertise in programming and Informatics and an interest in the application of imaging technology and signal-processing to manuscripts and documents.

The deadline for receipt of applications is 21 September 2007. Further details about both posts, the project, the qualifications required and the method of application are available from Ms Ghislaine Rowe, Graduate Studies Administrator, Ioannou Centre for Classical and Byzantine Studies, 66 St Giles’, Oxford OX1 3LU (01865 288397). It is hoped that interviews will be held and the appointments made on 11 October.

Professor Alan Bowman
Camden Professor of Ancient History
Brasenose College,
Oxford OX1 4AJ
+44 (0)1865 277874

Director, Centre for the Study of Ancient Documents
The Stelios Ioannou School for Research in Classical and Byzantine Studies
66 St Giles’
Oxford OX1 3LU
+44 (0)1865 610227

The Common Information Environment and Creative Commons

Sunday, August 5th, 2007

Seen on the Creative Commons blog:

A study titled “The Common Information Environment and Creative Commons” was funded by Becta, the British Library, DfES, JISC and the MLA on behalf of the Common Information Environment. The work was carried out by Intrallect and the AHRC Research Centre for studies in Intellectual Property and Technology Law and a report was produced in the Autumn of 2005. During the Common Information Environment study it was noted that there was considerable enthusiasm for the use of Creative Commons licences from both cultural heritage organisations and the educational and research community. In this study we aim to investigate if this enthusiasm is still strong and whether a significant number of cultural heritage organisations are publishing digital resources under open content licences.

(Full report.)

This is an interesting study worth watching, and hopefully the conclusions and recommendations will include advice on coherent legal positions with regard to Open Content licensing. (See the controversy surrounding yesterday’s post.)

UK JISC Digitisation Conference 2007

Wednesday, August 1st, 2007

Joint Information Systems Committee

Copied from JISC Digitisation Blog

“In July 2007 JISC held a two-day digitisation conference in Cardiff and the event was live blogged and podcasted. Here you can find links to all the resources from the conference, from Powerpoint presentations and audio to the live reports and conference wiki.”

The link to the blog, which has audio, PowerPoint presentations, and PDFs from the wide range of speakers:

There is much there about building digital content and e-resources.

More can be found about the JISC digitisation programme at:

Electronic corpora of ancient languages

Wednesday, July 25th, 2007

Posted to the Digital Classicist list (from ancientpglinguistics) by Kalle Korhonen:

Electronic corpora of ancient languages

International Conference
Prague (Czech Republic), November 16-17th, 2007
Call for papers

Aims of conference

Electronic corpora of ancient languages offer important information about the culture and history of ancient civilizations, but at the same time they constitute a valuable source of linguistic information. The scholarly community is increasingly aware of the importance of computer-aided analysis of these corpora, and of the rewards it can bring. The construction of electronic corpora for ancient languages is a complex task. Many more pieces of information have to be taken into account than for living languages, e.g. the artefact bearing the text, lacunae, level of restoration, etc. The electronic corpora can be enriched with links to images, annotations, and other secondary sources. The annotations should deal with matters such as textual damage, possible variant readings, etc., as well as with many features specific to ancient languages. (more…)

Chiron pool at Flickr

Wednesday, June 27th, 2007

Alun Salt notes

Recently the 5000th photo was uploaded to the Chiron pool at Flickr. That’s over 5000 photos connected to antiquity which you can pick up and use in presentations or blogs for free. It’s due in no small part to the submissions by Ovando and MHarrsch, but there’s 130 other members. It’s a simple interface and an excellent example of what you can do with Flickr.

You can see the latest additions to Chiron in the photobar at the top of the page and you can visit the website of the people who had such a good idea at Chironweb.

Forthcoming lectures on arts and humanities e-science

Wednesday, June 27th, 2007

Forwarded from AHeSSC, the Arts and Humanities e-Science Support Centre

The next lectures in the e-Science in the Arts and Humanities Theme begin next week. The Theme, organized by the Arts and Humanities e-Science Support Centre (AHeSSC) and hosted by the e-Science Institute in Edinburgh, aims to explore the new challenges for research in the Arts and Humanities and to define the new research agenda that is made possible by e-Science technology.

The lectures are:

Monday 2 July: Grid Enabling Humanities Datasets

Friday 6 July: e-Science and Performance

Monday 23 July: Aspects of Space and Time in Humanities e-Science

In all cases it will be possible to view the lecture on webcast, and to ask questions or contribute to the debate in real time via the blog feature. Please visit E-Science_in_the_Arts_and_Humanities, and follow the ‘Ask questions during the lecture’ link for more information about the blog, and the ‘More details’ link for more information about the events themselves and the webcasts.

AHeSSC forms a critical part of the AHRC-JISC initiative on e-Science in Arts and Humanities research. The Centre is hosted by King’s College London and located at the Arts and Humanities Data Service (AHDS) and the AHRC Methods Network. AHeSSC exists to support, co-ordinate and promote e-Science in all arts and humanities disciplines, and to liaise with the e-Science and e-Social Science communities, computing, and information sciences.

Please contact Stuart Dunn (stuart.dunn[at] or Tobias Blanke
(tobias.blanke[at] at AHeSSC for more information.

100+ million word corpus of American English (1920s-2000s)

Monday, June 25th, 2007

Saw this on Humanist. Anything out there and also freely available for UK English?

A new 100+ million word corpus of American English (1920s-2000s) is now freely available at:

The corpus is based on more than 275,000 articles in TIME magazine from 1923 to 2006, and it contains articles on a wide range of topics – domestic and international, sports, financial, cultural, entertainment, personal interest, etc.

The architecture and interface are similar to the one that we have created for our version of the British National Corpus, and they allow users to:

— Find the frequency of particular words, phrases, substrings (prefixes, suffixes, roots) in each decade from the 1920s-2000s. Users can also limit the results by frequency in any set of years or decades. They can also see charts that show the totals for all matching strings in each decade (1920s-2000s), as well as each year within a given decade.

— Study changes in syntax since the 1920s. The corpus has been tagged for part of speech with CLAWS (the same tagger used for the BNC), and users can easily carry out searches like the following (from among endless possibilities): changes in the overall frequency of “going + to + V”, or “end up V-ing”, or preposition stranding (e.g. “[VV*] with .”), or phrasal verbs (1920s-1940s vs 1980s-2000s).

— Look at changes in collocates to investigate semantic shifts during the past 80 years. Users can find collocates up to 10 words to left or right of node word, and sort and limit by frequency in any set of years or decades.

— As mentioned, the interface is designed to easily permit comparisons between different sets of decades or years. For example, with one simple query users could find words ending in -dom that are much more frequent 1920s-40s than 1980s-1990s, nouns occurring with “hard” in 1940s-50s but not in the 1960s, adjectives that are more common 2003-06 than 2000-02, or phrasal verbs whose usage increases markedly after the 1950s, etc.

— Users can easily create customized lists (semantically-related words, specialized part of speech category, morphologically-related words, etc), and then use these lists directly as part of the query syntax.


For more information, please contact Mark Davies (, or visit:

for information and links to related corpora, including the upcoming BYU American National Corpus [BANC] (350+ million words, 1990-2007+).

—– Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
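An aside on the collocate searches described in the announcement: the idea is simply to count which words occur within a fixed window of a node word, optionally split by period, and then compare the resulting frequency lists. The toy sketch below shows the mechanics only; the sample sentences are invented and the code has nothing to do with the corpus’s own interface.

    from collections import Counter

    # Toy collocate counting: which words appear within +/- WINDOW tokens of a
    # node word? The two "decades" and their sentences are invented examples.
    WINDOW = 3
    corpus = {
        "1920s": ["the hard work of the farm began at dawn",
                  "times were hard and money was scarce"],
        "1980s": ["the hard disk stored the new software",
                  "a hard bargain was driven over the contract"],
    }

    def collocates(sentences, node):
        counts = Counter()
        for sentence in sentences:
            tokens = sentence.split()
            for i, tok in enumerate(tokens):
                if tok == node:
                    window = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
                    counts.update(window)
        return counts

    for decade, sentences in corpus.items():
        print(decade, collocates(sentences, "hard").most_common(3))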

Promise and challenge: augmenting places with sources

Tuesday, June 19th, 2007

Bill Turkel has some very interesting things to say about “the widespread digitization of historical sources” and — near and dear to my heart — “augmenting places with sources”:

The last paragraph in “Seeing There” resonated especially, given what we’re trying to do with Pleiades:

The widespread digitization of historical sources raises the question of what kinds of top-level views we can have into the past. Obviously it’s possible to visit an archive in real life or in Second Life, and easy to imagine locating the archive in Google Earth. It is also possible to geocode sources, link each to the places to which it relates or refers. Some of this will be done manually and accurately, some automatically with a lower degree of accuracy. Augmenting places with sources, however, raises new questions about selectivity. Without some way of filtering or making sense of these place-based records, what we’ll end up with at best will be an overview, and not topsight.

There’s an ecosystem of digital scholarship building. And I’m not talking about SOAP, RDF or OGC. I’m talking about generic function and effect …  Is your digital publication epigraphic? Papyrological? Literary? Archaeological? Numismatic? Encyclopedic? A lumbering giant library book hoover? Your/my data is our/your metadata (if we/you eschew walls and fences). When we all cite each other and remix each other’s data in ways that software agents can exploit, what new visualizations/abstractions/interpretations will arise to empower the next generation of scholarly inquiry? Stay tuned (and plug in)!