It’s a shame the JPEG 2000 bandwagon has been creeping along at such a slow pace, but this seems like good news from the LOC.
Archive for the ‘Tools’ Category
from Google Blogoscoped:
Google switched to a new translation system for the remaining language pairs on the Google Translator which were so far provided by Systran. The translator help files don’t mention this yet, but it might be possible that the new translations are the results of Google’s in-house machine translation efforts.
In a quick comparison between Systran translation and Google translation of the English <--> German language pair, I couldn’t see a clear winner yet (though I get the feeling Google’s results are slightly superior), but a lot of garbage results on both ends. Translating a sample blog post into German, for instance, was so bad that you’d have a hard time making any sense out of what was written if you don’t speak English. While it might help to get the point across for some texts, you start to wonder if these kind of translations in the wild will cause more understanding in the world, or more misunderstanding.
from Shawn Graham’s Electric Archaeology:
From an archaeological point of view, creating 3d representations of a site using Sketchup, and then moving that with the terrain into an online world, with the associated annotations etc could really be revolutionary – what immediately springs to mind is that this would make a far better way of publishing a site than a traditional monograph. Internet Archaeology (the journal) has been trying for just that kind of thing for a while. Maybe IA should host a world in Multiverse…?
from Kathleen Fitzpatrick, “CommentPress: New (Social) Structures for New (Networked) Texts,” Journal of Electronic Publishing, Fall 2007:
… CommentPress demonstrates the fruitfulness of reimagining the technologies of electronic publishing in service to the social interconnections of authors and readers. The success of the electronic publishing ventures of the future will likely hinge on the liveliness of the conversations and interactions that they can produce, and the further new writing that those interactions can inspire. CommentPress grows out of an understanding that the chief problem involved in creating the future of the book is not simply placing the words on the screen, but structuring their delivery in an engaging manner; the issue of engagement, moreover, is not simply about locating the text within the technological network, but also, and primarily, about locating it within the social network. These are the problems that developers must focus on in seeking the electronic form that can not just rival but outdo the codex, as a form that invites the reader in, that acknowledges that the reader wants to respond, and that understands all publication as part of an ongoing series of public conversations, conducted in multiple time registers, across multiple texts. Making those conversations as accessible and inviting as possible should be the goal in imagining the textual communications circuit of the future.
posted to the TEI list
Wiki2Tei converter 1.0
We are pleased to announce the first release of the Wiki2Tei software. Wiki2Tei is a converter from the mediawiki format to XML (TEI vocabulary).
The mediawiki format is used by Wikimedia Foundation wikis (Wikipedia, Wikibooks, Wikisource), and many other wikis using the mediawiki software. Large amounts of free high-quality structured texts are available in this format. These texts are used more and more often in NLP (natural language processing) projects. However, the mediawiki parser is oriented towards rendition and the mediawiki syntax is complex and hard to parse.
The Wiki2Tei converter makes available the information contained in wiki syntax (structuration, highlighting, etc.), and allows the plain text to be properly retrieved. This conversion is intended to preserve all the properties of the original text. Wiki2Tei is closely coupled with the mediawiki software, allowing it to convert all the features of the mediawiki syntax.
The Wiki2Tei converter provides a rich set of tools for converting mediawiki text from several sources (file, mediawiki database) and managing collections of files to be converted. The TEI vocabulary used is documented, according to the TEI Guidelines, in an ODD document. The code is open source and may be downloaded from the SourceForge download area:
The web site contains full documentation and a “demo”:
A mailing list is open:
A nice visual overview of the purposes and mechanisms for version control, from Better Explained.
Inside Google Book Search offers an update of “New ways to dig into Book Search.”
The Cuneiform Digital Library Initiative and the Digital Library Program of the University of California, Los Angeles, are pleased to announce their successful proposal to the Institute for Museum and Library Services program “National Leadership Grants: Building Digital Resources” for funding of a two-year project dedicated to improving data management and archiving tools in Humanities research.
Project Title: “Cuneiform Digital Library Initiative: Second Generation”
The UCLA University Library and UCLA’s Department of Near Eastern Languages and Cultures will create the Cuneiform Digital Library Initiative: Second Generation (CDLI 2). The project will migrate 450,000 legacy archival and access images and metadata from CDLI to UCLA’s Digital Library Content System, standardizing and upgrading the metadata to improve discovery and enable content archiving within the California Digital Library’s Digital Preservation Repository. The project will add 7,000 digital artifacts with cuneiform inscriptions, including collections housed at the University of Chicago’s Oriental Institute and in Syrian national museums. This project will ensure the long-term preservation of text inscribed on endangered ancient cuneiform tablets. (see the IMLS notice of grants in this cycle)
Robert K. Englund
Hearing Mojo is not happy:
I can’t believe Apple failed to make its iPhone compatible with either hearing aids or cochlear implants. I’m in the market for a mobile phone again and just discovered the lack of compatibility. Given all the hype surrounding the iPhone launch, I’m surprised there haven’t been more complaints, other than the strong objection I just found on Paula Rosenthal’s HearingExchange site, some chatter on Apple forums, and a complaint made to the FCC by the Hearing Loss Association of America. HLAA has done the most advocacy for hearing-aid compatibility (HAC) regulations, which now mandate 50 percent of manufacturers’ handsets meet minimum M3 compatibility standards. The M3 and M4 ratings mean there’s no buzzing when you listen to the phone with your hearing-aid microphone on, and T3 and T4 ratings mean the phone works with the telecoils in your hearing aids. But according to the HLAA complaint: “Apple has now entered the scene and is predicted to shake up the entire wireless industry. Yet they are not, nor have ever been, involved in any discussions regarding HAC requirements.” Steve Jobs is known for his arrogance and inflexibility when it comes to the design of his products. Apple’s treatment of the hearing-impaired population is a great example. What a disappointment.
The ultra-powerful I22 Non-crystalline Diffraction beamline (as best as I understand it an application of the laser particle accelerator that produces highly concentrated pure light for scanning at nanoscopic resolutions) is being applied to the reading of damaged parchment and other ancient and at-risk documents. The synchrotron can analyse the condition of collagen in paper or vellum and determine the patterns of any potentially corrosive ink; this is particularly valuable in cases of very fragile texts, such as those partially eaten away by iron gall ink, or ancient desiccated manuscripts such as the Dead Sea Scrolls.
I first heard about this story–albeit in very vague terms–at a party last night, and I have to say that my first reaction was disbelief. I assumed that the speaker (neither a digital humanist nor a manuscript scholar) had misunderstood or misrepresented the story of a particle accelerator the size of four football pitches being used to read the Dead Sea Scrolls. Surely the expense involved would just never be spent on something as niche as manuscript studies? (Not to mention that I know excellent results are already being achieved using standard medical imaging technology.) I apologise to my nameless source for my lack of faith. I guess I need reminding occasionally that even people with big and expensive fish to fry can share our obsession with digital and humanistic concerns.
Why I gave up on my university’s email years ago:
Along with the neat-o peripheral gizmos like messaging, calendars, and collaboration tools, the outsourced systems are more stable, have better spam filters, and provide much more storage space than the typical university’s in-house system.
Seemed like a no-brainer… (Colleges Outsource E-mail to Big Players, U.S.News & World Report)
From liquidicity, keyboard shortcuts for about every character key available on a Mac.
A very interesting site has been doing the rounds of news and blogs lately, which allows users to trace anonymous edits of Wikipedia articles by comparing to the public record of registered IP addresses. The WikiScanner is itself neutral as to the kind of searches one may carry out–it merely accesses and mashes-up information from two publicly available sources–but many of the most public implementations (such as those collected by Wired magazine) have been political, moral, or salacious. So, for example, users with an IP address registered to the office of a given religious organisation might be shown to have “anonymously” edited the Wikipedia entry on that religion, whitewashed crimes or scandals, or slandered rival groups or individuals of their own organisation. (All this by way of example only–actual instances you can look up for yourself.)
This is not only an interesting and imaginative example of a mashup, but also a potentially very useful control on one of the biggest threats to Wikipedia’s much-vaunted “neutral point of view”–namely the ability of wealthy corporations or individuals to hire lobbyists and PR agencies to clean up their profile on the web. More transparency means more accountability means more reliable information. Potentially. Effectively this tool removes the ability to edit completely anonymously, without raising the bar to entry in the Wiki community by requiring registration and identification.
I’ve yet to find any interesting academic examples of biased “anonymous” edits–and I guess they’d be hard to pin down because the range of IPs registered to a university would typically include lab workstations and other machines accessible by a large number of people. I’m sure something interesting will turn up, however. Keep looking.
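To make the mechanism concrete, here is a minimal sketch of the kind of join described above: anonymous edits carry an IP address, and a public registry says which organisation each address block belongs to. It is not WikiScanner's actual code. The names `REGISTERED_BLOCKS`, `ANONYMOUS_EDITS`, `owner_of` and `attribute_edits` are made up for illustration, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical registry data: address block -> registered organisation.
REGISTERED_BLOCKS = {
    "192.0.2.0/24": "Example Corp",
    "198.51.100.0/24": "Example University",
}

# Hypothetical anonymous edits: (article title, editor IP address).
ANONYMOUS_EDITS = [
    ("Example Corp", "192.0.2.17"),
    ("Some Other Article", "203.0.113.5"),
    ("Example University", "198.51.100.200"),
]


def owner_of(ip, blocks):
    """Return the organisation whose registered block contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for cidr, organisation in blocks.items():
        if addr in ipaddress.ip_network(cidr):
            return organisation
    return None


def attribute_edits(edits, blocks):
    """Pair each anonymous edit with the registered owner of its IP address."""
    return [(article, ip, owner_of(ip, blocks)) for article, ip in edits]


if __name__ == "__main__":
    for article, ip, organisation in attribute_edits(ANONYMOUS_EDITS, REGISTERED_BLOCKS):
        print(f"{article!r} edited from {ip}: {organisation or 'unknown owner'}")
```

As the university example suggests, a match only says which organisation the address is registered to, not who was sitting at the machine.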
The No Thick Manuals wiki details how to learn a language efficiently using two free, open source applications. The first is jVLT (java Vocabulary Learning Tool), a completely cross platform flash card application. The second is StarDict, a Windows/Linux-only dictionary that provides an impressive array of features and dictionaries. Granted, most of us would require some textbooks and/or audio supplements, but anyone learning a language needs a good dictionary and some flash cards, and these free desktop applications sound a lot simpler than making flash cards by hand and manually looking up words in your dictionary.
CommentPress is a free theme for the WordPress blogging engine that allows readers to comment paragraph by paragraph in the margins of a text. Annotate, gloss, workshop, debate: with CommentPress you can do all of these things on a finer-grained level, turning a document into a conversation. It can be applied to a fixed document (paper/essay/book etc.) or to a running blog. CommentPress was developed by the Institute for the Future of the Book “to enable social interaction around long-form texts.” Some of the possibilities:
- scholarly contexts: working papers, conferences, annotation projects, journals, collaborative glosses
- educational: virtual classroom discussion around readings, study groups
- journalism/public advocacy/networked democracy: social assessment and public dissection of government or corporate documents, cutting through opaque language and spin
- creative writing: workshopping story drafts, collaborative storytelling
- recreational: social reading, book clubs
Update: University Publishing In A Digital Age now set up for social annotation.
Forwarded from AHeSSC (Arts and Humanities e-Science Support Centre)
The next lectures in the e-Science in the Arts and Humanities Theme (see http://www.ahessc.ac.uk/theme) begin next week. The Theme, organized by the Arts and Humanities e-Science Support Centre (AHeSSC) and hosted by the e-Science Institute in Edinburgh, aims to explore the new challenges for research in the Arts and Humanities and to define the new research agenda that is made possible by e-Science technology.
The lectures are:
Monday 2 July: Grid Enabling Humanities Datasets
Friday 6 July: e-Science and Performance
Monday 23 July: Aspects of Space and Time in Humanities e-Science
In all cases it will be possible to view the lecture on webcast, and to ask questions or contribute to the debate, in real time via the arts-humanities.net blog feature. Please visit http://wiki.esi.ac.uk/E-Science_in_the_Arts_and_Humanities, and follow the ‘Ask questions during the lecture’ link for more information about the blog, and the ‘More details’ link for more information about the events themselves and the webcasts.
AHeSSC forms a critical part of the AHRC-JISC initiative on e-Science in Arts and Humanities research. The Centre is hosted by King’s College London and located at the Arts and Humanities Data Service (AHDS) and the AHRC Methods Network. AHeSSC exists to support, co-ordinate and promote e-Science in all arts and humanities disciplines, and to liaise with the e-Science and e-Social Science communities, computing, and information sciences.
Please contact Stuart Dunn (stuart.dunn[at]kcl.ac.uk) or Tobias Blanke (tobias.blanke[at]kcl.ac.uk) at AHeSSC for more information.
Shawn Graham (at the Electric Archaeology blog) has uploaded a copy of his paper, given at the recent Immersive Worlds conference at Brock. The paper can be downloaded (as a .wav) here: ‘On Second Lives and Past Lifes: Archaeological Thoughts on the Metaverse’ (via the EA post).
This is obviously a huge and very relevant topic at the moment, since the Digital Classicist Seminar in London yesterday was addressed by Timothy Hill under the title: ‘Wiser than the Undeceived? Past Worlds as Virtual Worlds in the Electronic Media’. (And Dunstan Lowe will also address recreational software in the same series in three weeks’ time.)
The New Scientist this week reports on the Encyclopedia of Life, a new, massive, collaborative, evolving resource to catalogue the 1.8 million known species of life on the planet. Although this is a biology resource and so, for example, has access to greater funding sources than most of us in the humanities dream of (E. O. Wilson has apparently already raised $50 million in set-up funds), a lot of the issues of collaborative research and publication, of evolving content, of citability, of authority, of copyright, of free access, and of winning the engagement of the research community as a whole are exactly the same as we face. It would serve us well to watch how this resource develops.
It is a truism that we can learn a lot from the way scientists conduct their research, as they are better-funded than we are. But, dare I say it, the builders of this project could also do worse than to consult and engage with digital humanists who have spent a lot of time thinking about innovative and robust solutions to these problems in ways that scientists have not necessarily had to.
This interesting post over at New Scientist Tech:
Bernie Krause has spent 40 years collecting over 3500 hours of sound recordings from all over the world, including bird and whale song and the crackle of melting glaciers. His company, Wild Sanctuary in Glen Ellen, California, has now created software to embed these sound files into the relevant locations in Google Earth. Just zoom in on your chosen spot and listen to local sounds.
“Our objective is to bring the world alive,” says Krause. “We have all the continents of the world, high mountains and low deserts.”
He hopes it will make virtual visitors more aware of the impact of human activity on the environment in the years since he began making and collecting the recordings. Users will be able to hear various modern-day sounds at a particular location, then travel back in time to compare them with the noises of decades gone by.
This is more than just a cool mashup of sounds with locations; the idea has repercussions in all sorts of departments, not least technical. At the end of the NS article is a note:
Another project, called Freesound, is making contributors’ sound files available on Google Earth. Unlike these recordings, Krause’s sound files are of a consistent quality and enriched with time, date and weather information.
Freesound is a Creative Commons site and more interesting from the Web 2.0 perspective, as content is freely user-generated. What is exciting is the way that sites can make all sorts of media available through georeferences in Google Earth/Maps now (as for example the Pleiades Project are doing with classical sites). The question will be how such rich results are filtered: will Google provide overlays that filter by more than just keywords, or will third-party sites (like Wild Sanctuary and Pleiades) need to create web services that take advantage of the open technologies but provide their own filters? (Tom can probably answer these questions already…)
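For what it is worth, the third-party option might look something like the sketch below: a service holds its own georeferenced records, applies its own filter (here a bounding box plus a cut-off year), and hands the survivors to Google Earth as KML placemarks. This is only an illustration under assumptions. The `Recording` fields, the sample data and the function names are all invented, and neither Wild Sanctuary nor Pleiades is known to work this way.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

KML_NS = "http://www.opengis.net/kml/2.2"


@dataclass
class Recording:
    title: str
    lat: float
    lon: float
    year: int
    url: str  # hypothetical link to the sound file


def filter_recordings(recordings, south, west, north, east, min_year=None):
    """Keep recordings inside the bounding box and not older than min_year."""
    return [
        r for r in recordings
        if south <= r.lat <= north
        and west <= r.lon <= east
        and (min_year is None or r.year >= min_year)
    ]


def to_kml(recordings):
    """Serialise recordings as a KML document with one Placemark each."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for r in recordings:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = r.title
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = f"{r.year}: {r.url}"
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML writes coordinates as longitude,latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{r.lon},{r.lat}"
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        Recording("Dawn chorus", 38.4, -122.5, 1985, "http://example.org/dawn.mp3"),
        Recording("Glacier melt", 61.2, -149.9, 2005, "http://example.org/ice.mp3"),
    ]
    print(to_kml(filter_recordings(sample, 30, -130, 45, -110, min_year=1980)))
```

The filter is ordinary code on the provider's side, so it can use whatever metadata the provider has (date, weather, recording quality) rather than keywords alone.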
Bill Turkel has started what looks to be an important and potentially influential thread on the nexus of history and the digital. His opening salvo:
Teaching history students how to use computers was a really good idea in the early 1980s. It’s not anymore. Students who were born in 1983 have already graduated from college. If they didn’t pick up the rudiments of word processing and spreadsheet and database use along the way, that’s tragic. But if we concentrate on teaching those things now, we’ll be preparing our students for the brave new world of 1983.
Posts so far:
Type Greek is a web-based software tool that converts text from a standard keyboard into beautiful, polytonic Greek characters as you type. Using an easy-to-learn and standardized system called beta code, TypeGreek converts your keystrokes into Unicode-compliant Greek in real-time… The TypeGreek code is released under a Creative Commons license, so you are free to download it, modify it, or host it on your own site.
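For anyone curious how a beta code converter can work, here is a minimal sketch. It is not TypeGreek's own code and makes several simplifying assumptions: lowercase letters only, diacritics typed after the letter they modify, and a crude rule that turns a word-final sigma into ς. The names `LETTERS`, `MARKS` and `beta_to_greek` are illustrative. The trick is to emit base letters followed by combining marks and let Unicode NFC normalisation compose them into polytonic characters.

```python
import re
import unicodedata

# Beta code Latin letters and their Greek equivalents (lowercase only).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta code diacritic keys mapped to Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_greek(beta):
    """Convert lowercase beta code to precomposed Unicode Greek."""
    raw = "".join(LETTERS.get(ch, MARKS.get(ch, ch)) for ch in beta.lower())
    # NFC composes each base letter with the combining marks that follow it.
    composed = unicodedata.normalize("NFC", raw)
    # Simplification: a sigma at the end of a word becomes final sigma.
    return re.sub(r"σ\b", "ς", composed)


if __name__ == "__main__":
    print(beta_to_greek("lo/gos"))
    print(beta_to_greek("mh=nin a)/eide qea/"))
```

Because every parenthesis, slash and equals sign is read as a diacritic, this sketch would mangle ordinary punctuation, so a real tool needs more context than this.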
An interesting new project at Heidelberg:
“Propylaeum-DOK, the full-text server of Propylaeum, the Virtual Library of Classical Studies, is provided by Heidelberg University Library. The publication platform offers scholars worldwide the opportunity to make their publications from all fields of classical studies available on the web free of charge and in electronic form, in accordance with the principles of Open Access. The works are archived with standardized addresses (URN) and metadata (OAI-PMH) so that they remain permanently citable. They are thus searchable worldwide in various library catalogues and search engines.”
Seen on Humanist:
Announcing TAPoR version 1.0
We have just updated the Text Analysis Portal for Research (TAPoR) to version 1.0 and invite you to try it out.
The new version will not appear that different from previous versions. The main difference is that we are now tracking data about tool usage and have a survey that you can complete after trying the portal in order to learn more about text analysis in humanities
You can get a free account from the home page of the portal. If you want an introduction you can look at the following pages:
Streaming video tutorials are at
A tour, tutorial, and useful links are available on the home page,
Please try the new version and give us feedback.
Google has just announced work on OCRopus, which it says it hopes will ‘advance the state of the art in optical character recognition and related technologies.’ OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. ‘The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.’
The project is expected to run for three years and support three Ph.D. students or postdocs. We are announcing a technology preview release of the software under the Apache license (English-only, combining the Tesseract character recognizer with IUPR layout analysis and language modeling tools), with additional recognizers and functionality in future releases.
It would be interesting to learn how this application compares in accuracy and power with commercial OCR systems (which have apparently gotten much better since the days when I used to get very frustrated with Omnipage and the like).