Long-term data preservation

December 14th, 2007 by Gabriel Bodard

There was an article in New Scientist last week on plans for permanent preservation of scientific data. The argument in the sciences seems to be that all data should be preserved, as some of it will be from experiments that are unrepeatable (in particular Earth observations, astronomy, particle accelerators, and other highly expensive projects that can produce petabytes of data). It is a common observation that any problems we have in the humanities, the sciences have in spades and will solve for us, but what is interesting here is that the big funding being thrown at this problem by the likes of the NSF, ESF, and the Alliance for Permanent Access is considered news. This is a recognised problem, and the sciences don’t have the solution yet… Grid and Supercomputing technologies are still developing.

(Interestingly, I have heard the argument made in the humanities that on the contrary, most data is a waste of space and should be thrown away because it will just make it more difficult for future researchers to find the important stuff among all the crap. Even in the context of archaeology, where one would have thought practitioners would be sensitive to the fragile nature of the materials and artefacts that we study, there is a school of thought that says our data–outside of actual publications–are just not important enough to preserve in the long term. Surely in the Googleverse finding what you want in a vast quantity of information is a problem with better solutions than throwing out stuff that you don’t think important and therefore cannot imagine anyone else finding interesting.)

Another important aspect of the preservation article is the observation that:

Even if the raw data survives, it is useless without the background information that gives it meaning.

We have made this argument often in Digital Humanities venues: raw data is not enough, we also need the software, the processing instructions, the scripts, and the presentation, search, and/or transformation scenarios that make this data meaningful for our interpretations and publications. This is in technical terms the equivalent of documenting experimental methodology to make sure that research results can be replicated, but it is also as essential as providing binary data and documenting the format so that this data can be interpreted as structured text (say).
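To make the point about documenting formats concrete, here is a minimal sketch in Python. It is only an illustration, not a description of any project's actual practice: the function names `package` and `restore` and the fields of the manifest are assumptions invented for this example. It stores a short Greek string as raw bytes alongside a small manifest recording the encoding, then shows that the same bytes read without that information come out as something else entirely.

```python
import json
from typing import Tuple


def package(text: str, encoding: str = "utf-8") -> Tuple[bytes, str]:
    """Return the raw bytes plus a JSON manifest saying how to read them.

    The manifest fields are hypothetical; a real archive would follow
    an agreed metadata standard.
    """
    raw = text.encode(encoding)
    manifest = json.dumps({"encoding": encoding, "format": "plain text"})
    return raw, manifest


def restore(raw: bytes, manifest: str) -> str:
    """Decode the raw bytes using the encoding recorded in the manifest."""
    info = json.loads(manifest)
    return raw.decode(info["encoding"])


original = "Ἀθηνᾶ"
raw, manifest = package(original)

# With the documentation, the bytes become the text we started with.
recovered = restore(raw, manifest)

# Without it, a reader has to guess; a wrong guess yields gibberish
# rather than an error, which is arguably worse.
misread = raw.decode("latin-1")

print(recovered == original)
print(misread == original)
```

The bytes are identical in both cases; only the recorded background information distinguishes a readable text from noise.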

It’s good to see that this is a documented issue and that large resources are being thrown at it. We shall watch their progress with great interest.
