318
Paul Trilsbeek and Peter Wittenburg
ture and content, therefore users can easily make errors when entering
data, which leads to inconsistencies in the archive and difficulties in
finding certain resources.
4. Many users are used to HTML-based web pages and like to see material
presented in this way. Archivists avoid storing material in an HTML
representation format since it is limited with respect to structural expres-
siveness and it mixes representation and presentation issues, i.e. it is bi-
ased towards certain users (see Chapter 14).
There is a basic difference that underlies most of these possible conflicts.
This is the difference between the preservation requirements for the long-
term uses of the information stored in the archive and the more short-term
exigencies of depositors and users. There is a concomitant difference be-
tween presentation and storage (or re-presentation) formats. The term “pre-
sentation” here refers to the way data are presented to users, i.e. it addresses
the surface form. The storage format pertains to the way data are stored.
This should be as neutral as possible with regard to different presentation
formats. That is, it should be coherently structured, its different information
types should be tagged explicitly, and it should make use of open, well-
documented, and widely accepted standards.
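The distinction can be illustrated with a minimal sketch, assuming a hypothetical XML storage record. The element names (`utterance`, `speaker`, `transcription`, `translation`) and the sample values are invented for illustration and are not taken from any actual archive schema. The same explicitly tagged record is rendered into two different presentation forms without the stored data being changed.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical storage record: every information type is tagged explicitly.
# Element names and values are illustrative assumptions only.
STORED = """
<utterance>
  <speaker>A</speaker>
  <transcription>sample transcription</transcription>
  <translation lang="en">sample translation</translation>
</utterance>
""".strip()


def fields(record: ET.Element) -> tuple:
    """Extract the tagged information types from the storage record."""
    return (
        record.findtext("speaker", default=""),
        record.findtext("transcription", default=""),
        record.findtext("translation", default=""),
    )


def to_html(record: ET.Element) -> str:
    """One possible presentation: an HTML fragment for web users."""
    speaker, text, gloss = (escape(value) for value in fields(record))
    return f"<p><b>{speaker}:</b> <i>{text}</i> ({gloss})</p>"


def to_plain(record: ET.Element) -> str:
    """Another presentation: a plain-text line, e.g. for printing."""
    speaker, text, gloss = fields(record)
    return f"{speaker}: {text} = {gloss}"


record = ET.fromstring(STORED)
print(to_html(record))
print(to_plain(record))
```

Because the presentation logic lives outside the stored record, a new presentation can be added later without touching the archived material.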
Storage formats address long-term preservation needs, while presentation
formats play a role in short-term access issues. We will now look more
closely at what is involved in this basic difference.
3. Long-term preservation requirements
Digital long-term archiving has to address two fundamental tasks:
– to ensure the survival of bit streams, which is threatened by the limited
life span of media carriers (tape, CD-ROM, etc.) and by all kinds of possible
disasters affecting such carriers;
– to ensure the interpretability of the information represented as bit
streams, including the preservation of the structure of the material.
The survival of bit streams, i.e. the basic binary patterns stored on a me-
dium, is of course crucial for the second problem. Given that a bit stream is
preserved, one could speculate that “data archeologists” will be able to de-
velop methods to interpret the data, even if the basic information on how to
decode the bit stream and how to reshape it into resources is lost.
Chapter 13 – Archiving challenges
319
3.1. Preserving bit streams
In contrast to the cuneiform characters on the clay tablets of the Sumerians,
the patterns stored on our current magnetic and opto-magnetic storage media
have a comparatively short life span. An average hard disk has a media
life span of four years; for CD-ROMs we see specifications of up to 30 years
for the accessibility of the stored patterns; and for other storage media the
expectations are of a similar order. This is all very short and cannot be
satisfactory when we speak about long-term preservation. With regard to language
archives covering several terabytes of data, however, there are no other
options at this point than to rely on the classical magnetic tape and disk
technologies, for practical and financial reasons.
Another factor that reduces the life span of the stored patterns on such
storage media can be found in the technological innovation cycle. In 30
years' time, only specialized institutions will be able to support old devices
and read today’s CD-ROMs, for example, since new technologies will be on
the market and old devices will no longer be supported by the industry.
Given a heavily reduced number of devices, some resources will no longer
be readable for the very simple reason that access to these devices will be
limited.
The current solution to counter problems relating to storage media is to
continuously and automatically migrate data to new storage media and
widely distribute these data. Copying data to newer technology helps to
overcome the limited media life span and can be done largely automatically
if planned very carefully. Importantly, the copying process has to start
some time before the old technology becomes unstable.
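A minimal sketch of one migration step is given here, assuming that both the old and the new storage media are mounted as ordinary directories. The function names `sha256_of` and `migrate_file` are hypothetical, and the use of SHA-256 checksums to verify the copy is an assumption of this sketch rather than a description of any particular archive's procedure.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, target_dir: Path) -> Path:
    """Copy one file to the new medium and verify the copy bit for bit.

    The source is left untouched; if the checksums differ, the faulty
    copy is removed and an error is raised.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copies content and file metadata
    if sha256_of(source) != sha256_of(target):
        target.unlink()
        raise OSError(f"checksum mismatch while migrating {source}")
    return target
```

A call such as `migrate_file(Path("old_medium/recording.wav"), Path("new_medium"))` (both paths hypothetical) returns the path of the verified copy. Keeping the old copy until the new one has been verified reflects the requirement that migration start while the old technology is still reliable.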
It is common knowledge that all kinds of disasters may occur: a disk can
become unreadable, a fire can destroy an entire computer center, etc. To
overcome these uncertainties we have to distribute copies of the data – a
strategy that was already applied to preserve books. However, in the digital
era it is easier to automatically create these copies and distribute them. Any
archive will apply both techniques within the archive as well as beyond.
Tests have to be carried out regularly to check whether the data exchange
protocols work correctly.
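Such a regular test could look roughly as follows. This is a sketch under stated assumptions: the location names are placeholders, the function `find_divergent_copies` is hypothetical, and treating the most common checksum as the reference is a simplification chosen for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def find_divergent_copies(copies: Dict[str, bytes]) -> List[str]:
    """Return the locations whose copy differs from the majority.

    `copies` maps a location name to the bytes held there. The checksum
    shared by most locations is taken as the reference value.
    """
    digests = {
        location: hashlib.sha256(data).hexdigest()
        for location, data in copies.items()
    }
    reference, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(
        location for location, digest in digests.items() if digest != reference
    )


# Placeholder locations and contents, for illustration only.
copies = {
    "site-a": b"archived resource",
    "site-b": b"archived resource",
    "site-c": b"archived resourcf",  # one damaged byte
}
print(find_divergent_copies(copies))
```

In practice only the checksums, not the data themselves, would need to be exchanged between locations, and a damaged copy would be restored from one of the intact ones.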
With regard to the DoBeS data, there are currently seven copies available
in four different locations (Nijmegen: 2, Munich: 2, Göttingen: 2, Leipzig:
1). Within the framework of the DELAMAN network (Digital Endangered
Language and Music Archive Network) it is intended to distribute the data
on a worldwide level.
3.2. Preserving interpretability
Even when we have ensured that the bit streams will survive, we will be
faced with the problem of readability and interpretability of the information
contained in the bit stream. We can distinguish four layers that are relevant
here:
– the technical encoding of signals such as characters, images, sounds,
and videos;
– the encoding of text structure;
– the packaging and structuring of encoded streams into files;
– information regarding the bundling of resources, i.e. the organizational
structure of a given documentation.
3.2.1. Technical encoding
We are used to being able to perceive signals of different types via displays
and loudspeakers. However, on computers these signals are all stored as bit
patterns and packaged into files. Hence, the question arises of how to ensure
that people 20 or even 500 years from now will still be able to tell what
kind of signal a given bit stream represents. The problem is visualized in
Figure 2. Does the shown bit stream encode a video sequence, does it en-
code Chinese characters, or does it encode some other type of information?
The bit stream itself does not reveal this.
Figure 2. The basic bit stream interpretation problem: What type of signal is en-
coded in a given stream?
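The interpretation problem can be made concrete with a short sketch. The six bytes below are chosen for illustration, and the function name `interpretations` is hypothetical. The same bytes are read in three ways: as UTF-8 text, as Latin-1 text, and as three 16-bit big-endian integers such as might represent audio samples. Nothing in the bytes themselves indicates which reading is intended.

```python
import struct
from typing import Dict

# Six bytes chosen for illustration; they happen to be valid UTF-8.
RAW = bytes([0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87])


def interpretations(raw: bytes) -> Dict[str, object]:
    """Decode the same bit stream under three different assumptions."""
    return {
        # Assumption 1: UTF-8 encoded text (here, two Chinese characters).
        "utf-8 text": raw.decode("utf-8"),
        # Assumption 2: Latin-1 text, where every byte is one character.
        "latin-1 text": raw.decode("latin-1"),
        # Assumption 3: unsigned 16-bit big-endian integers.
        "16-bit integers": struct.unpack(f">{len(raw) // 2}H", raw),
    }


for label, value in interpretations(RAW).items():
    print(f"{label}: {value!r}")
```

All three decodings succeed without error, which is why the encoding has to be recorded explicitly alongside the bit stream.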
In digital form, characters have to be stored in chunks of bits, video images
have to be digitized to represent the spatial and temporal information in
suitable ways, and sound files have to be encoded so that the relevant in-
[Figure 2 content: the bit stream 011001010100001010110100101010, with question marks indicating its unknown interpretation]