Chapter 13 – Archiving challenges
formation is captured. For a resource to be well-documented, its character
encoding must be stated explicitly, so that software that understands the
format can select the correct algorithms for interpreting it
(see Chapter 14 for details).
For sound digitization, one major encoding format is linear PCM (Pulse
Code Modulation), which is widely used for high-quality material sampled at
44.1 or 48 kHz (or higher) with a resolution of 16 bits (or higher). Alternative
formats such as MP3 and ATRAC (MiniDisc) involve highly compressed
encodings. While the principles of compressed encoding may change over
time as technology develops, the direct digital linear PCM encoding will
not change. The interpretation of the corresponding bit streams is very
straightforward, which makes it the perfect choice for archiving. For further
discussion, see Wittenburg, Skiba et al. (2004).
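To illustrate how direct the interpretation of a linear PCM bit stream is, the following minimal Python sketch decodes 16-bit samples by hand and computes the uncompressed data rate. The function names and the sample values are illustrative assumptions, and the little-endian byte order is assumed as well (it is the order used in WAV files).

```python
import struct

def pcm_data_rate(sample_rate_hz, bits_per_sample, channels):
    """Bytes per second of uncompressed linear PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

def decode_pcm16(raw):
    """Interpret raw bytes as signed 16-bit little-endian samples."""
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))

# Three made-up sample values, packed into bytes and read back.
raw = struct.pack("<3h", 0, 1000, -1000)
samples = decode_pcm16(raw)

# Stereo material at 44.1 kHz and 16 bits per sample.
rate = pcm_data_rate(44100, 16, 2)
```

Each sample is simply a signed integer stored in a fixed number of bytes, so no codec-specific knowledge is needed to read the data back.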
For digital images, JPEG encoding is widely used nowadays; however, it
applies lossy compression to the original material. A high compression
factor blurs sharp lines and contrasts. TIFF can represent digital images
without compression, but it is not yet fully standardized. JPEG is openly
documented, and we can expect that the algorithm and the knowledge
needed to apply it will be available for many years to come. For the
future, we expect that more devices will provide direct digitized formats or
formats such as PNG that apply lossless compression.
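The difference between lossless and lossy compression can be sketched in a few lines of Python. The example uses the standard zlib module, which implements DEFLATE, the compression method that PNG also builds on. The quantization function is a toy stand-in chosen for illustration only; it is not the JPEG algorithm, and the pixel values are made up.

```python
import zlib

def lossless_roundtrip(data):
    """Compress with DEFLATE and decompress again."""
    return zlib.decompress(zlib.compress(data))

def lossy_quantize(data, step=16):
    """Toy lossy step: round each byte down to a multiple of `step`.

    Illustration only; this is not how JPEG works internally.
    """
    return bytes((b // step) * step for b in data)

pixels = bytes([0, 3, 200, 201, 255, 17])  # made-up grey values

restored = lossless_roundtrip(pixels)   # identical to the input
degraded = lossy_quantize(pixels)       # fine distinctions are lost
```

After the lossless round trip every byte is recovered exactly, whereas the quantized version can no longer distinguish neighbouring values such as 200 and 201.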
For a number of years to come, compressed formats will be the only
feasible choice for moving images. Currently, MPEG-2 is a commonly used
backend format for archiving. It can be derived from the DV format, which is
currently the most common format for digital video cameras on the
consumer and low-end professional market. Due to its wide distribution and
its open documentation, we can expect that MPEG-2 knowledge will be
available for many years. Nevertheless, new encoding methods will emerge
with the steady increase in available storage capacity and network bandwidth.
In general, we can state that for long-term preservation purposes it is
important (1) to rely on uncompressed and high-quality data representation
wherever possible; (2) to make sure that the encoding principles are simple
and well-documented; and (3) to ensure that the encoding standard is under
non-proprietary control. Many such widely accepted standards are available
today, and current trends show that more of them will be developed in
the near future.
Paul Trilsbeek and Peter Wittenburg
3.2.2. Text structures and file formats
When looking at multi-layered annotations or lexica, we find that characters
are embedded in structures and form interpretational units such as
words, glosses, part-of-speech indicators, and others. It makes sense, and
not only for computational reasons, to identify the structural components
explicitly by means of tags and a structure description language such as
XML. Complete documentation requires that the structure of textual
documents be made explicit and that all tags used to indicate structure
be documented. An XML schema, a RELAX NG schema, or a DTD is the
best way to define the structure of documents and to check the correctness
of the files. Yet we lack widely accepted generic schemas for highly
structured linguistic document types such as annotations and lexica. Until
organizations such as ISO finish their proposals for standards, archives have
to rely on a number of XML formats that are widely used (see Chapters 4
and 14 for details).
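As a rough illustration of what checking files against documented tags can mean in practice, the sketch below uses Python's standard xml.etree.ElementTree module to parse a small annotation fragment and report any tags that are not in a documented set. The tag names and the sample content are hypothetical. This is deliberately much weaker than validation against an XML schema, RELAX NG schema, or DTD, which would also check nesting, order, and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of documented tags for an annotation file.
DOCUMENTED_TAGS = {"annotation", "tier", "word", "gloss", "pos"}

def undocumented_tags(xml_text):
    """Return the sorted tag names that are not in DOCUMENTED_TAGS."""
    # fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter()} - DOCUMENTED_TAGS)

# Made-up sample with placeholder content and one undocumented tag.
SAMPLE = """<annotation>
  <tier>
    <word>form1</word><gloss>gloss1</gloss><pos>N</pos>
    <note>unclear</note>
  </tier>
</annotation>"""

problems = undocumented_tags(SAMPLE)
```

Here the `note` element is reported because nothing documents it, which is exactly the kind of gap that makes later interpretation uncertain.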
Closely related to the issue of text structure is the issue of file formats.
File formats define the way in which information is packaged. In general, the
file extension says something about the format of a file, but it is not very
reliable. Many file formats encode some format information in the header,
i.e. the first few bytes of a file. But in order to secure future
interpretability, file formats have to be explicitly documented.
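The following Python sketch shows how header bytes, rather than the file extension, can be used to guess a format. The byte signatures listed are the published ones for PNG, JPEG, TIFF, and RIFF/WAVE; the function name and the choice of formats are illustrative assumptions, and a real archive would rely on a much larger, documented signature registry.

```python
# Leading byte signatures of a few common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]

def sniff_format(header):
    """Guess a format from the first bytes of a file, or return None."""
    # RIFF containers carry their form type in bytes 8 to 11.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV"
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None

# Usage with a file on disk (the path is hypothetical):
# with open("recording.wav", "rb") as f:
#     print(sniff_format(f.read(12)))
```

A result of None means only that the header matches none of the listed signatures, which again shows why explicit format documentation is needed alongside such heuristics.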
3.2.3. Organizational aspects
In a language archive, many kinds of relations hold between resources.
The most relevant relations from an organizational point of
view are:
– resources documenting a certain language
– resources that were created during a certain field trip
– resources that share a certain genre
– resources covering different media (sound, video, etc.) pertaining to the
same recording
– transcriptions and other annotations that relate to a certain sound file
– a lexicon which was extracted from a number of annotations.
These relations may be obvious to the researcher who created the
documentation, but in an archive they have to be made explicit to
make the archive manageable and the information accessible to users. Only
explicit metadata descriptions accompanying each resource will be able to
provide the necessary information. Currently, there are two widely used
metadata sets for language resources, which serve somewhat different
purposes. The OLAC set (an extension of the Dublin Core set) was designed to
facilitate searching in integrated metadata domains. Its function is thus
quite similar to that of a catalogue in a large library. The IMDI metadata
tool already mentioned above is the result of intensive bottom-up discussions
within the language engineering and field linguistics communities. It was
designed to cover all the relations mentioned above and to support browsing,
searching, and the management of resources. It thus combines the
cataloguing function of metadata with the function of a corpus management
tool. It includes an extended set of metadata elements and enables the
creation of hierarchies and bundles. It is based on an XML schema comprising
definitions of the semantics of the elements used, and it has controlled
vocabularies associated with it so that a high degree of consistency can be
achieved. This is crucial for retrieval.
Figure 3 gives an example of a simplified IMDI corpus structure from
the DoBeS archive, showing how resources such as field notes can be
linked to corpus nodes. The resource metadata descriptions can be used to
bundle related resources such as a video and a sound file with all associated
annotations.
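The idea of corpus nodes and bundles can be sketched as a small data structure. The following Python example is not the IMDI schema; the class names, node names, and file names are all hypothetical and serve only to show how a hierarchy of nodes can lead to bundles that group related resources.

```python
from dataclasses import dataclass, field

@dataclass
class Bundle:
    """A metadata description that groups related resources."""
    name: str
    resources: list = field(default_factory=list)

@dataclass
class CorpusNode:
    """A node in the archive hierarchy."""
    name: str
    children: list = field(default_factory=list)
    bundles: list = field(default_factory=list)

    def walk(self, path=()):
        """Yield (path, bundle) pairs for this node and all descendants."""
        here = path + (self.name,)
        for bundle in self.bundles:
            yield "/".join(here), bundle
        for child in self.children:
            yield from child.walk(here)

# Hypothetical hierarchy: archive -> language -> genre -> session bundle.
session = Bundle("session1", ["session1.wav", "session1.mpg", "session1.xml"])
root = CorpusNode("Archive", children=[
    CorpusNode("LanguageA", children=[
        CorpusNode("Stories", bundles=[session]),
    ]),
])

found = list(root.walk())
```

Walking the tree returns each bundle together with its position in the hierarchy, so a video, a sound file, and their annotations can be retrieved as one unit.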
Figure 3. Example of a hierarchical organization of resources
[Figure labels: archive structure nodes (DOBES; Linguistic, Non-Linguistic; TRUMAI, TSAFIKI; Elicitations, Natural use; Stories, Conversation); resource metadata descriptions; resources (audio/video files, annotations, field notes, lexica, etc.); information files, field notes, etc.]