Essentials of Language Documentation

Chapter 13 – Archiving challenges

formation is captured. For a resource to be well documented, its character encoding must be specified, so that software that understands the format can select the appropriate algorithms to interpret it correctly (see Chapter 14 for details).
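
To illustrate why the encoding has to be documented, the following minimal Python sketch decodes the same stored bytes under two different encoding assumptions. The sample word and the choice of UTF-8 and Latin-1 are illustrative assumptions and do not come from the chapter.

```python
# Illustrative only: the sample word and the two encodings are assumptions.
text = "señor"
stored = text.encode("utf-8")      # the bytes as they would sit in the archive

correct = stored.decode("utf-8")   # encoding is documented: the text is recovered
wrong = stored.decode("latin-1")   # encoding is guessed wrongly: the text is garbled

print(correct)
print(wrong)
```

The second decoding raises no error; it silently produces different characters, which is why the encoding cannot safely be left to guesswork.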

For sound digitization, one major encoding format is linear PCM (Pulse Code Modulation), which is widely used for high-quality material sampled at 44.1 or 48 kHz (or higher) with a resolution of 16 bits (or higher). Alternative formats such as MP3 and ATRAC (MiniDisc) involve highly compressed encodings. While the principles of compressed encoding may change over time depending on technology, direct digital linear PCM encoding will not change. The interpretation of the corresponding bit streams is very straightforward, which makes it the perfect choice for archiving. For further discussion, see Wittenburg, Skiba et al. (2004).
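
As a rough illustration of how simple linear PCM parameters are to inspect, the sketch below reads the sampling rate and resolution of a WAV file with Python's standard `wave` module and derives the raw data rate. The function names, the threshold defaults, and the in-memory test file are assumptions made for this example; they are not part of any archive's actual workflow.

```python
import io
import wave


def describe_pcm(source):
    """Return basic parameters of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
    return {
        "sample_rate_hz": rate,
        "bits_per_sample": bits,
        "channels": channels,
        # Uncompressed data rate: samples/s * bytes/sample * channels
        "bytes_per_second": rate * (bits // 8) * channels,
    }


def meets_archive_quality(info, min_rate=44100, min_bits=16):
    """Check the parameters against assumed minimum values (44.1 kHz, 16 bits)."""
    return info["sample_rate_hz"] >= min_rate and info["bits_per_sample"] >= min_bits


def make_test_wav(rate=48000, bits=16, channels=1, frames=100):
    """Build a small zero-valued WAV file in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (bits // 8) * channels * frames)
    buf.seek(0)
    return buf


info = describe_pcm(make_test_wav())
print(info, meets_archive_quality(info))
```

At 44.1 kHz, 16 bits, and two channels, the same arithmetic gives 44100 × 2 × 2 = 176,400 bytes per second of uncompressed audio.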

For digital images, JPEG encoding is widely used nowadays; however, it applies lossy compression to the original material. A high compression factor blurs sharp lines and contrasts. TIFF is an uncompressed digital image representation format, but it is not yet fully standardized. JPEG is openly documented, and we can expect that the algorithm and the knowledge needed to apply it will remain available for many years to come. In the future, we expect that more devices will provide direct digitized formats or formats such as PNG that apply lossless compression.

For a number of years to come, compressed formats will be the only feasible choice for moving images. Currently, MPEG2 is a commonly used backend format for archiving. It can be derived from the DV format, which is currently the most common format for digital video cameras on the consumer and low-end professional market. Due to its wide distribution and open documentation, we can expect that MPEG2 knowledge will be available for many years. Nevertheless, new encoding methods will emerge with the steady increase in available storage capacity and network bandwidth.

In general, we can state that for long-term preservation purposes it is important (1) to rely on uncompressed, high-quality data representation wherever possible; (2) to make sure that the encoding principles are simple and well documented; and (3) to make sure that the encoding standard is under non-proprietary control. Many such widely accepted standards are available today, and current trends show that more of them will be developed in the near future.

Paul Trilsbeek and Peter Wittenburg

3.2.2. Text structures and file formats

When looking at multi-layered annotations or lexica, we find that characters are embedded in structures and form interpretational units such as words, glosses, part-of-speech indicators, and others. It makes sense, and not only for computational reasons, to identify the structural components explicitly by means of tags and a structure description language such as XML. Complete documentation requires that the structure of textual documents be made explicit and that all tags used to indicate structure be documented. An XML schema, a RelaxNG schema, or a DTD is the best way to define the structure of documents and to check the correctness of the files. Yet we lack widely accepted generic schemas for highly structured linguistic document types such as annotations and lexica. Until organizations such as ISO finish their proposals for standards, archives have to rely on a number of XML formats that are widely used (see Chapters 4 and 14 for details).
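
The idea of explicit, documented tags can be sketched in a few lines of Python. The element names (`annotation`, `word`, `form`, `gloss`, `pos`) and the sample content are invented for illustration and do not follow any particular annotation format. The check below is also much weaker than validation against an XML schema, RelaxNG schema, or DTD: it only reports tags that are missing from an assumed list of documented tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag inventory; a real archive would document its own.
DOCUMENTED_TAGS = {"annotation", "word", "form", "gloss", "pos"}

# Invented sample: one word with its gloss and part-of-speech indicator.
SAMPLE = """<annotation>
  <word>
    <form>kata</form>
    <gloss>word</gloss>
    <pos>N</pos>
  </word>
</annotation>"""


def undocumented_tags(xml_text):
    """Return a sorted list of element names not found in DOCUMENTED_TAGS."""
    root = ET.fromstring(xml_text)
    used = {element.tag for element in root.iter()}
    return sorted(used - DOCUMENTED_TAGS)


print(undocumented_tags(SAMPLE))
print(undocumented_tags("<annotation><note>checked</note></annotation>"))
```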

Closely related to the issue of text structure is the issue of file formats. File formats define the way in which information is packaged. In general, the file extension says something about the format of a file, but it is not very reliable. Many file formats encode some format information in the header, i.e. the first few bytes of a file. But in order to secure future interpretability, file formats have to be explicitly documented.
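
The following sketch shows how format information in the header can be read instead of trusting the file extension. It covers only a handful of well-known signatures (WAV, PNG, JPEG, TIFF); the function name and the selection of formats are assumptions made for this example, and a match on the signature does not prove that the rest of the file is valid.

```python
# Leading byte signatures of a few common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def sniff_format(header: bytes) -> str:
    """Guess a file format from its first bytes (pass at least 12 bytes)."""
    # WAV is a RIFF container: "RIFF", a 4-byte size field, then "WAVE".
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV"
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


print(sniff_format(b"RIFF\x24\x00\x00\x00WAVEfmt "))
print(sniff_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff_format(b"plain text, no signature"))
```

For a file on disk, the header could be obtained with `open(path, "rb").read(12)`, where `path` is whatever file is being checked.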

3.2.3. Organizational aspects

In a language archive, relations of various kinds can be found between resources. The most relevant relations from an organizational point of view are:

– resources documenting a certain language

– resources that were created during a certain field trip

– resources that share a certain genre

– resources covering different media (sound, video, etc.) pertaining to the same recording

– transcriptions and other annotations that relate to a certain sound file

– a lexicon which was extracted from a number of annotations.
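
One way to picture such relations being made explicit is as fields on a per-resource record. The sketch below is a simplified illustration in Python; the record fields, file names, and values are all hypothetical and do not reproduce the element set of any real metadata scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    path: str
    kind: str                  # "audio", "video", "annotation", or "lexicon"
    language: str
    field_trip: str
    genre: str
    recording_id: str = ""     # shared by all media of the same recording
    derived_from: tuple = ()   # paths of the resources this one relates to


RECORDS = [
    ResourceRecord("s01.wav", "audio", "LanguageA", "trip-1", "story", "rec-01"),
    ResourceRecord("s01.mpg", "video", "LanguageA", "trip-1", "story", "rec-01"),
    ResourceRecord("s01.xml", "annotation", "LanguageA", "trip-1", "story",
                   "rec-01", derived_from=("s01.wav",)),
    ResourceRecord("lexicon.xml", "lexicon", "LanguageA", "trip-1", "",
                   derived_from=("s01.xml",)),
]


def media_of_recording(records, recording_id):
    """Sound and video files that pertain to the same recording."""
    return [r.path for r in records
            if r.recording_id == recording_id and r.kind in ("audio", "video")]


def annotations_of(records, media_path):
    """Annotations that relate to a given sound or video file."""
    return [r.path for r in records
            if r.kind == "annotation" and media_path in r.derived_from]


print(media_of_recording(RECORDS, "rec-01"))
print(annotations_of(RECORDS, "s01.wav"))
```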

These relations may be obvious to the researcher who created the documentation, but in an archive they have to be made explicit to make the archive manageable and the information accessible to users. Only explicit metadata descriptions accompanying each resource can provide the necessary information. Currently, there are two widely used metadata sets for language resources, which serve somewhat different purposes. The OLAC set (an extension of the Dublin Core set) was designed to facilitate searching in integrated metadata domains. Its function is thus quite similar to that of a catalogue in a large library. The IMDI metadata tool already mentioned above is the result of intensive bottom-up discussions within the language engineering and field linguist communities. It was designed to cover all the relations mentioned above and to support browsing, searching, and the management of resources. It thus combines the cataloguing function of metadata with the function of a corpus management tool. It includes an extended set of metadata elements and enables the creation of hierarchies and bundles. It is based on an XML schema comprising definitions of the semantics of the elements used, and it has controlled vocabularies associated with it so that a high degree of consistency can be achieved. This is crucial for retrieval.

Figure 3 gives an example of a simplified IMDI corpus structure from the DoBeS archive, showing how resources such as field notes can be linked to corpus nodes. The resource metadata descriptions can be used to bundle related resources such as a video and a sound file with all associated annotations.
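
The general shape of such a hierarchy, with corpus nodes, linked information files, and bundles of related resources, can be sketched as a small tree in Python. This is not the IMDI format itself: the class names, node names, and file names are hypothetical, and the arrangement is invented for illustration rather than copied from the DoBeS archive.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """Related resources described together, e.g. media plus annotations."""
    name: str
    resources: list


@dataclass
class CorpusNode:
    """A node in the corpus hierarchy (hypothetical structure)."""
    name: str
    children: list = field(default_factory=list)
    bundles: list = field(default_factory=list)
    info_files: list = field(default_factory=list)  # e.g. field notes


def list_resources(node, prefix=""):
    """Yield a path for every file reachable from the given node."""
    path = f"{prefix}/{node.name}"
    for info in node.info_files:
        yield f"{path}/{info}"
    for bundle in node.bundles:
        for resource in bundle.resources:
            yield f"{path}/{bundle.name}/{resource}"
    for child in node.children:
        yield from list_resources(child, path)


session = Bundle("session01", ["session01.wav", "session01.mpg", "session01.xml"])
stories = CorpusNode("Stories", bundles=[session])
language = CorpusNode("LanguageA", children=[stories], info_files=["fieldnotes.txt"])
archive = CorpusNode("Archive", children=[language])

for line in list_resources(archive):
    print(line)
```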

Figure 3. Example of a hierarchical organization of resources

[Figure not reproduced. Node labels: DOBES; TRUMAI, TSAFIKI; Linguistic, Non-Linguistic; Elicitations, Natural use; Stories, Conversation. Legend: archive structure nodes; resource metadata descriptions; resources (audio/video files, annotations, field notes, lexica, etc.); information files, field notes, etc.]
