Chapter 4
Data and language documentation
Peter K. Austin
Introduction
The role of data in language documentation is rather different from the way
that data is traditionally treated in language description. For description, the
main concern is the production of grammars and dictionaries whose primary
audience are linguists (Himmelmann 1998; Woodbury 2003). In these
products language data serves essentially as exemplification and support for
the linguist’s analysis. It is typically presented as individual example sentences,
often without source attribution, and often edited to remove ‘irrelevant
material’. There may also be a ‘sample text’ or two in an appendix to
the grammar. Language documentation, on the other hand, places data at
the center of its concerns. Woodbury (2003: 39) proposes that
direct representation of naturally occurring discourse is the primary project,
while description and analysis are contingent, emergent byproducts which
grow alongside primary documentation but are always changeable and
parasitic on it.
For language documentation then, data collection, representation and diffusion
is the main research goal with grammars, dictionaries, and text collections
as secondary, dependent products that annotate and comment on the
documentary corpus. The audience for language documentation is also very
wide, encompassing not only linguists and researchers from other areas
such as anthropology, musicology, or oral history, but also members of the
speech community whose language is being documented, as well as other
interested people. A significant concern for documentation is archiving, to
ensure that materials are in a format for long-term preservation and future
use, and that information about intellectual property rights and protocols
for access and use are recorded and represented along with the data itself.
Important also is ‘mobilization’ of materials (cf. Chapter 15), i.e. generation
of resources in support of language maintenance and/or learning, especially
where the documented languages are endangered and in need of support.
Woodbury (2003: 46–47) argues that a good documentation corpus should
be:
1. diverse – containing samples of language use across a range of genres
   and socio-cultural contexts, including elicited data;
2. large – given the storage and manipulation capabilities of modern
   information and communications technology (ICT), a digital corpus can be
   extensive and incorporate both media and text;
3. ongoing, distributed, and opportunistic – data can be added to the corpus
   from whatever sources are available and be expanded when new
   materials become available;
4. transparent – the corpus should be structured in such a way as to be
   useable by people other than the researcher(s) who compiled it,
   including future researchers;
5. preservable and portable – prepared in a way that enables it to be
   archived for long-term preservation and not restricted to use in particular
   ICT environments;
6. ethical – collected and analyzed with due attention to ethical principles
   (see Chapter 2) and recording all relevant protocols for access and use.
This means the corpus must be stored digitally and ideally collected digitally.
In this chapter we outline the major processes involved in collecting and
representing language data in a documentation framework, briefly discuss
the tools that are available to assist with this work, and illustrate some of
the products that documentary linguists have developed to present the results
of their research. Further technical details about data structures and
encoding, tools, archiving, and outputs can be found in other chapters in
this volume (see Chapters 13, 14, 15).
It is important to emphasize that language documentation is a developing
field that has emerged only recently and that is undergoing rapid change
in terms of both theory and practice. It can be anticipated that much of what
is presented in this chapter will be subject to change and development in
coming years.
1. Processes in language documentation
Language documentation begins with the development of a project to work
with a speech community on a language and can be seen as progressing
through a series of stages, some of which are carried out in parallel. In the
following we discuss the processes that involve data collection, processing,
and storage. These can be identified as follows:
1. recording – of media (audio, video, image) and text;
2. capture – moving analogue materials to the digital domain;
3. analysis – transcription, translation, annotation, and notation of metadata;
4. archiving – creating archival objects, and assigning access and usage
   rights;
5. mobilization – publication, and distribution of the materials in various
   forms.
Note that at the time when a documentation project is being developed each
of these processes should be considered and relevant procedures included
in the project planning. In particular, archiving and mobilization must be
considered from the beginning of the project and not left to the end of the
project or as an afterthought (see further below).
A crucial aspect that must be kept in mind at all stages is backup.
Backup
It is prudent for any project, and especially one involving digital ICT, to
develop a regular and effective regime of backing up the project data, ideally on
a range of different media (e.g. CD-ROM, DVD, flash memory, external hard
disk). Backups should be incremental and intended for full recovery, should
disaster strike. One widely agreed mantra is LOCKSS, “lots of copies keep stuff
safe” (see http://www.lockss.stanford.edu). Remember, it is highly likely that
you will lose data at some point in your project work; however, a good backup
regime will ensure that such loss can be minimized.
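As a rough illustration of what an incremental, multi-copy backup regime might look like in practice, the following Python sketch copies only new or changed files from a project folder to several backup locations and verifies each copy by checksum. It is a minimal sketch under stated assumptions, not a recommended tool: the function names `sha256_of` and `incremental_backup` and the folder paths in the usage note are hypothetical, and a real project would more likely rely on established backup software.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def incremental_backup(source: Path, destinations: list[Path]) -> list[Path]:
    """Copy new or changed files from `source` into every destination.

    A file is skipped when the destination already holds a copy with the
    same checksum. Nothing is ever deleted from a destination. Returns the
    list of destination files that were written in this run.
    """
    copied = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src_file.relative_to(source)
        src_hash = sha256_of(src_file)
        for dest_root in destinations:
            target = dest_root / relative
            if target.exists() and sha256_of(target) == src_hash:
                continue  # this copy is already up to date
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_file, target)  # copies content and timestamps
            if sha256_of(target) != src_hash:
                raise IOError(f"verification failed for {target}")
            copied.append(target)
    return copied
```

A call such as `incremental_backup(Path("fieldwork"), [Path("/media/disk1/fieldwork"), Path("/media/usb/fieldwork")])` (both paths hypothetical) would keep two separate copies in step with the working folder. Because the sketch never deletes anything from a backup location, a file removed by accident from the working folder remains recoverable.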
2. Documentation processes – recording, metadata creation, and capturing
2.1. Recording
A good documentation corpus will include audio and/or video materials,
ideally recorded in authentic settings and under good conditions. When
recording outdoors, if possible attempt to minimize noise from animals,