Nikolaus P. Himmelmann
other things, native speakers may be more comfortable discussing
metalinguistic knowledge without being constantly recorded). But, to repeat,
regardless of the recording method, records of observable linguistic behavior and
metalinguistic knowledge both contain primary data documenting linguistic
interactions in which native speakers participate.
In the following, we will use the label “corpus of primary data” as
shorthand for “corpus of recordings of observable linguistic behavior and
metalinguistic knowledge” to refer to this component of a language documentation.
Throughout this book it is assumed that this corpus is stored and made
available in digital form.
To date, there is very little practical experience with regard to structuring
and maintaining such digital corpora. Consequently, no widely used and
well-tested structure exists for them. Within the DoBeS program, it is a
widespread practice to operate with two basic components in structuring
primary data: records of individual communicative events and a lexical
database (this obviously follows a widespread practice in linguistic
fieldwork where, apart from transcripts of recordings and fieldnotes, the
compilation of a lexical database is a standard procedure).
Records of individual communicative events are called sessions (alter-
native terms would be “document”, “text”, or “resource bundle”). In the
manual for the IMDI Browser,⁴ a session is defined as “a meaningful unit of
analysis, usually […] a piece of data having the same overall content, the
same set of participants, and the same location and time, e.g., one elicita-
tion session on topic X, or one folktale, or one ‘matching game’, or one
conversation between several speakers.” It could also be the recording of a
two-day ceremony. Sessions are typically allocated to different sets defined
according to parameters such as medium (written vs. spoken), genre (mono-
logue, dialogue, historical, chatting, etc.), naturalness (spontaneous, staged,
elicited, etc.), and so on. It is too early to tell whether some of the various
corpus structures currently being used are preferable to others.
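
One way to picture this allocation of sessions to sets is the following minimal sketch. It is not the IMDI data model: the class `Session`, its field names, the function `group_sessions`, and the sample session names are all hypothetical, chosen only to mirror the parameters mentioned above (medium, genre, naturalness).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One communicative event; all field names are illustrative assumptions."""
    name: str         # identifier, assumed unique within the corpus
    medium: str       # e.g. "spoken" or "written"
    genre: str        # e.g. "monologue" or "dialogue"
    naturalness: str  # e.g. "spontaneous", "staged", or "elicited"


def group_sessions(sessions, parameter):
    """Allocate session names to sets keyed by the value of one parameter."""
    sets = defaultdict(list)
    for session in sessions:
        sets[getattr(session, parameter)].append(session.name)
    return dict(sets)


# Hypothetical sample corpus, invented for illustration only.
corpus = [
    Session("folktale_01", "spoken", "monologue", "staged"),
    Session("elicitation_topic_x", "spoken", "dialogue", "elicited"),
    Session("conversation_01", "spoken", "dialogue", "spontaneous"),
]

by_genre = group_sessions(corpus, "genre")
by_naturalness = group_sessions(corpus, "naturalness")
```

Because the grouping parameter is passed in by name, the same sessions can be regrouped by genre, medium, or naturalness without committing the corpus to a single fixed structure, which seems appropriate as long as no one structure has proven preferable.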
There are two reasons why a lexical database appears to be a useful
format for organizing primary data. On the one hand, there is a need to
bring together all the information available for a given item so that one can
make sure that the meaning and formal properties of the item are well
understood.⁵ On the other hand, and perhaps more importantly, a list of lexical
items is a very useful resource when working on the transcription and trans-
lation of recordings. One of the most widely used computational tools in
descriptive linguistics, the program Toolbox (formerly Shoebox),⁶ allows for
the semi-automatic compilation of a lexical database when working through
Chapter 1 – Language documentation: What is it and what is it good for?
a transcript, and the existence of this program is certainly one reason why
the compilation of a lexical database currently is almost an automatic pro-
cedure when working with recordings. However, as with all other aspects
of organizing a digital corpus of primary data, it remains to be seen and
tested further whether this is indeed a necessary and useful procedure.
3.1.2. Apparatus
Inasmuch as linguistic and metalinguistic interactions cover the range of
basic interactional possibilities,⁷ a documentation which contains a
comprehensive set of primary data for both types of interactions is logically
complete with regard to the level of primary data. However, it is well
known that a large corpus of primary data is of little use unless it is pre-
sented in a format which ensures accessibility for parties other than the
ones participating in its compilation. To be accessible to a broad range of
users, including the speech community, the primary data need to be accom-
panied by information of various kinds, which – following philological
tradition – could be called the apparatus. The precise extent and format of
the apparatus are a matter of debate, with one exception: the uncontroversial
need for metadata.
Metadata are required on two levels. First, the documentation as a whole
needs metadata regarding the project(s) during which the data were com-
piled, including information on the project team(s), and the object of docu-
mentation (which variety? spoken where? number and type of records; etc.).
Second, each session (= segment of primary data) has to be accompanied
by information of the following kind:⁸
– a name of the session which uniquely identifies it within the overall
  corpus;
– when and where was the data recorded?;
– who is recorded and who else was present at the time?;
– who made the recording and what kind of recording equipment was
  used?;
– an indication of the quality of the data according to various parameters
  (recording environment and equipment, speaker competence, level of
  detail of further annotation);
– who is allowed to access the data contained in this session?;