Peter K. Austin
Figure 1. Working with Shoebox
Note that in the presentation format, typography (e.g. italics, bolding, font
type, indentation) and dictionary literacy conventions are employed to par-
tially represent the data structure (see Nichols and Sprouse 2003 for other
examples). The sentence example can be presented as follows:
ngaya   banbalguya     nhunga  yilunha   bawurra
ngaya   banba-lgu-ya   nhunga  yilu-nha  bawurra
1sgnom  spear-fut-1sg  3sgacc  this-acc  k.o.kangaroo
pro     vtr-suff-suff  pro     dem-suff  n
‘I will spear this red kangaroo’
[SAW, WW, Np12As004]
Linguists’ conventions (such as the ‘Leipzig Glossing Rules’ – see
http://www.eva.mpg.de/lingua/files/morpheme.html) have been established for
annotated text so that, as in the given example, horizontal and vertical
alignment on the page represents relationships between different types of data.
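The alignment convention can be made explicit in a short Python sketch. It stores each word of the sentence example as one unit with four tiers (word, morphemes, gloss, part of speech), pads the columns so that the tiers line up vertically, and checks that the hyphen-separated segments correspond across tiers. The names Word, render_interlinear and check_alignment are illustrative assumptions and are not part of Shoebox or any other tool mentioned here.

```python
from dataclasses import dataclass

TIERS = ("form", "morphemes", "gloss", "pos")


@dataclass
class Word:
    """One column of an interlinear text (hypothetical structure)."""
    form: str       # word as written
    morphemes: str  # hyphen-segmented form
    gloss: str      # morpheme-by-morpheme gloss
    pos: str        # part-of-speech label for each morpheme


def render_interlinear(words, translation):
    """Return the tiers as lines whose columns are vertically aligned."""
    # Each column is as wide as its longest cell on any tier.
    widths = [max(len(getattr(w, t)) for t in TIERS) for w in words]
    lines = []
    for tier in TIERS:
        cells = [getattr(w, tier).ljust(n) for w, n in zip(words, widths)]
        lines.append("  ".join(cells).rstrip())
    lines.append(f"'{translation}'")
    return "\n".join(lines)


def check_alignment(words):
    """Return the words whose segmented tiers differ in segment count."""
    bad = []
    for w in words:
        counts = {len(getattr(w, t).split("-")) for t in TIERS[1:]}
        if len(counts) != 1:
            bad.append(w.form)
    return bad


# Data taken from the sentence example above.
words = [
    Word("ngaya", "ngaya", "1sgnom", "pro"),
    Word("banbalguya", "banba-lgu-ya", "spear-fut-1sg", "vtr-suff-suff"),
    Word("nhunga", "nhunga", "3sgacc", "pro"),
    Word("yilunha", "yilu-nha", "this-acc", "dem-suff"),
    Word("bawurra", "bawurra", "k.o.kangaroo", "n"),
]

print(render_interlinear(words, "I will spear this red kangaroo"))
```

Keeping each word as a single unit means that the vertical relationships are part of the data rather than a side effect of spacing on the page.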
Chapter 4 – Data and language documentation
Lost in the flow
The data structures encoded in these Shoebox files are relatively complex (see
the diagram in the Appendix below, and Austin 2005) but the links between
the data fields are lost in the process of export to RTF and presentation on the
printed page. Note, however, that the links could be captured in an HTML file
and thus be available to be viewed with a web browser. We discuss archival
formats for these examples below.
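As a rough illustration of how such links might survive in HTML, the following Python sketch parses backslash-marked lexicon records and turns each morpheme of an annotated word into a hyperlink to its lexicon entry. The field markers (\lx, \ps, \ge), the "lx-" anchor scheme and all function names are assumptions made for this example; the actual files discussed here may be organized differently. The two entries reuse forms and glosses from the sentence example.

```python
import html

# Hypothetical lexicon file in a Shoebox-like backslash-marker layout.
SHOEBOX_TEXT = r"""
\lx banba
\ps vtr
\ge spear

\lx bawurra
\ps n
\ge k.o.kangaroo
"""


def parse_records(text):
    """Split marker-coded text into a list of {marker: value} dicts."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            marker, _, value = line.partition(" ")
            record[marker.lstrip("\\")] = value.strip()
        records.append(record)
    return records


def lexicon_html(records):
    """Render each entry as a definition-list item with an anchor id."""
    items = []
    for r in records:
        lx, ps, ge = (html.escape(r[k]) for k in ("lx", "ps", "ge"))
        items.append(f'<dt id="lx-{lx}">{lx}</dt><dd><i>{ps}</i> {ge}</dd>')
    return "<dl>" + "".join(items) + "</dl>"


def link_word(segmented, known):
    """Link every morpheme that has a lexicon entry; leave others plain."""
    parts = []
    for morpheme in segmented.split("-"):
        safe = html.escape(morpheme)
        if morpheme in known:
            parts.append(f'<a href="#lx-{safe}">{safe}</a>')
        else:
            parts.append(safe)
    return "-".join(parts)


records = parse_records(SHOEBOX_TEXT)
known = {r["lx"] for r in records}
print(link_word("banba-lgu-ya", known))
print(lexicon_html(records))
```

In a printed or RTF version the reader sees only the typography; here the relation between the text and the lexicon is itself stored in the file.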
3.2. Tools for linguistic analysis and processing
There is a range of computational resources that facilitate creating, viewing,
querying, or otherwise using language data. They include application
programs, components, fonts, style sheets, and document type definitions
(DTDs). Application programs can be classified into two types:
1. general purpose software, for which the user must design the data
structures and can write application programs to manipulate the data and
carry out various tasks. Examples are MS Word and Excel, and FileMaker
Pro. Such software is powerful and flexible; however, it stores data in a
proprietary format which is not optimal for long-term storage and access;
2. specific purpose software, which is designed to be used for particular
tasks. Examples of such software in common use by language documenters
include: Transcriber and EXMARaLDA (EXtensible MARkup Language for
Discourse Annotation – see Schmidt 2004) for time-aligned audio
annotations, Shoebox/Toolbox for text and lexicon annotations, Praat for
speech analysis and annotation, ELAN for audio and video annotation, and
IMDI Browser for cataloguing and administration metadata.
Some of the specific purpose software is discussed and illustrated else-
where in this volume.
Other useful software
In addition to the tools mentioned above, there also exist converter programs
for transferring data between encoding formats, such as those developed at
MPI-Nijmegen for uniting Transcriber and Shoebox encoded files, and con-
verting them to XML for use with ELAN. Further information about available
programs and computational resources can be found at the E-MELD ‘School
of Best Practice’ website and in the list of resources at the back of this volume.
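The general idea behind such converters can be sketched in a few lines of Python: time-aligned segments are read into a neutral structure and written out as XML. The element and attribute names below (annotations, tier, segment, start, end) are invented for illustration and do not follow the file format of ELAN or of any other program; the time values are likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def segments_to_xml(segments, tier_name):
    """Serialize (start_ms, end_ms, text) tuples as a simple XML string."""
    root = ET.Element("annotations")
    tier = ET.SubElement(root, "tier", name=tier_name)
    for start_ms, end_ms, text in segments:
        if end_ms <= start_ms:
            raise ValueError(f"segment ends before it starts: {text!r}")
        seg = ET.SubElement(tier, "segment",
                            start=str(start_ms), end=str(end_ms))
        seg.text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical timings for the sentence example, in milliseconds.
segments = [(0, 2400, "ngaya banbalguya nhunga yilunha bawurra")]
print(segments_to_xml(segments, "transcription"))
```

A real converter would also need to map field names and character encodings between the source and target formats.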
3.3. Archiving
Digital archiving involves the preparation of the recorded/captured data,
metadata, and processed analysis so that the information they contain is
maximally informative and explicitly expressed, is encoded for long-term
accessibility, and is safely stored with a reputable organization that can
guarantee long-term curation. A number of digital language and music archives exist;
the DELAMAN network created in 2003 links many of them (see resources
list). Digital archiving offers opportunities to store data for communities to
use and other scholars to access, and to preserve it for future generations of
community members, the general public, and researchers. Note that not all
recorded data has to be archived (e.g. unprocessed video files) but we
should aim to make our materials archivable, that is, richly structured docu-
mentations maximizing the possibilities of the digital medium. Archiving
must be included as a process in our language documentation project plans,
and it is advisable to seek assistance with planning for archiving from an
archivist at the beginning of project conception.
Note that archiving is not publication (only those materials prepared for
distribution will be published by the archive), nor is it backup (the archive
will generally not accept backup copies of files alone but will expect the
data and metadata to be explicitly described, often by requesting that de-
posit forms be completed for each archival object). Archives also com-
monly have systems in place to manage protocols for intellectual property
rights, and for specification of access and usage rights, e.g. that a certain
archival object is only available to members of the speaker community. The
depositor should establish these by discussion and negotiation with the
owners, and describe them via metadata and deposit protocols. Data sensitivity
is not a reason not to archive; it is better to deposit data in an archive
with restrictions than not to deposit at all. Researchers should also make