Chapter 4 – Data and language documentation
notes1.doc in another directory (e.g. for your 2005 notes) then any loss of
directory information will result in confusion between these files. Different
naming schemes can be used, but clarity and transparency are the goal – see
Johnson (2004) for some suggestions. It is also essential to record the
relevant metadata for the data files you create as you make them, ideally in
a structured way such as a relational table using standard terminology.
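As a rough illustration of recording file metadata in a relational table, the following sketch uses Python's built-in sqlite3 module. The table name, column names, directory names and dates are all invented for the example and do not follow any particular metadata standard; the point is only that storing the full path alongside a description keeps two files called notes1.doc apart.

```python
import sqlite3

# In-memory database, used here only so that the sketch is self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: column names are illustrative, not a metadata standard.
conn.execute(
    """
    CREATE TABLE file_metadata (
        path        TEXT PRIMARY KEY,  -- full path, so directory information is kept
        created     TEXT NOT NULL,     -- creation date (placeholder values below)
        description TEXT NOT NULL      -- short description of the contents
    )
    """
)

# Placeholder rows: the directories and dates are invented for illustration.
rows = [
    ("2004/notes1.doc", "2004-01-01", "field notes"),
    ("2005/notes1.doc", "2005-01-01", "field notes"),
]
conn.executemany("INSERT INTO file_metadata VALUES (?, ?, ?)", rows)

# Both files are named notes1.doc; the recorded path distinguishes them.
same_name = conn.execute(
    "SELECT path FROM file_metadata WHERE path LIKE '%notes1.doc' ORDER BY path"
).fetchall()
print(same_name)
```

A spreadsheet or a plain delimited text file could serve the same purpose; what matters is that the metadata is recorded at the time the file is created.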
3. Processing the materials
3.1. Linguistic processing
Processing the documentary materials is a very different operation from
recording and capture, and operates on a very different time scale. Thus
each minute of audio can take hours to process in terms of transcription and
annotation (depending on familiarity with the language and the richness of
the annotation), while video is even more labor-intensive and requires
much more time to process. Video may require cutting and converting to
create manageable chunks and file sizes (this is done with computer
software²). There are several tools that are useful for transcription and
annotation (see below).
Linguistic analysis, that is transcription, translation, and annotation,
requires decisions about representation, i.e. the levels and types of units.
These decisions should make sense within the researcher’s chosen framework
(theory) and need to be made clear in the structural metadata that
accompanies the relevant files.
There are good reasons for aiming at a certain degree of standardization
when processing the materials, including transparency, portability, and ease
of sharing and access (Bird and Simons 2003). Phonetic transcription
should follow the conventions of the International Phonetic Association
(IPA), and phonemic transcription should be IPA or a regionally-recog-
nized standard. Grammatical annotation tags (i.e. the abbreviated labels for
e.g. part of speech categories) should follow general linguistic practice, e.g.
the recommendations of EUROTYP or E-MELD (including its GOLD on-
tology), with a list of relevant abbreviations and symbols provided as meta-
data (for further discussion, see Chapter 9 and Leech and Wilson 1996).
Peter K. Austin
For processed data we need to distinguish between the following:
1. Character encoding – how characters are represented, e.g. Windows /
   ANSI, Unicode, UTF-8, Big5, JISC.
2. Data encoding – how meaningful structures in the data are marked, e.g.
   extensible markup language (XML), Shoebox/Toolbox standard markers,
   MS Word table.
3. File encoding – how the data is packaged into a digital file, e.g. plain
   text, MS Word, PDF, Excel spreadsheet.
4. Physical storage medium – the physical form used to store the file, e.g.
   CD-ROM, minidisk, DAT, hard disk, flash memory stick.
As an example, certain documentary materials might be encoded as a hard
disk file in plain text Unicode Toolbox format (for further discussion and
examples, see Chapter 14).
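The difference between character encodings can be seen in a short Python sketch. The test string is an arbitrary one chosen because it contains a non-ASCII letter (ŋ, U+014B); it is not data from any language discussed here. The sketch shows that the same characters become different byte sequences under different encodings, that a Windows / ANSI code page (cp1252 is assumed here) cannot represent the letter at all, and that reading bytes with the wrong encoding garbles the text without raising an error.

```python
# Arbitrary test string containing one non-ASCII letter (ŋ, U+014B).
text = "ŋaya"

# The same four characters as bytes under two Unicode encodings.
utf8_bytes = text.encode("utf-8")       # ŋ occupies two bytes in UTF-8
utf16_bytes = text.encode("utf-16-le")  # every character here occupies two bytes

print(utf8_bytes)
print(len(utf8_bytes), len(utf16_bytes))

# Windows code page 1252 has no code for ŋ, so encoding fails outright.
try:
    text.encode("cp1252")
    ansi_ok = True
except UnicodeEncodeError:
    ansi_ok = False
print(ansi_ok)

# Decoding UTF-8 bytes as if they were cp1252 raises no error here,
# but the result is no longer the original text.
garbled = utf8_bytes.decode("cp1252")
print(garbled)
```

This is why the character encoding of each file needs to be recorded as part of its metadata: the bytes alone do not say how they should be read.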
When we consider file encoding it is useful to distinguish between
proprietary formats and non-proprietary formats. A proprietary format is one
whose structure is determined and owned by the maker of the software that
stores it, e.g. MS Word, Excel, Access, FileMaker Pro, or Sony ATRAC
(the audio format on minidisk). This means that the data is not directly
accessible, and the format is subject to change (so that attempting to
open a file stored in one version of the software with a later version may
not always work – see Chapter 14 for examples). As a result, proprietary
formats are not ideal for long-term storage (i.e. the encoding is not portable
and reusable). Non-proprietary formats, e.g. Unicode plain text, or wav
audio, are open and transferable between hardware and software.
When processing the data it can be useful to distinguish three kinds of
contexts each requiring different data formats (see also Johnson 2004):
1. working context – the way the data is stored for on-going research work
   of annotation and analysis;
2. archiving context – how the materials are to be stored for long-term
   preservation (see below);
3. presentation context – the form of the data for distribution and
   publication.
Researchers need to develop ways to move data between contexts, typically
by exporting the data into some structured format that the software used for
other contexts can read (see Thieberger 2004 for some examples). Thus, a
common working format for text annotation is Shoebox/Toolbox; this can
be exported into rich text format (RTF) to be read by MS Word in order to
produce presentation format PDF documents for printing and distribution.
Table 2 gives examples of the different format types for the three contexts.
Table 2. Data formats in different contexts

        Working                              Archiving       Presentation
Text    Word, XLS, FMpro, Shoebox/Toolbox    XML             PDF, HTML
Audio   WAV                                  WAV, BWF        MP3, WMA, RA
Video   MPEG2                                MPEG2, MPEG4    QuickTime, AVI, WMV

As an illustration, Figure 1 is a screen shot which shows Shoebox format
working context data for the Australian Aboriginal Guwamu language.³ In
the window on the top left is lexical information, on the lower left is
elicited sentence data with morpheme-by-morpheme glossing annotation and
free translation, on the top right is descriptive metadata about the people
involved in the project, and on the bottom right metadata about
abbreviations used in the lexical and sentence annotations. Note that the
metadata is hypertextually linked to the data in the two left-hand windows,
while the lexical root is hypertextually linked from the morpheme field in
the sentence window, and the sentence number links from the example field in
the lexicon.
A possible presentation form of the illustrated lexical entry is the following:
bawurra n
male red kangaroo, Note: used as a generic
term for kangaroos, cf. gula, gumbarr,
dhugandu, [SAW, WW], e.g. Gu206, Gu255
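The movement of one entry from working context to archiving and presentation contexts can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the export routine of Shoebox/Toolbox itself: the record is a hypothetical reconstruction of part of the entry above (the actual fields shown in Figure 1 are not reproduced here), the backslash markers \lx, \ps, \ge and \nt follow common Toolbox dictionary usage, and the XML element names and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical working-context record in backslash-marker (standard format) style.
record = r"""\lx bawurra
\ps n
\ge male red kangaroo
\nt used as a generic term for kangaroos
"""


def parse_record(text):
    """Split a backslash-marked record into (marker, value) pairs."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            fields.append((marker, value.strip()))
        elif line.strip() and fields:
            # A line without a marker continues the previous field.
            marker, value = fields[-1]
            fields[-1] = (marker, f"{value} {line.strip()}")
    return fields


def to_xml(fields):
    """Archiving sketch: one XML element per field (element names are assumptions)."""
    entry = ET.Element("entry")
    for marker, value in fields:
        ET.SubElement(entry, marker).text = value
    return ET.tostring(entry, encoding="unicode")


def to_presentation(fields):
    """Presentation sketch: a one-line dictionary-style entry."""
    d = dict(fields)
    return f"{d['lx']} {d['ps']} {d['ge']}, Note: {d['nt']}"


fields = parse_record(record)
xml_text = to_xml(fields)
presentation = to_presentation(fields)
print(xml_text)
print(presentation)
```

The same parsed fields feed both outputs, which is the general idea behind keeping one structured working format and deriving the archiving and presentation forms from it.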