Peter K. Austin
Table 4 shows an XML representation of the Guwamu sentence shown
above. Note that the XML representation makes explicit the sequential order
of words in the sentence and the relationships between elements, e.g. word
forms and their constituent morphemes. These are purely implicit in the
typical working format (Shoebox) and presentation format (printed example),
which rely on horizontal and vertical alignment on the page or screen to
signal the relationships.
Table 4. Example of an XML structure (Guwamu sentence)
Gu255
ngaya banbalguya nhunga yilunha bawurra
I will spear this red kangaroo
SAW
WW
[Np12As004]
pronoun co-occurrence with demonstrative and noun;
demonstrative inflected for accusative case
03/Apr/2005
ngaya
I
ngaya
pro
pro
1sgnom
banbalguya
will spear
banba
v
vtr
spear
lgu
suff
vinfl
Again, we can view this representation using XML-aware software and see
its hierarchical structure, firstly in terms of a sentence made up of a
sequence of words, as in Figure 3.
Figure 3. XML structure display (Guwamu sentence, sentence level)
Now, if we view the information about words in the sentence in detail, as in
Figure 4, we see that they consist of one or more morphemes in sequence
(notice that the triangle icon on the left margin changes from horizontal to
vertical as we move down the hierarchy).
More on archival format
Note that the information stored in the XML representation is extremely
compact but is still readable by humans, and the structure can be recovered
even if the software to display the data is missing; this is why XML is a good
archival format. For more information on archival encoding, see the Text
Encoding Initiative (http://www.tei.org) or the resource websites listed at
the end of this book. There are numerous introductory textbooks for XML,
though none of them explicitly deals with language documentation issues.
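The point that the structure can be recovered without dedicated display software can be illustrated with a minimal sketch in Python, using only the standard library. The element and attribute names (sentence, word, morpheme, form, gloss, cat) are assumptions invented for this illustration and are not the markup actually used in Table 4; only the Guwamu forms and glosses are taken from the table, and only the first two words of the sentence are encoded.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag and attribute names are invented for this
# sketch. Only the forms and glosses come from Table 4, and only the first
# two words of the sentence are encoded here.
SAMPLE = """
<sentence id="Gu255">
  <text>ngaya banbalguya nhunga yilunha bawurra</text>
  <translation>I will spear this red kangaroo</translation>
  <word form="ngaya" gloss="I">
    <morpheme form="ngaya" cat="pro" gloss="1sgnom"/>
  </word>
  <word form="banbalguya" gloss="will spear">
    <morpheme form="banba" cat="v" gloss="spear"/>
    <morpheme form="lgu" cat="suff"/>
  </word>
</sentence>
"""


def read_sentence(xml_text):
    """Return the sentence id and a list of (form, gloss, morphemes) tuples.

    Words and morphemes come back in document order, so the sequence that a
    printed example signals by alignment is read directly from the file.
    """
    root = ET.fromstring(xml_text.strip())
    words = []
    for word in root.findall("word"):
        morphemes = [m.get("form") for m in word.findall("morpheme")]
        words.append((word.get("form"), word.get("gloss"), morphemes))
    return root.get("id"), words


sentence_id, words = read_sentence(SAMPLE)
for position, (form, gloss, morphemes) in enumerate(words, start=1):
    # One line per word: position, word form, gloss, hyphenated morphemes.
    print(position, form, repr(gloss), "-".join(morphemes))
```

Because each word element contains its own morpheme elements, the relationship between a word form and its constituent morphemes is carried by nesting rather than by layout, and any general-purpose XML parser can read it back.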
Chapter 4 – Data and language documentation
Figure 4. XML structure display (Guwamu sentence, word level)
3.3.2. Archiving sound and video
The formats for real-time media are subject to rapid technological change,
and one of the major concerns of archives is to attend to refreshing files
(‘forward migration’) so that they remain readable by current equipment.
For video, there are two internationally agreed compressed formats,
namely MPEG2 and MPEG4; however, there is no agreement about raw
formats, which in any case are extremely difficult to store because of their
very large file size. For audio recordings, archives generally use uncompressed
CD-quality audio (44.1kHz, 16 bit) encoded as WAV files; some archives
also use 48kHz and/or BWF (‘broadcast wave format’), where metadata is
bundled together with the audio. Note that MP3, RealAudio, or Windows