Chapter 4 – Data and language documentation
preparations for assigning their rights into the future by including information
in your will and ensuring that your executors understand how to assign
them on your death.
3.3.1. Archiving text materials
The preferred format for archiving text materials is eXtensible Markup
Language (XML), a document description language used to encode the
content of structured documents (see Sperberg-McQueen and Burnard
2002). XML is a subset of SGML (Standard Generalized Markup Language)
and is used to explicitly describe a domain of knowledge through markup
tags enclosed in angle brackets (see Chapter 14 with the example of a ‘play
structure’ implicit in a published document). Each part of a structured
document is described within a defined and logical structure (stored in XML
schemas or DTDs, ‘document type definitions’). XML is a good archival
format because XML documents explicitly represent data structure and are
directly readable by humans even if computer software to display the
documents is not available.
XML documents are typically created by export from working context
materials, rather than being written directly by the researcher, because
writing well-structured XML by hand tends to be tedious and error-prone.
(Various XML editors exist; these can be used to create documents, to
check markup tag syntax [well-formedness], to create DTDs, and to ensure
that a document complies with a schema or DTD.) XML-encoded documents
can be transformed into various archival and presentation formats by
XSLT (Extensible Stylesheet Language Transformations). Thus, an XSLT
stylesheet could create a concordance of an annotated text collection, or
HTML files for web publication. Archivists can provide advice on possible
transformations of XML documents.
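As a rough illustration of the two operations just described (checking
well-formedness and transforming XML into a presentation format), the
following Python sketch uses only the standard library. It is a stand-in
for an XML editor and an XSLT stylesheet, not an example of either; the
element names (lexicon, entry, form, gloss) and the sample values are
hypothetical. Note that it tests well-formedness only and does not check
a document against a schema or DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample data; element names and values are illustrative only.
SAMPLE = (
    "<lexicon>"
    "<entry><form>form1</form><gloss>gloss1</gloss></entry>"
    "<entry><form>form2</form><gloss>gloss2</gloss></entry>"
    "</lexicon>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def lexicon_to_html(xml_text: str) -> str:
    """Render each <entry> as an HTML list item (form followed by gloss)."""
    root = ET.fromstring(xml_text)
    items = []
    for entry in root.iter("entry"):
        form = escape(entry.findtext("form", default=""))
        gloss = escape(entry.findtext("gloss", default=""))
        items.append(f"<li><b>{form}</b>: {gloss}</li>")
    return "<ul>" + "".join(items) + "</ul>"


if __name__ == "__main__":
    print(is_well_formed(SAMPLE))
    print(lexicon_to_html(SAMPLE))
```

In practice the same transformation would normally be written once as a
stylesheet and reused across a whole collection.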
The following are two examples of XML encoding. First, consider the
structure of a typical bilingual lexicon (such as seen in the Guwamu example
presented above):
1. lexicons contain entries;
2. the attributes of entries are: form, category, subcategory, language,
   meaning specification (and any other additional information such as
   notes, speaker, recorder, sense relations, sentence examples);
3. meaning specification can be gloss (for morpheme-by-morpheme glossing
   and finderlist production) and definition;
Peter K. Austin
4. cross-references to other lexical entries have a sequential order chosen
   by the lexicographer;
5. cross-references to sentence examples also have a specified sequential
   order.
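The structure listed above can be sketched in Python as an export
routine that builds one entry and serializes it as XML. This is a minimal
sketch under stated assumptions: the element and attribute names
(lexicon, entry, meaning, xref, example, order) are invented for
illustration and are not claimed to match the tag set of Table 3, and the
upper-case values are placeholders rather than real data. Optional
information such as notes, speaker and recorder is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_entry(form, language, category, subcategory, gloss, definition,
                xrefs=(), examples=()):
    """Build one hypothetical <entry> element from its attributes."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "form").text = form
    ET.SubElement(entry, "language").text = language
    ET.SubElement(entry, "category").text = category
    ET.SubElement(entry, "subcategory").text = subcategory

    # The meaning specification groups the gloss and the definition.
    meaning = ET.SubElement(entry, "meaning")
    ET.SubElement(meaning, "gloss").text = gloss
    ET.SubElement(meaning, "definition").text = definition

    # Cross-references keep the order chosen by the lexicographer; the
    # order is recorded both by position and by an explicit attribute.
    for order, target in enumerate(xrefs, start=1):
        ET.SubElement(entry, "xref", order=str(order)).text = target
    for order, ref in enumerate(examples, start=1):
        ET.SubElement(entry, "example", order=str(order)).text = ref
    return entry


# Placeholder values only; substitute real lexical data on export.
lexicon = ET.Element("lexicon")
lexicon.append(build_entry(
    form="HEADWORD", language="LANG", category="n", subcategory="n",
    gloss="GLOSS", definition="DEFINITION",
    xrefs=["XREF1", "XREF2"], examples=["S1", "S2"],
))
xml_text = ET.tostring(lexicon, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```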
Table 3 shows the Guwamu sample entry discussed above in XML form,
which would be a possible archival representation.
Table 3. Example of an XML structure (lexicon entry)
Gu
n
n
k.o.kangaroo
male red kangaroo
used as a generic term for kangaroos
SAW
WW
13/Mar/2005
gula
gumbarr
dhugandu
Gu206
Gu255
If we view this data using XML-aware software such as an XML editor or a
web browser such as Mozilla Firefox or the current version of MS Internet
Explorer, the hierarchical relationships between the data entities are
displayed as in Figure 2.
Figure 2. XML structure display (lexicon entry)
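A comparable hierarchical view can be approximated without an XML editor
or browser. The Python sketch below walks a parsed document and indents
each element by its depth; the sample document, its element names and
its placeholder values are assumptions made for illustration, not the
content of Figure 2.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return indented lines, one per element, showing the nesting."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (f": {text}" if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical sample; element names and values are placeholders.
SAMPLE = (
    "<lexicon><entry><form>HEADWORD</form>"
    "<meaning><gloss>GLOSS</gloss></meaning></entry></lexicon>"
)
tree_lines = outline(ET.fromstring(SAMPLE))

if __name__ == "__main__":
    print("\n".join(tree_lines))
```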
For an annotated corpus we can set up a structure where:
1. the corpus contains sentences;
2. sentence properties are: sentence number, sentence form, sentence gloss,
   speaker, recorder, sentence source reference, grammatical notes;
3. sentences contain words in sequential order;
4. word properties are: word form, word gloss;
5. words contain morphemes in sequential order;
6. morpheme properties are morpheme form, morpheme gloss, morpheme
   category, morpheme subcategory.
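The corpus structure above can be sketched in the same way. In this
minimal Python example the element and attribute names (corpus, sentence,
word, morpheme, number, category, subcategory) are assumptions, the
values are placeholders, and the speaker, recorder, source reference and
grammatical notes are left out for brevity. Sequential order of words and
morphemes is carried by document order.

```python
import xml.etree.ElementTree as ET


def build_sentence(number, form, gloss, words):
    """Build a hypothetical <sentence> element.

    words is a list of (word_form, word_gloss, morphemes) tuples, and
    morphemes is a list of (form, gloss, category, subcategory) tuples.
    """
    sentence = ET.Element("sentence", number=str(number))
    ET.SubElement(sentence, "form").text = form
    ET.SubElement(sentence, "gloss").text = gloss
    for word_form, word_gloss, morphemes in words:
        word = ET.SubElement(sentence, "word")
        ET.SubElement(word, "form").text = word_form
        ET.SubElement(word, "gloss").text = word_gloss
        for m_form, m_gloss, m_cat, m_subcat in morphemes:
            morph = ET.SubElement(word, "morpheme",
                                  category=m_cat, subcategory=m_subcat)
            ET.SubElement(morph, "form").text = m_form
            ET.SubElement(morph, "gloss").text = m_gloss
    return sentence


# Placeholder values only; substitute real annotated data on export.
corpus = ET.Element("corpus")
corpus.append(build_sentence(
    1, "W1 W2", "SENTENCE GLOSS",
    [
        ("W1", "G1", [("M1", "g1", "n", "n")]),
        ("W2", "G2", [("M2", "g2", "v", "tr"), ("M3", "g3", "sfx", "tense")]),
    ],
))

if __name__ == "__main__":
    print(ET.tostring(corpus, encoding="unicode"))
```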