Essentials of Language Documentation

Chapter 4 – Data and language documentation

notes1.doc in another directory (e.g. for your 2005 notes) then any loss of directory information will result in confusion between these files. Different naming schemes can be used, but clarity and transparency are the goal; see Johnson (2004) for some suggestions. It is also essential to record the relevant metadata for the data files you create as you make them, ideally in a structured way such as a relational table using standard terminology.
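As a rough illustration of recording metadata in a relational table, the following Python sketch uses the standard sqlite3 module. The table name, the column names, the directory names, and the dates are all hypothetical and chosen only for the example; a real project would take its field names from an agreed metadata standard. The bare file name is made the primary key, so a second notes1.doc in a different directory is rejected rather than silently accepted.

```python
import sqlite3


def open_catalogue():
    """Create an in-memory catalogue; a real project would use a file on disk."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE file_metadata (
            filename    TEXT PRIMARY KEY,  -- must be unique without its directory
            directory   TEXT NOT NULL,
            created     TEXT NOT NULL,     -- ISO 8601 date
            description TEXT NOT NULL
        )
        """
    )
    return conn


def record_file(conn, filename, directory, created, description):
    """Add one row; return False if the bare file name is already in use."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO file_metadata VALUES (?, ?, ?, ?)",
                (filename, directory, created, description),
            )
        return True
    except sqlite3.IntegrityError:
        return False


catalogue = open_catalogue()
# Hypothetical directories and dates, for illustration only.
first = record_file(catalogue, "notes1.doc", "fieldwork_a", "2004-07-01", "field notes")
clash = record_file(catalogue, "notes1.doc", "fieldwork_b", "2005-07-01", "field notes")
renamed = record_file(catalogue, "notes_2005_01.doc", "fieldwork_b", "2005-07-01", "field notes")
```

Here the second call returns False because the name is already taken, while the renamed file is accepted. Putting the year into the file name is only one possible scheme.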

3. Processing the materials

3.1. Linguistic processing

Processing the documentary materials is a very different operation from recording and capture, and operates on a very different time scale. Each minute of audio can take hours to transcribe and annotate (depending on familiarity with the language and the richness of the annotation), while video is more labor-intensive still. Video may require cutting and converting to create manageable chunks and file sizes (this is done with computer software; see note 2). There are several tools that are useful for transcription and annotation (see below).

Linguistic analysis, that is transcription, translation, and annotation, requires decisions about representation, i.e. the levels and types of units. These decisions should make sense within the researcher’s chosen framework (theory) and need to be made clear in the structural metadata that accompanies the relevant files.

There are good reasons for aiming at a certain degree of standardization when processing the materials, including transparency, portability, and ease of sharing and access (Bird and Simons 2003). Phonetic transcription should follow the conventions of the International Phonetic Association (IPA), and phonemic transcription should use IPA or a regionally recognized standard. Grammatical annotation tags (i.e. the abbreviated labels for categories such as part of speech) should follow general linguistic practice, e.g. the recommendations of EUROTYP or E-MELD (including its GOLD ontology), with a list of relevant abbreviations and symbols provided as metadata (for further discussion, see Chapter 9 and Leech and Wilson 1996).

Peter K. Austin

For processed data we need to distinguish between the following:

Character encoding: how characters are represented, e.g. Windows/ANSI, Unicode, UTF-8, Big5, JISC.

Data encoding: how meaningful structures in the data are marked, e.g. extensible markup language (XML), Shoebox/Toolbox standard markers, MS Word table.

File encoding: how the data is packaged into a digital file, e.g. plain text, MS Word, PDF, Excel spreadsheet.

Physical storage medium: the physical form used to store the file, e.g. CD-ROM, minidisk, DAT, hard disk, flash memory stick.

As an example, certain documentary materials might be stored on a hard disk (physical storage medium) as a plain text file (file encoding) in Unicode (character encoding) using Toolbox markers (data encoding); for further discussion and examples, see Chapter 14.
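To make the character encoding level concrete, the following Python sketch encodes one short string under several Unicode encodings and then misreads the UTF-8 bytes as a Windows code page. The sample string is a hypothetical form containing the IPA velar nasal, not data from the documentation project, and the function name is invented for the example.

```python
def byte_lengths(text):
    """Return the number of bytes `text` occupies under several encodings."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


sample = "\u014ba"  # hypothetical two-character form: velar nasal followed by "a"
sizes = byte_lengths(sample)

utf8_bytes = sample.encode("utf-8")

# Decoding with the encoding that was used to write the bytes recovers the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Windows-1252 yields different, garbled characters.
misread = utf8_bytes.decode("cp1252", errors="replace")
```

The same two characters take a different number of bytes in each encoding, and only the matching decoder returns the original string. This is one reason to state the character encoding explicitly in the metadata.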

When we consider file encoding it is useful to distinguish between proprietary formats and non-proprietary formats. A proprietary format is one whose structure is determined and owned by the maker of the software that stores it, e.g. MS Word, Excel, Access, FileMaker Pro, or Sony ATRAC (the audio format on minidisk). This means that the data is not directly accessible, and the format is subject to change (so that attempting to open a file stored in one version of the software with a later version may not always work; see Chapter 14 for examples). As a result, proprietary formats are not ideal for long-term storage (i.e. the encoding is not portable and reusable). Non-proprietary formats, e.g. Unicode plain text or WAV audio, are open and transferable between hardware and software.

When processing the data it can be useful to distinguish three kinds of context, each requiring different data formats (see also Johnson 2004):

working context: the way the data is stored for on-going research work of annotation and analysis;

archiving context: how the materials are to be stored for long-term preservation (see below);

presentation context: the form of the data for distribution and publication.

Researchers need to develop ways to move data between contexts, typically by exporting the data into some structured format that the software used for other contexts can read (see Thieberger 2004 for some examples). For example, a common working format for text annotation is Shoebox/Toolbox; this can be exported into rich text format (RTF) to be read by MS Word in order to produce presentation format PDF documents for printing and distribution. Table 2 gives examples of the different format types for the three contexts.

Table 2. Data formats in different contexts

         Working                               Archiving       Presentation
Text     Word, XLS, FMpro, Shoebox/Toolbox     XML             PDF, HTML
Audio    WAV                                   WAV, BWF        MP3, WMA, RA
Video    MPEG2                                 MPEG2, MPEG4    QuickTime, AVI, WMV
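The move from a working format to an archiving format in the Text row of Table 2 can be sketched in Python as follows. This is a minimal illustration and not the export built into Shoebox/Toolbox: it assumes records are written with backslash field markers, that each record begins with an \lx field, and that the markers \lx, \ps, \ge and \nt stand for lexeme, part of speech, gloss and note. The function names and the XML element names are invented for the example.

```python
import xml.etree.ElementTree as ET


def parse_toolbox(text, record_marker="lx"):
    """Split marker-per-line text into records, each a list of (marker, value)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # this sketch ignores blank lines and continuation lines
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []          # a new record starts at each record marker
            records.append(current)
        if current is not None:
            current.append((marker, value.strip()))
    return records


def records_to_xml(records):
    """Serialize records as XML, using each marker as an element name."""
    root = ET.Element("lexicon")
    for fields in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in fields:
            # Assumes every marker is also a valid XML element name.
            ET.SubElement(entry, marker).text = value
    return ET.tostring(root, encoding="unicode")


sample = (
    "\\lx bawurra\n"
    "\\ps n\n"
    "\\ge male red kangaroo\n"
    "\\nt used as a generic term for kangaroos\n"
)
xml_text = records_to_xml(parse_toolbox(sample))
```

The resulting string nests one entry element per record inside a lexicon element. A real archive format would also need a documented schema and the accompanying metadata.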

As an illustration, Figure 1 is a screenshot showing Shoebox format working context data for the Australian Aboriginal Guwamu language. The window on the top left shows lexical information; on the lower left is elicited sentence data with morpheme-by-morpheme glossing annotation and free translation; on the top right is descriptive metadata about the people involved in the project; and on the bottom right is metadata about abbreviations used in the lexical and sentence annotations. Note that the metadata is hypertextually linked to the data in the two left-hand windows, while the lexical root is hypertextually linked from the morpheme field in the sentence window, and the sentence number is linked from the example field in the lexicon.

A possible presentation form of the illustrated lexical entry is the following:

bawurra n
male red kangaroo, Note: used as a generic term for kangaroos, cf. gula, gumbarr, dhugandu, [SAW, WW], e.g. Gu206, Gu255
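A presentation form like the one above could be generated from structured working data along the following lines. This Python sketch is only an illustration: the dictionary keys and the function name are hypothetical, and reading the bracketed abbreviations as sources and the final codes as example sentence numbers is an assumption based on the description of the linked metadata.

```python
def render_entry(entry):
    """Format one lexical entry as a headword line plus a comma-separated body."""
    parts = [entry["gloss"]]
    if entry.get("note"):
        parts.append("Note: " + entry["note"])
    if entry.get("see_also"):
        parts.append("cf. " + ", ".join(entry["see_also"]))
    if entry.get("sources"):
        parts.append("[" + ", ".join(entry["sources"]) + "]")
    if entry.get("examples"):
        parts.append("e.g. " + ", ".join(entry["examples"]))
    return entry["headword"] + " " + entry["pos"] + "\n" + ", ".join(parts)


# Field values copied from the presentation form in the text; the keys are invented.
bawurra = {
    "headword": "bawurra",
    "pos": "n",
    "gloss": "male red kangaroo",
    "note": "used as a generic term for kangaroos",
    "see_also": ["gula", "gumbarr", "dhugandu"],
    "sources": ["SAW", "WW"],
    "examples": ["Gu206", "Gu255"],
}
presentation = render_entry(bawurra)
```

Keeping the fields separate in the working data and assembling the printed form only at this stage means the same record can feed several presentation layouts.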
