Journal for Language Technology and Computational Linguistics


Petran, Bollmann, Dipper, Klein

Character alignments (char) Finally, there is a layer that aligns the characters of the annotated forms with those of the normalized forms. For instance, a word pair such as ‘chindelin’–‘kindelîn’ (‘children’) gives rise to the mappings ch=k, i=i, n=n, d=d, e=e, l=l, i=î, n=n. The mappings can be used to investigate spelling variation between different dialect regions.
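A minimal sketch of how such an alignment layer could be represented and queried is given here. The pair list reproduces the mappings named above; the function names and the idea of counting non-identity mappings are illustrative assumptions and do not describe the corpus's actual data format.

```python
from collections import Counter

# Aligned units for 'chindelin' / 'kindelîn', as listed in the text.
ALIGNMENT = [("ch", "k"), ("i", "i"), ("n", "n"), ("d", "d"),
             ("e", "e"), ("l", "l"), ("i", "î"), ("n", "n")]


def format_mappings(alignment):
    """Render each aligned pair in the 'source=target' notation."""
    return [f"{src}={tgt}" for src, tgt in alignment]


def variation_counts(alignments):
    """Count mappings whose two sides differ, over many aligned word pairs."""
    counts = Counter()
    for alignment in alignments:
        counts.update((src, tgt) for src, tgt in alignment if src != tgt)
    return counts


print(", ".join(format_mappings(ALIGNMENT)))
print(variation_counts([ALIGNMENT]))
```

Grouping such counts by the dialect region of each text (metadata not modelled in this sketch) would be one way to compare regional spelling habits.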

4 Semi-automatic annotation

Owing to the history of the corpus (cf. Sec. 2), the annotation process as a whole was quite eclectic. The pioneering work on the Cologne corpus used a suite of programs written in Macro SPITBOL for semi-automatic, rule-based part-of-speech and morphology annotation (Klein, 1991). At the core of this suite is an annotated index of normalized forms of Middle High German words based on the modernized tokenization.

The form to be annotated is analyzed using the known character alignments for Middle High German spelling and dialectal variations and inflectional affixes. Based on this analysis, a ranked list of approximate matches is returned from the index of normalized forms. Each entry in the list carries lemma and part-of-speech (POS) annotations, as well as a pre-selection of possible morphology annotations for the recognized affixes. The index already ranks the suggestions according to the naive probability of each one; an additional basic rule-based syntactic analysis re-ranks them to fit the token context. A human annotator then selects the correct annotation from the list, or adds the lemma to the index if the correct annotation is missing.
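The lookup step described above can be approximated as follows. This is only a sketch under stated assumptions: the rule table `RULES`, the miniature `INDEX`, and the function names are hypothetical, and the original SPITBOL programs may work quite differently. The idea is to segment the written form into known units, expand each unit into its possible normalized counterparts, and keep the variants found in the index.

```python
from itertools import product

# Hypothetical alignment rules: written unit -> possible normalized units.
RULES = {"ch": ("ch", "k"), "i": ("i", "î")}

# Hypothetical miniature index of normalized forms.
INDEX = {"kindelîn", "kint"}


def segment(form, rules):
    """Split a form into units, preferring the longest unit known to the rules."""
    longest = max(map(len, rules))
    units, i = [], 0
    while i < len(form):
        for size in range(longest, 0, -1):
            chunk = form[i:i + size]
            # Fall back to a single character when no rule unit matches.
            if size == 1 or chunk in rules:
                units.append(chunk)
                i += len(chunk)
                break
    return units


def lookup(form, rules, index):
    """Return all normalized forms in the index reachable via the rules."""
    options = [rules.get(unit, (unit,)) for unit in segment(form, rules)]
    variants = {"".join(combo) for combo in product(*options)}
    return sorted(variants & index)


print(lookup("chindelin", RULES, INDEX))
```

Enumerating every variant is exponential in the number of ambiguous units, so a realistic system would prune or score candidates during expansion; the exhaustive version is kept here for readability.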

The opportunity for the annotator to add lemmas to the index ensured that the coverage of the index grew as it was used in more projects of wider scope. After the annotation of the Cologne corpus, the index was found to have a coverage of 90%, with the correct annotation presented as first choice in 60% of the cases. Since the beginning of the annotation efforts predates even standardized tagsets for modern German, customized tagsets were originally used for parts of speech and morphology. They were later mapped to HiTS tags (Dipper et al., 2013).
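The coverage and first-choice figures quoted above can be read as two simple ratios. The sketch below shows one way to compute them; it assumes that both percentages are taken over all annotated tokens (the text does not state the denominator), and the toy data and function name are invented for illustration.

```python
def index_metrics(tokens):
    """Compute (coverage, first-choice rate) over (suggestions, gold) pairs.

    `suggestions` is the ranked list offered for a token and `gold` is the
    annotation finally chosen by the annotator.
    """
    total = len(tokens)
    covered = sum(1 for suggestions, gold in tokens if gold in suggestions)
    first = sum(1 for suggestions, gold in tokens if suggestions[:1] == [gold])
    return covered / total, first / total


# Toy data with placeholder labels, not corpus figures.
TOKENS = [
    (["a", "b"], "a"),  # covered, correct analysis ranked first
    (["a", "b"], "b"),  # covered, correct analysis ranked second
    (["c"], "c"),       # covered, ranked first
    ([], "d"),          # not covered: the lemma would have to be added
    (["e", "f"], "e"),  # covered, ranked first
]

coverage, first_choice = index_metrics(TOKENS)
print(coverage, first_choice)
```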

Annotating a sentence – example Table 1 shows part of the analysis for the beginning of a sentence from the manuscript “Rheinisches Marienlob”, a poem in praise of the Virgin Mary: ‘Wiſe Dine Burſte i¯n dinen lif...’ (‘Show your breasts [that have suckled Jesus] and your body [that has borne Jesus]...’).

The first token has four suggestions: the adjective (ADJ) ‘wîs(e)’ (‘wise’), the feminine noun (F) ‘wîse’ (‘meadow’), the weak verb (SwV) ‘wîsen’ (‘to know, to show’), and the adjective (ADJ) ‘wîz’ (‘white’). The system ranked the choices purely according to their naive probabilities: no syntactic context has been encountered yet, since this is the beginning of the sentence. This means that the correct analysis, the weak verb, is not ranked very highly in this case, and the annotation has to be corrected. The correct analysis comes with a number of suggestions for the morphology. To generate the suggestions, the inflectional paradigm of this verb was prefiltered according to the inflectional affixes the system recognized. Again, the human annotator has to select the correct analysis (2SGB, 2nd person singular imperative). The following tokens are largely unambiguous; only the correct morphological analysis has to be selected manually here. Table 2 shows the corrected annotation for this fragment.

ReM: A reference corpus of Middle High German

Form    Lemma    POS       Morph
Wiſe    wîs(e)   ADJ       NP/-/0/NSmfnw/NASf/ASnw/NAP
        wîse     F         NS/AS/GFS/NAP
        wîsen    SwV       1SG/3SGK/1PG/2SGB/i
        wîz      ADJ       NSmfnW/NASf/ASnw/NAP
dine    dîn      PronPoss  NP/NSf/ASf/AP
burſte  brust    F(u)      NP/AP/GP/GS/DS
i¯n     unde     Konj      –
dinen   dîn      PronPoss  ASm/DP/DSm/DSn
lif     lîb                AS/NS/DS
        loufen   stv7      3SVI/1SVI

Table 1: Lemma and annotation suggestions for the beginning of a sentence from “Rheinisches Marienlob”. The leftmost column has the form as it was transcribed from the manuscript.

Form    Lemma  POS       Morph
Wiſe    wîsen  SwV       2SGB
dine    dîn    PronPoss  AP
burſte  brust  F(u)
i¯n     unde   Konj      –
dinen   dîn    PronPoss  ASm
lif     lîb              AS

Table 2: The manually corrected annotation.
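As a rough illustration of the workflow behind Tables 1 and 2, the sketch below ranks the four suggestions for ‘Wiſe’ by a naive probability, optionally re-ranks them with a context rule, and splits the slash-separated morphology field into the options an annotator chooses from. The lemmas, POS tags, and morphology strings come from Table 1 and the surrounding text; the numeric priors, the `context_bonus` mechanism, and all function names are assumptions made for illustration and are not part of the original SPITBOL suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    lemma: str
    pos: str
    morph: str    # slash-separated options, as printed in Table 1
    prior: float  # hypothetical value; the text gives no numbers


SUGGESTIONS = [
    Suggestion("wîs(e)", "ADJ", "NP/-/0/NSmfnw/NASf/ASnw/NAP", 0.4),
    Suggestion("wîse", "F", "NS/AS/GFS/NAP", 0.3),
    Suggestion("wîsen", "SwV", "1SG/3SGK/1PG/2SGB/i", 0.2),
    Suggestion("wîz", "ADJ", "NSmfnW/NASf/ASnw/NAP", 0.1),
]


def rank(suggestions, context_bonus=None):
    """Order suggestions by prior plus an optional per-POS context bonus."""
    bonus = context_bonus or {}
    return sorted(suggestions,
                  key=lambda s: s.prior + bonus.get(s.pos, 0.0),
                  reverse=True)


def morph_options(suggestion):
    """Split the morphology field into the individual tags on offer."""
    return suggestion.morph.split("/")


def select(suggestions, lemma, morph):
    """Mimic the annotator's choice; reject anything that was not offered."""
    for s in suggestions:
        if s.lemma == lemma and morph in morph_options(s):
            return (s.lemma, s.pos, morph)
    raise ValueError(f"{lemma}/{morph} is not among the suggestions")


ranked = rank(SUGGESTIONS)  # sentence-initial: no context, priors only
print([s.lemma for s in ranked])
print(select(ranked, "wîsen", "2SGB"))
```

With these made-up priors the weak verb sits in third place, which mirrors the situation described for the sentence-initial token; passing a bonus such as `{"SwV": 0.5}` to `rank` shows how a context rule could move it to the top.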

The annotator has selected a weak verb (SwV) in the 2nd person singular imperative form (2SGB) here, followed by a possessive pronoun (PronPoss) in the accusative plural (AP), and so on. However, the annotations need to be converted into HiTS-like tags, which have more categories (see Sec. 3) and more distinctions. This is not without its own challenges, as Table 3 below shows.

Mapping to HiTS In some cases, such as for the first token, the mapping from the internal tagset to HiTS is very straightforward. The internal tagset has the SwV POS tag indicating a weak verb, and the 2SGB morphology tag for a second person singular imperative form. This was re-distributed to a pos (token) tag for a full verb imperative

JLCL 2016 – Band 31 (2)
