Petran, Bollmann, Dipper, Klein
Character alignments (char) Finally, there is a layer that aligns the characters of the
annotated forms with those of the normalized forms. For instance, a word pair such as ‘chindelin’–
‘kindelîn’ (‘children’) gives rise to the mappings ch=k, i=i, n=n, d=d, e=e, l=l, i=î,
n=n. The mappings can be used to investigate spelling variation between different
dialect regions.
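As a minimal sketch of how such mappings could be tallied, the example alignment can be stored as a list of pairs and filtered for the mappings whose two sides differ. The pair representation and the helper name `variation_pairs` are assumptions made for illustration; they do not describe the corpus format.

```python
from collections import Counter

# Alignment for 'chindelin' / 'kindelîn', copied from the example above.
# Storing it as (source, normalized) pairs is an assumption of this sketch.
alignment = [("ch", "k"), ("i", "i"), ("n", "n"), ("d", "d"),
             ("e", "e"), ("l", "l"), ("i", "î"), ("n", "n")]


def variation_pairs(alignments):
    """Count the mappings whose two sides differ (hypothetical helper)."""
    return Counter((src, norm) for src, norm in alignments if src != norm)


# Concatenating each side of the alignment spells out the original word pair.
source_form = "".join(src for src, _ in alignment)
normalized_form = "".join(norm for _, norm in alignment)

print(source_form, normalized_form)
print(variation_pairs(alignment))
```

Summing such counters over all word pairs of a text or region would give a simple profile of its spelling variation.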
4 Semi-automatic annotation
Owing to the history of the corpus (cf. Sec. 2), the annotation process as a whole was
quite eclectic. The pioneering work on the Cologne corpus used a suite of programs
written in Macro SPITBOL for semi-automatic, rule-based part-of-speech and morphology
annotation (Klein, 1991). At the core of this suite is an annotated index of
normalized forms of Middle High German words based on the modernized tokenization.
The form to be annotated is analyzed using the known character alignments for
Middle High German spelling and dialectal variation, as well as inflectional affixes. Based
on this analysis, a ranked list of approximate matches is returned from the index of
normalized forms. The list contains lemma and part-of-speech (POS) annotations, as well as a
pre-selection of possible morphology annotations for the recognized affixes. The index
already has rankings according to the naive probability of each suggestion; an additional
basic rule-based syntactic analysis re-ranks the suggestions appropriately for the token
context. A human annotator then selects the correct annotation from the list, or adds
the lemma to the index if the correct annotation is missing.
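The lookup step can be illustrated with a small sketch. It is not the SPITBOL implementation: the variant rules, the affix list, the key-based matching and all names (`Entry`, `match_key`, `suggest`) are simplifying assumptions, and the priors are invented. Only the lemmas and POS tags are taken from Table 1. The syntactic re-ranking step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str     # normalized form as stored in the (toy) index
    lemma: str
    pos: str
    prior: float  # invented for illustration; not a figure from the corpus


# Toy index. Lemmas and POS tags follow Table 1; the priors are made up.
INDEX = [
    Entry("wîs", "wîs(e)", "ADJ", 0.4),
    Entry("wîse", "wîse", "F", 0.3),
    Entry("wîsen", "wîsen", "SwV", 0.2),
    Entry("wîz", "wîz", "ADJ", 0.1),
    Entry("lîb", "lîb", "M", 0.5),
]

# Spelling variants collapsed before lookup (assumed, simplified rules).
VARIANTS = (("ch", "k"), ("ſ", "s"), ("î", "i"), ("z", "s"))
# Inflectional endings stripped before lookup, longest first (assumed).
AFFIXES = ("en", "e")


def match_key(form: str) -> str:
    """Reduce a form to a crude key shared by its spelling variants."""
    key = form.lower()
    for variant, canonical in VARIANTS:
        key = key.replace(variant, canonical)
    for affix in AFFIXES:
        if key.endswith(affix) and len(key) > len(affix) + 1:
            return key[: -len(affix)]
    return key


def suggest(form: str) -> list[Entry]:
    """Return index entries sharing the form's key, highest prior first."""
    key = match_key(form)
    hits = [entry for entry in INDEX if match_key(entry.form) == key]
    return sorted(hits, key=lambda entry: entry.prior, reverse=True)


for entry in suggest("Wiſe"):
    print(entry.lemma, entry.pos, entry.prior)
```

With these toy rules, the manuscript form ‘Wiſe’ retrieves the four candidates listed for it in Table 1, ordered by the invented priors; a context-sensitive re-ranking would then operate on this list.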
The opportunity for the annotator to add lemmas to the index ensured that the
coverage of the index grew as the index was used in more projects of wider scope. After the
annotation of the Cologne corpus, the index was found to have a coverage of 90%, with the
correct annotation presented as first choice in 60% of cases. Since the beginning of the
annotation efforts predates even standardized tagsets for modern German, customized
tagsets were originally used for parts of speech and morphology. They were later
mapped to HiTS tags (Dipper et al., 2013).
Annotating a sentence — example Table 1 shows part of the analysis for the beginning
of a sentence from the manuscript “Rheinisches Marienlob”, a poem in praise of the
Virgin Mary: ‘Wiſe Dine Burſte i¯n dinen lif. . .’ (‘Show your breasts [that have suckled
Jesus] and your body [that has borne Jesus]. . .’).
The first token has four suggestions: the adjective (ADJ) ‘wîs(e)’ (‘wise’), the
feminine noun (F) ‘wîse’ (‘meadow’), the weak verb (SwV) ‘wîsen’ (‘to know, to show’),
and the adjective (ADJ) ‘wîz’ (‘white’). The system ranked the choices purely according
to their naive probabilities, because no syntactic context has been encountered yet at
the beginning of the sentence. This means that the correct analysis, the weak verb,
is not ranked very highly in this case, and the annotation has to be corrected. The
correct analysis comes with a number of suggestions for the morphology. To generate
the suggestions, the inflectional paradigm of this verb was prefiltered according to the
inflectional affixes the system recognized. Again, the human annotator has to select
ReM: A reference corpus of Middle High German
| Form | Lemma | POS | Morph |
|---|---|---|---|
| Wiſe | wîs(e) | ADJ | NP/-/0/NSmfnw/NASf/ASnw/NAP |
| | wîse | F | NS/AS/GFS/NAP |
| | wîsen | SwV | 1SG/3SGK/1PG/2SGB/i |
| | wîz | ADJ | NSmfnW/NASf/ASnw/NAP |
| dine | dîn | PronPoss | NP/NSf/ASf/AP |
| burſte | brust | F(u) | NP/AP/GP/GS/DS |
| i¯n | unde | Konj | – |
| dinen | dîn | PronPoss | ASm/DP/DSm/DSn |
| lif | lîb | M | AS/NS/DS |
| | loufen | stv7 | 3SVI/1SVI |
Table 1: Lemma and annotation suggestions for the beginning of a sentence from “Rheinisches
Marienlob”. The leftmost column has the form as it was transcribed from the manuscript.
the correct analysis (2SGB, 2nd person imperative). The following tokens are largely
unambiguous; only the correct morphological analysis has to be selected manually here.
Table 2 shows the corrected annotation for this fragment.
| Form | Lemma | POS | Morph |
|---|---|---|---|
| Wiſe | wîsen | SwV | 2SGB |
| dine | dîn | PronPoss | AP |
| burſte | brust | F(u) | AP |
| i¯n | unde | Konj | – |
| dinen | dîn | PronPoss | ASm |
| lif | lîb | M | AS |
Table 2: The manually corrected annotation.
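The relation between the two tables can be made explicit with a short sketch: each manually selected morphology tag in Table 2 should be one of the slash-separated options offered for the same lemma and POS in Table 1. The dictionary layout and the helper `is_offered` are assumptions for illustration; the tag strings are copied from the tables, and the conjunction, which carries no morphology, is omitted.

```python
# Morphology suggestions from Table 1 for the analyses that were finally
# chosen, keyed by (form, lemma, POS). The layout is an assumption of this sketch.
suggested = {
    ("Wiſe", "wîsen", "SwV"): "1SG/3SGK/1PG/2SGB/i",
    ("dine", "dîn", "PronPoss"): "NP/NSf/ASf/AP",
    ("burſte", "brust", "F(u)"): "NP/AP/GP/GS/DS",
    ("dinen", "dîn", "PronPoss"): "ASm/DP/DSm/DSn",
    ("lif", "lîb", "M"): "AS/NS/DS",
}

# Manually selected morphology tags from Table 2.
chosen = {
    ("Wiſe", "wîsen", "SwV"): "2SGB",
    ("dine", "dîn", "PronPoss"): "AP",
    ("burſte", "brust", "F(u)"): "AP",
    ("dinen", "dîn", "PronPoss"): "ASm",
    ("lif", "lîb", "M"): "AS",
}


def is_offered(options: str, choice: str) -> bool:
    """Check whether a tag is among the slash-separated suggestions."""
    return choice in options.split("/")


# Analyses whose selected tag was not among the suggestions (expected: none).
unoffered = [key for key, tag in chosen.items()
             if not is_offered(suggested[key], tag)]
print(unoffered)
```

An empty result indicates that, for this fragment, the annotator only had to pick from the pre-selected options and did not need to enter a tag by hand.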
The annotator has selected a weak verb (SwV) in the 2nd person singular imperative form
(2SGB) here, followed by a possessive pronoun (PronPoss) in accusative case and plural
number (AP), and so on. However, the annotations need to be converted into HiTS-like
tags, which have more categories (see Sec. 3) and more distinctions. This is not without
its own challenges, as Table 3 below shows.
Mapping to HiTS In some cases, such as for the first token, the mapping from the
internal tagset to HiTS is very straightforward. The internal tagset has the SwV POS
tag indicating a weak verb, and the 2SGB morphology tag for a second person singular
imperative form. This was re-distributed to a pos (token) tag for a full verb imperative
JLCL 2016 – Band 31 (2)