Petran, Bollmann, Dipper, Klein
Normalization (norm) This layer contains automatically created word forms that
closely correspond to word forms as used in traditional editions of historical manuscripts
in German. For instance, a diplomatic form like ‘chindelin’ (‘children’) is mapped to
the form ‘kindelîn’.
Tokenization (tokenization) This layer annotates cases of diverging word boundaries,
as in Ex. (1). The annotation follows the HiTS guidelines (Dipper et al., 2013). The
tags encode two properties. First, they indicate whether the modernized form merges
several historical forms into one modern form (Univerbierung, label U) or splits one
historical form into several modern ones (Multiverbierung, labels M..1, M..2, etc. for
the different forms). Second, they encode which character marks the word boundary:
a space (label S), a hyphen (H), or camel case, i.e. a word-internal capitalized
letter (Binnenmajuskel, B). The tags also encode whether the tokenization involves a
line break (L). For some examples, see (2) (line breaks are marked by ‘C’).
(2) a. dipl   ſo   biſtu
       anno   so   bis   tu
       tok         MS1   MS2
       ‘so you are’

    b. dipl   Alſo   der   lichaname   er ſtír Cbet
       anno   Also   der   lichaname   erstirbet
       tok    –      –     –           US UL
       ‘as the body dies’

    c. dipl   be durfeter
       anno   bedurfet   er
       tok    US MS1     MS2
       ‘you[pl] need’
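The tag scheme described above can be unpacked mechanically. The following is a minimal sketch, assuming that each tag consists of an operation letter (U or M), a boundary letter (S, H, B, or L), and, for splits, a trailing part index, and that a cell of the tok layer may hold several space-separated tags, as in (2b) and (2c). The names `TokTag` and `parse_tok_cell` are hypothetical and not part of the corpus tooling.

```python
import re
from typing import List, NamedTuple, Optional

# Labels taken from the description of the tokenization layer.
OPERATION = {"U": "merge (Univerbierung)", "M": "split (Multiverbierung)"}
BOUNDARY = {"S": "space", "H": "hyphen", "B": "camel case", "L": "line break"}

# Assumed tag shape: operation letter, boundary letter, optional part index.
TAG_RE = re.compile(r"(?P<op>[UM])(?P<boundary>[SHBL])(?P<index>\d+)?")


class TokTag(NamedTuple):
    operation: str
    boundary: str
    index: Optional[int]  # only present for the parts of a split


def parse_tok_cell(cell: str) -> List[TokTag]:
    """Parse one cell of the tok layer, e.g. 'US MS1'; '–' means no tag."""
    tags = []
    for raw in cell.split():
        if raw in {"–", "-"}:
            continue
        match = TAG_RE.fullmatch(raw)
        if match is None:
            raise ValueError(f"unrecognized tokenization tag: {raw!r}")
        index = int(match.group("index")) if match.group("index") else None
        tags.append(
            TokTag(OPERATION[match.group("op")], BOUNDARY[match.group("boundary")], index)
        )
    return tags
```

Applied to the cell `US MS1` from (2c), the sketch yields one merge at a space and the first part of a split at a space, which matches the reading that ‘be’ and ‘durfet’ are joined while ‘durfeter’ is divided.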
Punctuation (punc) This layer encodes original punctuation marks and modern sen-
tence and clause boundaries. Original punctuation marks correspond to modern sentence
or clause boundaries in about 2/3 of the cases.
Modern boundaries are always annotated at the last (modernized) word in the sentence
or clause. The labels used here are DE, EE, IE, and QE, which stand for “end of a
declarative / exclamative / imperative / interrogative clause”, respectively. Other
annotated segment boundaries include dependent and appositive clauses and
enumerations (labels S*, N*, NE).
Original punctuation marks that correspond to some segment boundary are annotated
with the tag $E, see (3).
(3) dipl   ſo ne mach i nemen   gotegelichen    ·
    anno   so ne mach niemen    gote gelichen   .
    punc                        DE              $E
    ‘so nobody can be like god’
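As an illustration of how the punc layer could be used, the sketch below groups modernized tokens into clauses using the four clause-end labels. It assumes the input is a flat list of (form, label) pairs with an empty string for unlabelled tokens, and it makes the design choice of attaching an original punctuation mark tagged $E to the clause it closes. The function name `split_clauses` and the input format are hypothetical.

```python
from typing import List, Tuple

# Clause-end labels as described for the punc layer.
CLAUSE_END = {
    "DE": "declarative",
    "EE": "exclamative",
    "IE": "imperative",
    "QE": "interrogative",
}
ORIGINAL_MARK = "$E"


def split_clauses(tokens: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (form, punc_label) pairs into clauses closed by DE/EE/IE/QE."""
    clauses: List[Tuple[str, List[str]]] = []
    current: List[str] = []
    for form, label in tokens:
        if label == ORIGINAL_MARK and not current and clauses:
            # Assumption: a mark directly after a boundary belongs to the closed clause.
            clauses[-1][1].append(form)
            continue
        current.append(form)
        if label in CLAUSE_END:
            clauses.append((CLAUSE_END[label], current))
            current = []
    if current:
        # Trailing material without a clause-end label.
        clauses.append(("unterminated", current))
    return clauses


# Modernized tokens of example (3) with their punc labels.
EXAMPLE_3 = [
    ("so", ""), ("ne", ""), ("mach", ""), ("niemen", ""),
    ("gote", ""), ("gelichen", "DE"), (".", "$E"),
]
```

For `EXAMPLE_3` this produces a single declarative clause that ends with the modernized punctuation mark.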
JLCL
ReM: A reference corpus of Middle High German
Linguistic annotations: part of speech (pos), morphology (infl), lemma The original
annotations have been created semi-automatically (Klein, 2001). In the ReM corpus,
they have been mapped to tags that largely follow the HiTS guidelines (Dipper et al.,
2013). This means, among other things, that words are annotated in two ways, once as
a token (instance) and once as a type. The token annotation takes the actual context
into account, whereas the type annotation encodes general properties of a word. Ex. (4) shows
that the word ‘gebornen’ (‘born’) is basically a verb (past participle), which in this
context is used like an adjective. Hence, the type is annotated with the part of speech
“VVPP” (verb past participle), and the token is annotated with “ADJN” (postnominal
adjective).
(4) dipl          diu           chindelin   niu                 gebornen
    anno          diu           chindelin   niu                 gebornen
    norm          diu           kindelîn    niu                 geborenen
    pos (token)   DDART         NA          ADJD                ADJN
    pos (type)    DD            NA          ADJ                 VVPP
    lemma         der           kindelîn    niuwe               ge-bor(e)n
    lemmaID       29817000      89652000    121830000           48162000
    infl          Neut.Nom.Pl   Nom.Pl      Pos.Neut.Nom.Pl.0   –
    inflClass     –             st.Neut     –                   –
    ‘the newborn children’
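The distinction between token and type annotation in (4) can be made concrete with a small data structure. The sketch below is illustrative only: the class `Token` and the function `category_shifts` are hypothetical names, and the prefix test is a heuristic assumption (a token tag that merely extends the type tag, such as DD to DDART, is treated as a refinement rather than a change of category). It is not a rule taken from the HiTS guidelines.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    dipl: str
    norm: str
    pos_token: str  # context-dependent part of speech
    pos_type: str   # general part of speech of the word
    lemma: str


def category_shifts(tokens: List[Token]) -> List[Token]:
    """Return tokens whose contextual tag is not a refinement of the type tag."""
    # Heuristic assumption: 'DDART' refines 'DD', but 'ADJN' does not refine 'VVPP'.
    return [t for t in tokens if not t.pos_token.startswith(t.pos_type)]


# The four words of example (4).
EXAMPLE_4 = [
    Token("diu", "diu", "DDART", "DD", "der"),
    Token("chindelin", "kindelîn", "NA", "NA", "kindelîn"),
    Token("niu", "niu", "ADJD", "ADJ", "niuwe"),
    Token("gebornen", "geborenen", "ADJN", "VVPP", "ge-bor(e)n"),
]
```

Under this heuristic, only ‘gebornen’ is reported, which corresponds to the participle used like an adjective.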
In addition to the lemma, a lemma ID is provided, which links to the corresponding
lemma in the online lexicon ‘Mittelhochdeutsches Wörterbuch’.⁵
In Ex. (4), the layer inflClass refers to the token-specific inflection class. It is specified
for nouns and verbs and represents the declension or conjugation class of the respective
lemma in the given context. In the case of nouns, a preceding article and/or adjective
can help determine the gender of a noun (e.g. ‘Neut’). For instance, like many
other nouns in Middle High German, the lemma ‘slange’ (‘snake’) is underspecified for
gender and frequently occurs in either masculine or feminine gender. Ex. (5) shows cases
where the context helps (a) or does not help (b) in disambiguating gender. The layer
inflClass (type) shows the general, ambiguous properties of the noun; the layer inflClass
(token) shows the context-specific features.
(5) a. dipl                So   der   hirz      den   ſlangen       ſihit
       inflClass (token)   –    –     st.Masc   –     wk.Masc       –
       inflClass (type)    –    –     st.Masc   –     wk.Masc,Fem   –
       ‘as the deer sees the snake’

    b. dipl                Vo   ſlange
       inflClass (token)   –    wk.Masc,Fem
       inflClass (type)    –    wk.Masc,Fem
       ‘of snakes’
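The contrast between (5a) and (5b) amounts to comparing the set of genders at the token level with the set at the type level. The following sketch assumes the value format seen in the examples for nouns (a declension label, a dot, and a comma-separated list of genders) and treats ‘–’ as unspecified. The function names are hypothetical, and the format of verb inflection classes is not covered.

```python
from typing import FrozenSet, Optional, Tuple


def parse_infl_class(value: str) -> Optional[Tuple[str, FrozenSet[str]]]:
    """Split e.g. 'wk.Masc,Fem' into ('wk', {'Masc', 'Fem'}); '–' yields None."""
    if value in {"–", "-", ""}:
        return None
    declension, _, genders = value.partition(".")
    return declension, frozenset(genders.split(","))


def context_disambiguates(token_value: str, type_value: str) -> bool:
    """True if the token-level genders are a proper subset of the type-level ones."""
    token_cls = parse_infl_class(token_value)
    type_cls = parse_infl_class(type_value)
    if token_cls is None or type_cls is None:
        return False
    return token_cls[1] < type_cls[1]
```

For ‘ſlangen’ in (5a), `context_disambiguates("wk.Masc", "wk.Masc,Fem")` is true, whereas for ‘ſlange’ in (5b) both layers carry `wk.Masc,Fem` and the result is false.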
5 http://www.mhdwb-online.de/lemmaliste/
JLCL 2016 – Band 31 (2)