Petran, Bollmann, Dipper, Klein
Token         Wise    dine            burste     in    dinen           lif
pos (token)   VVIMP   DPOSA           NA         KON   DPOSA           NA
pos (type)    VV      DPOS            NA         KO    DPOS            NA
infl          Sg.2    Fem.Akk.Pl.st   Akk.Pl     –     Masc.Akk.Sg.st  Akk.Sg
inflClass     wk      –               st(u).Fem  –     –               st.Masc

Table 3: Final annotations for this fragment. Lemma and other annotations are omitted here, but
are visible in the final corpus. The tokens are shown in simplified spelling.
(VVIMP), a pos (type) tag for a full verb, infl showing only 2nd person singular form,
and an inflClass tag showing the weak inflection class. The second token is annotated
as a possessive determinative that precedes its noun phrase (DPOSA). The position is not
explicitly annotated in the internal tagset, but it can easily be inferred, since preceding
the noun phrase is the default case for determinatives.
Difficulties arise in cases where HiTS makes distinctions that are not made in the
internal tagset. For example, the noun ‘burſte’ (‘breast’) is annotated as belonging
to the strong inflection class in HiTS, but the internal tagset does not capture this
information. This had to be solved by a combination of the analysis of the lemma form
and list lookup: if the lemma ends in a consonant, the noun has a strong inflection
class. Lemmas ending in ‘-e’ have to be looked up for weak or strong inflection classes.
Lemmas ending in other vowels are always weakly inflected. Similar lists had to be
built for other parts of speech that lacked distinctions, such as pronouns, articles and
numerals, as well as verbs, auxiliary verbs, and modal verbs.
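
The lemma-based rule for nouns described above can be sketched as follows. This is a minimal illustration, not the tooling actually used for ReM: the function name, the vowel inventory, and the two lookup entries are assumptions made for this example, and the real lists are not reproduced here.

```python
# Characters treated as vowels. Including umlauts and circumflexed vowels is
# an assumption about how lemmas are spelled.
VOWELS = set("aeiouyäöüâêîôû")

# Hypothetical lookup list for lemmas ending in '-e' (illustrative entries
# only, not taken from the actual lists).
E_FINAL_CLASSES = {"gebe": "st", "zunge": "wk"}


def guess_noun_class(lemma, e_final_classes=E_FINAL_CLASSES):
    """Return 'st', 'wk', or None for a lemma in '-e' missing from the list."""
    lemma = lemma.strip().lower()
    if not lemma:
        raise ValueError("empty lemma")
    last = lemma[-1]
    if last not in VOWELS:
        # Lemma ends in a consonant: strong inflection class.
        return "st"
    if last == "e":
        # Lemma ends in '-e': the class has to be looked up.
        return e_final_classes.get(lemma)
    # Lemma ends in any other vowel: weak inflection class.
    return "wk"
```

Returning None for an unlisted lemma in '-e' keeps the gap visible, so that such cases can be added to the list rather than silently guessed.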
Some distinctions could not be reconstructed by looking at the token alone. One
example of this is the annotation of pronominal adverbs that introduce a relative clause
(as opposed to interrogative usage) as PAVREL. Reconstructing these distinctions would
have required access to the syntactic context, which the tools do not provide. In that
sense, the tagset used here represents a subset of the entire HiTS tagset.
The final output of this annotation process is a flat XML file based on the modernized
tokenization only; the historical tokenization has to be inferred using the transcription
standards (see Sec. 3). It is converted into CorA-XML format (see Sec. 5) to regain
flexibility with regard to the tokenization layers.
5 CorA-XML document format
For further processing of the annotated data, we choose to convert it into the CorA-XML
document format. This XML-based format was originally developed for the web-based
annotation tool CorA (Bollmann et al., 2014; see footnote 6), and is specifically designed for the
needs of historical documents. CorA is actively used to annotate historical texts for
the reference corpora of Early New High German (ReF) and Middle Low German/Low
Rhenish (ReN), as well as the Anselm corpus of Early New High German (Dipper
and Schultz-Balluff, 2013). Converting ReM to the same format therefore significantly
Footnote 6: https://www.linguistics.rub.de/comphist/resources/cora/
ReM: A reference corpus of Middle High German
Figure 1: Simplified CorA-XML representation of “ſo biſtu” with annotations
Figure 2: Simplified example of the layout hierarchy in CorA-XML
increases reusability of tools and facilitates further processing of the data. Furthermore,
we are actively working on an automatic conversion from CorA-XML to a TEI-compatible
format, which will open up the data for use with an even wider range of existing tools.
CorA-XML distinguishes between two different tokenization layers, whose elements
are represented by two distinct tags, corresponding to the distinction
between diplomatic and annotated tokens in ReM (cf. Sec. 3). Since there can be a one-
to-many (or even many-to-many) relationship between elements of these layers (as in the
‘biſtu’ example from Fig. 4 below), they are always wrapped by a virtual element
which establishes this correspondence. Within each layer, different representations of the
wordforms can be included, e.g., a UTF-8 representation conserving special characters
(such as ‘ſ’), or a pure ASCII representation (mapping ‘ſ’ to ‘s’). On the annotated
tokenization layer, arbitrary annotations can be added to each token, encoding the
linguistic layer and punctuation layer described in Sec. 3. Figure 1 gives a simplified
example of the CorA-XML representation for the sequence ‘ſo biſtu’ from Figure 4.
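
To make the two-layer structure more concrete, the following sketch reads a small hand-written document with Python's standard library. It is only an approximation under stated assumptions: the element names (`token`, `dipl`, `mod`, `pos`), the attribute names (`utf`, `tag`), the split of ‘biſtu’ into two annotated tokens, and the placeholder tag values are invented for this illustration and should be checked against the actual CorA-XML specification.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. All element names, attribute names, and tag values
# are assumptions for illustration, not copied from a real CorA-XML file.
SAMPLE = """
<text>
  <token id="t1">
    <dipl id="t1_d1" utf="ſo"/>
    <mod id="t1_m1" utf="ſo"><pos tag="POS1"/></mod>
  </token>
  <token id="t2">
    <dipl id="t2_d1" utf="biſtu"/>
    <mod id="t2_m1" utf="biſt"><pos tag="POS2"/></mod>
    <mod id="t2_m2" utf="u"><pos tag="POS3"/></mod>
  </token>
</text>
"""


def to_ascii(form):
    """Map the long s to a plain 's' (one example of an ASCII mapping)."""
    return form.replace("ſ", "s")


def read_tokens(xml_text):
    """Return one dict per virtual token with its two tokenization layers."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for tok in root.iter("token"):
        dipl = [d.get("utf") for d in tok.findall("dipl")]
        mod = []
        for m in tok.findall("mod"):
            pos = m.find("pos")
            mod.append({
                "utf": m.get("utf"),
                "ascii": to_ascii(m.get("utf")),
                "pos": pos.get("tag") if pos is not None else None,
            })
        tokens.append({"id": tok.get("id"), "dipl": dipl, "mod": mod})
    return tokens
```

In this sketch, the second virtual token groups one diplomatic element with two annotated elements, which is how a one-to-many correspondence between the layers would surface in the parsed result.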
Layout information is encoded via a hierarchy of layout elements, namely ‘pages’,
‘columns’, and ‘lines’. Each instance of an element contains a pointer to one or more
elements of the next lower type in the hierarchy; i.e., pages refer to columns, which in
turn refer to lines. Each ‘line’ element finally refers to one or more diplomatic tokens.
Figure 2 provides an example visualization of this hierarchy. A valid layout specification
in a CorA-XML document requires that each diplomatic token is contained in the span
of exactly one ‘line’ element, which makes it possible to derive an exact page, column, and line
specification for each diplomatic token.
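
The validity condition on the layout hierarchy can be checked with a short routine such as the one below. This is a sketch under simplifying assumptions: the hierarchy is modelled with plain dictionaries rather than parsed from XML, and the function name and all identifiers (`p1`, `c1`, `l1`, `d1`, ...) are hypothetical.

```python
def locate_tokens(pages, columns, lines, dipl_ids):
    """Map each diplomatic token id to its (page, column, line).

    pages:   page id   -> list of column ids
    columns: column id -> list of line ids
    lines:   line id   -> list of diplomatic token ids
    Raises ValueError if a token is covered by more than one line or by none.
    """
    location = {}
    for page_id, column_ids in pages.items():
        for column_id in column_ids:
            for line_id in columns[column_id]:
                for dipl_id in lines[line_id]:
                    if dipl_id in location:
                        raise ValueError(
                            f"{dipl_id} is covered by more than one line")
                    location[dipl_id] = (page_id, column_id, line_id)
    missing = [d for d in dipl_ids if d not in location]
    if missing:
        raise ValueError(f"tokens not covered by any line: {missing}")
    return location


# Hypothetical example: one page with one column and two lines.
pages = {"p1": ["c1"]}
columns = {"c1": ["l1", "l2"]}
lines = {"l1": ["d1", "d2"], "l2": ["d3"]}
location = locate_tokens(pages, columns, lines, ["d1", "d2", "d3"])
```

When the check passes, `location` holds the page, column, and line for every diplomatic token, which is the information the layout specification is meant to make derivable.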
JLCL 2016 – Band 31 (2)