Petran, Bollmann, Dipper, Klein
6 Access via ANNIS
For the public release of the corpus, it was important that different user groups’ needs
can be satisfied by a single visualization and search system. Users should be able to
make (diachronically oriented) queries that disregard variation such as different use
of diacritics, usage of long or normal ‘s’, or tokenization peculiarities. At the same
time, the transcription captures all such variation, so it was important to make it
available as well for users who want to research those aspects of our texts. The corpus
tool ANNIS⁷ (Krause and Zeldes, 2016) addresses needs such as ours by specifically
targeting the visualization of complex, multi-layer corpora. It also offers Pepper⁸, a
modular conversion infrastructure that can be leveraged to convert a number of different
formats into the native ANNIS format for easy import. Since Pepper did not originally
support Cora-XML, we developed an import module for it, which is now included in the
Pepper distribution.
In spite of its flexibility, ANNIS has a number of technical and conceptual limitations.
For technical reasons, there is a limit on the size of a corpus that can be imported into
ANNIS. The exact limit depends on the number and nature of the annotations; in our
case it amounts to around 60,000 tokens. We solved this by dividing the texts into
smaller subcorpora. Since no single criterion provided a subdivision of appropriate size
for all of its values, we used a combination of several criteria. The first subdivision
is by the century or half-century in which the texts most likely originated, such as 11-1
for the first half of the 11th century (1000–1050). All centuries are further divided
into more or less broad dialect areas, such as alem for Alemannic. Most dialects are
attested well enough to warrant further subdivision into prose (P), verse (V), and
charter (U – “Urkunde”) texts. Finally, a suffix marks whether the texts are from the
original, balanced grammar corpus (G) or from the extension (X). In this way, the
subcorpus list also allows quick access to some of the meta annotations. The texts are
further annotated with more exact and specific meta annotations that are also searchable
(Fig. 3).
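The naming scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the underscore separator, and the ordering of the name components are hypothetical and not taken from the corpus release, and the treatment of a boundary year such as 1050 is a guess, since the text only gives the range 1000–1050 for 11-1.

```python
def period_label(year):
    """Return a half-century label such as '11-1' for the given year.

    Assumption: years xx00-xx49 count as the first half of a century,
    years xx50-xx99 as the second half.
    """
    century = year // 100 + 1            # e.g. 1025 -> 11th century
    half = 1 if year % 100 < 50 else 2
    return f"{century}-{half}"


def subcorpus_name(period, dialect, origin, text_type=None):
    """Compose a subcorpus name from the criteria listed in the text.

    period:    e.g. '11-1' (half-century) or '11' (whole century)
    dialect:   e.g. 'alem' for Alemannic
    origin:    'G' (balanced grammar corpus) or 'X' (extension)
    text_type: 'P' (prose), 'V' (verse), 'U' (charter), or None when a
               dialect is not subdivided by text type

    The underscore separator and the component order are hypothetical.
    """
    parts = [period, dialect]
    if text_type is not None:
        parts.append(text_type)
    parts.append(origin)
    return "_".join(parts)


if __name__ == "__main__":
    print(subcorpus_name(period_label(1025), "alem", "G", text_type="P"))
```

Because every component of such a name is itself a piece of metadata, selecting a subcorpus from the list already acts as a coarse metadata filter, which is the effect the paragraph above describes.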
Displaying annotations in ANNIS. For the display of annotations, we chose the grid
view, which is essentially a table with flexible column sizes. It fits the structure of our
annotations, which fall into two distinct categories. Linguistic annotations, such as part
of speech or lemma, relate to word tokens in modernized tokenization. Layout-related
information, such as page or line breaks, which ANNIS also treats as annotation,
relates instead to historical tokenization (see Sec. 3). Users have to be
able to query for layout-specific information in their searches, yet displaying all layout
information in the grid would visually clutter the results. We therefore combined all
layout information on the line level; the specific higher levels remain searchable
but are not displayed in the results. The names for the annotation categories were
chosen for consistency with other existing reference corpus projects where possible (see
Sec. 1).
⁷ http://corpus-tools.org/annis/
⁸ http://corpus-tools.org/pepper/
JLCL
ReM: A reference corpus of Middle High German
Figure 3: Part of the meta annotations for the text “Augensegen” (“blessing of the eyes”). Some of
the meta annotations are important for diachronic searches; others (such as the annotators
responsible for digitization) are merely informative.
On the conceptual level, the default configuration of ANNIS assumes a single main token
layer. In our case, however, the simple surface token form already varies along two
annotation dimensions: transcription (diplomatic or simplified) and tokenization
(historical or modern). Displaying each possible combination would clutter the results
more than it would help, so we chose only two token forms for the primary text: tok_anno
and tok_dipl. tok_anno combines the modern tokenization with simplified spelling, while
tok_dipl combines historical tokenization with diplomatic spelling. These two token
variants make up the primary text, and either can be selected for display in the KWIC
view of the primary search results. Fig. 4 shows such a result for a search for the
sequence “bis tu” in modernized form.
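To make the relation between the two token forms concrete, the following Python sketch aligns both layers against a shared sequence of minimal units and lets one of them be chosen as the displayed text. It is not the ANNIS data model; the class and function names are hypothetical, and the example token (a single historical token that the modern tokenization splits in two) is invented for illustration rather than quoted from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    start: int  # first alignment unit covered (inclusive)
    end: int    # one past the last alignment unit covered (exclusive)


# Invented example data: the historical layer has one token where the
# modernized layer has two. Both refer to the same two alignment units.
LAYERS = {
    "tok_dipl": [Segment("biſtu", 0, 2)],
    "tok_anno": [Segment("bis", 0, 1), Segment("tu", 1, 2)],
}


def primary_text(layer):
    """Join the tokens of the selected layer, as a KWIC line would."""
    return " ".join(seg.text for seg in LAYERS[layer])


def overlapping(seg, layer):
    """Return the tokens of `layer` that share alignment units with `seg`."""
    return [s.text for s in LAYERS[layer] if s.start < seg.end and seg.start < s.end]


if __name__ == "__main__":
    print(primary_text("tok_anno"))
    print(primary_text("tok_dipl"))
    print(overlapping(LAYERS["tok_dipl"][0], "tok_anno"))
```

Keeping both layers anchored to common units means that switching the displayed layer never loses the correspondence between a historical token and its modernized counterparts.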
Each search result is shown in KWIC format with the currently selected tokenization
layer; the main layer can be switched between tok_dipl and tok_anno via the menu at
the top.⁹ Below the main token layer is an expandable grid table displaying the annotations.
It starts on top with layout information (“66a,2b”). Layers tok_dipl and tok_anno
contain the two textual versions, followed by the layers with linguistic information. The
layer norm contains the normalized form that closely corresponds to word forms as used
in Middle High German dictionaries (see Sec. 3). Layer tokenization contains the
information on the difference between modernized and historical tokenization. Layers
pos and posLemma give the part of speech of the token and of the type, respectively
(see Sec. 3); the layers inflectionClass and inflectionClassLemma do the same for the
inflection class. Layer punc at the bottom encodes information on punctuation marks
and segmentation.
⁹ The menu also shows the default token layer, which is empty, as it was only used to align the two
tokenization layers.
JLCL 2016 – Band 31 (2)
Figure 4: ANNIS window showing the results of a search for the sequence “bis tu” in modernized
form. Part of the subcorpus list is shown on the lower left.
Full text view. The different user groups’ needs are also taken into consideration for
the full text view. While ANNIS has a default full text view, it does not work with our
corpora, since it presumes a single main token layer. Instead, we used a feature
that generates a full text view as an HTML document by emitting any
annotation as HTML elements, which can then be styled with CSS, making the view
adaptable for both diplomatic and modernized presentation.
A diplomatic view provides a version of the document that is as close to the original
manuscript as possible. It displays all letter variation, diacritics, layout, and tokenization
unchanged, and can be used as a more readable version of the original for many purposes.
The layout levels are emitted as nested div elements, with the innermost line-level div
elements containing the tok_dipl tokens as span elements. Fig. 5 shows part of the
diplomatic view for a text.
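The shape of the resulting markup can be illustrated with a short Python sketch. It does not reproduce how ANNIS is configured to emit HTML; it only shows the target structure of nested div elements with tok_dipl tokens as span elements. The function names, the class names, the data-label attribute, and the restriction to two layout levels (page and line) are assumptions made for this example.

```python
from html import escape


def render_line(label, tokens):
    """Render one manuscript line as a div holding one span per tok_dipl token."""
    spans = " ".join(
        f'<span class="tok_dipl">{escape(tok)}</span>' for tok in tokens
    )
    return f'<div class="line" data-label="{escape(label)}">{spans}</div>'


def render_page(label, lines):
    """Render a page as a div that nests one line div per (label, tokens) pair."""
    inner = "\n".join(render_line(line_label, toks) for line_label, toks in lines)
    return f'<div class="page" data-label="{escape(label)}">\n{inner}\n</div>'


if __name__ == "__main__":
    # Placeholder tokens, not taken from the corpus.
    print(render_page("1r", [("1", ["token1", "token2"]), ("2", ["token3"])]))
```

Because each layout level and each token carries its own element, a stylesheet can decide independently how page breaks, line breaks, and individual tokens are displayed.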