ReM: A reference corpus of Middle High German
Figure 5: Diplomatic full text view of the Middle High German translation of Alkuin’s “De virtutibus
et vitiis”
The layout elements are then placed via CSS in a way that resembles the manuscript:
the larger box represents a folio page, with the left and right side representing the back
and front sides of the manuscript page. If the manuscript has multiple columns, they
are placed next to each other. The text is rendered in a Unicode version that mirrors
the original. The yellow tint provides a visual cue that the text presentation is oriented
towards the original.
The modernized view is based on the simplified transcription and modern tokenization.
It provides a quick way of accessing larger contexts, and, since it does not imitate the
original layout, the opportunity to fit the text to varying screen sizes. Fig. 6 shows part
of the modernized view of the same text.
Since the corpus in its current form only annotates boundary locations (see Sec. 3),
and not the entire sentence spans, there is no structuring information that can be used
by ANNIS’ full text view. As the absence of any structuring would hinder readability,
especially for longer texts, we used the pages and columns from the dipl structure to
emit paragraph (p) elements that contain all tok_anno tokens as spans. Unfortunately,
paragraphs therefore sometimes break up sentences, since they follow the manuscript
layout. However, since the modernized view consists only of variable-size elements, it
adapts easily to different screen and browser window sizes, as can be seen
from the downscaled browser window.
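The paragraph emission described above can be sketched as follows. This is a minimal illustration in Python, not the actual ANNIS visualizer configuration: the nested-list input format (pages containing columns containing tokens), the function name render_modernized, and the class attribute on the spans are assumptions made for this example, and the tokens are placeholders rather than corpus data.

```python
from html import escape


def render_modernized(pages):
    """Emit one <p> element per column, with each tok_anno token as a <span>.

    `pages` is a hypothetical input format: a list of pages, each page a list
    of columns, each column a list of tok_anno strings.
    """
    paragraphs = []
    for page in pages:
        for column in page:
            # Each token becomes an inline span, so the browser can reflow
            # the paragraph freely when the window size changes.
            spans = " ".join(
                f'<span class="tok_anno">{escape(tok)}</span>' for tok in column
            )
            paragraphs.append(f"<p>{spans}</p>")
    return "\n".join(paragraphs)


# Placeholder tokens: one page with two columns, one page with one column.
example_pages = [
    [["tok1", "tok2", "tok3"], ["tok4", "tok5"]],
    [["tok6"]],
]
html_out = render_modernized(example_pages)
print(html_out)
```

Because paragraph boundaries are taken from columns rather than from sentence spans, a sentence that continues across a column break ends up split over two p elements, which is the limitation noted above.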
JLCL 2016 – Band 31 (2)
Petran, Bollmann, Dipper, Klein
Figure 6: Modernized full text view of the document displayed in Fig. 5.
7 Conclusion
We presented the creation of the Reference Corpus of Middle High German (ReM) with a
focus on the compilation and annotation process and its implications for the preparation
and release of the corpus.
The ReM corpus is the product of several annotation efforts spanning about 30 years
and starting as far back as 1986 (cf. Sec. 2). This explains the use of
annotation tools, formats, and tagsets that would be considered “outdated” from a
modern point of view. We discussed the types of annotation in the final corpus and
how they were derived from the originally annotated data; e.g., creating two distinct
tokenization layers (“diplomatic” and “annotated”/“modernized”) from word boundary
markings in the transcription, or mapping the custom part-of-speech tagset to the
modern HiTS tagset (cf. Secs. 3 and 4).
By converting the corpus into an XML format (Sec. 5), we hope to make it more
accessible to existing tools and computational analyses. Access to the corpus
via the ANNIS tool (Sec. 6), on the other hand, offers an efficient way to query
and visualize the corpus data.
Acknowledgments
We would like to thank the German Research Foundation (Deutsche Forschungsgemeinschaft)
for financial support, Grants DI 1558/1, KL 472/6, WE 1318/14, WI 3664/2.
References
Bollmann, M., Petran, F., Dipper, S., and Krasselt, J. (2014). CorA: a web-based annotation
tool for historical and other non-standard language data. In Proceedings of the 8th Workshop
on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH),
pages 86–90, Gothenburg, Sweden.
Dipper, S., Donhauser, K., Klein, T., Linde, S., Müller, S., and Wegera, K.-P. (2013). HiTS:
ein Tagset für historische Sprachstufen des Deutschen. Journal for Language Technology
and Computational Linguistics, Special Issue, 28(1):85–137.
Dipper, S. and Schultz-Balluff, S. (2013). The Anselm corpus: Methods and perspectives of
a parallel aligned corpus. In Proceedings of the NODALIDA Workshop on Computational
Historical Linguistics.
Klein, T. (1991). Zur Frage der Korpusbildung und zur computergestützten grammatischen
Auswertung mittelhochdeutscher Quellen. In Wegera, K.-P., editor, Mittelhochdeutsche
Grammatik als Aufgabe, pages 3–23. E. Schmidt, Berlin.
Klein, T. (2001). Vom lemmatisierten Index zur Grammatik. In Moser, S., Stahl, P., Wegstein,
W., and Wolf, N. R., editors, Maschinelle Verarbeitung altdeutscher Texte V. Beiträge zum
Fünften Internationalen Symposion, Würzburg 4.-6. März 1997, pages 83–103. de Gruyter,
Berlin.
Klein, T. and Bumke, J. (1997). Wortindex zu hessisch-thüringischen Epen um 1200. Niemeyer,
Tübingen. Unter Mitarbeit von B. Kronsfoth und A. Mielke-Vandenhouten.
Klein, T. and Dipper, S. (2016). Handbuch zum Referenzkorpus Mittelhochdeutsch. Bochumer
Linguistische Arbeitsberichte, 19.
Klein, T., Solms, H.-J., and Wegera, K.-P., editors (2009). Mittelhochdeutsche Grammatik.
Teil III: Wortbildung. Niemeyer, Tübingen.
Krause, T. and Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and
visualization. Digital Scholarship in the Humanities, 31:118–139.
http://dsh.oxfordjournals.org/content/31/1/118.
Wegera, K.-P. (2000). Grundlagenprobleme einer neuen mittelhochdeutschen Grammatik. In
Besch, W., Betten, A., Reichmann, O., and Sonderegger, S., editors, Sprachgeschichte. Ein
Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung, volume 2, pages
1304–1320. de Gruyter, Berlin, New York, 2nd edition.