Armin Hoenen, Lela Samushia
Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketch
for Aiding Reconstruction
In the current paper, an annotated corpus of Old Georgian inscriptions is
introduced. The corpus contains 91 inscriptions which have been annotated
in the standard epigraphic XML format EpiDoc, part of the TEI. Secondly,
a prototype tool for helping epigraphic reconstruction is designed based
on the inherent needs of epigraphy. The prototype backend uses word
embeddings and frequencies generated from a corpus of Old Georgian to
determine possible gap fillers. The method is applied to the gaps in the
corpus and generates promising results. A sketch of a front end is being
designed.
1 The Old Georgian Corpus
The corpus is based on the transcriptions available on the TITUS web
thesaurus, Gippert (1995).[1] 91 inscriptions have been transcribed into
digital form and annotated. The corpus comprises Old Georgian inscriptions
written in Old Georgian Majuscule (Asomtavruli), the oldest of which are
dated to the 5th century A.D. However, some of the inscriptions stem from the
new Georgian period and are written in the modern version of the alphabet
(Mxedruli). The majority of inscriptions are building inscriptions (churches),
yet there are some gravestone inscriptions and inscribed crosses and other
objects. Of special importance for regional and national history are people
mentioned mostly on gravestones and correlated data from the inscriptions.
As Georgian has been written in three alphabets throughout its history, all
inscriptions have been transcribed into the modern version of the alphabet
in previous projects.
1.1 Corpus generation
While the corpus is online and accessible via the TITUS archive, a translation
and annotations have been newly added. Additionally, the corpus has been
transformed into the TEI format in a way conforming to the EpiDoc guidelines.
EpiDoc, according to its website,[2] is “an international, collaborative effort
that provides guidelines and tools for encoding scholarly and educational
[1] http://titus.fkidg1.uni-frankfurt.de/texte/etcg/cauc/ageo/inscr/carcera/carce.htm
[2] https://sourceforge.net/p/epidoc/wiki/About/: last accessed on 07.02.2017
JLCL 2016 – Band 31 (2) – 25-38
editions of ancient documents” which originated from an effort to publish
ancient inscriptions. In technical terms, it uses a subset of the TEI.
The Old Georgian Corpus follows the EpiDoc guidelines for the encoding of
ancient documents.
Each inscription is encoded in its own TEI XML file so that it is fully
documented on both the metadata and the textual level. The header contains
meta information such as language, alphabet, place and time of the
inscription as well as a link to its images if available on the TITUS web
thesaurus. TITUS hosts the inscriptions, which were electronically prepared
at the National Museum of Georgia for the Georgian National Corpus (GNC), of
which they are part.[3] The body of the document contains the four text
divisions typical for EpiDoc: edition, translation, commentary and
bibliography.
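
The file layout described above can be illustrated with a minimal sketch. It is not the code used to build the corpus; the function name build_skeleton, the reduced header (title only) and the language code passed in the usage line are assumptions made for illustration. Only the TEI namespace and the four division types are taken as given.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The four text divisions named in the paper.
DIV_TYPES = ["edition", "translation", "commentary", "bibliography"]


def build_skeleton(title, lang):
    """Return an empty TEI document with one div per EpiDoc text division.

    Hypothetical helper: the header is reduced to a title; a real file
    would also carry alphabet, place, date and image links.
    """
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title

    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    for div_type in DIV_TYPES:
        div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"type": div_type})
        if div_type == "edition":
            # The edition holds the inscription text, so it carries the language.
            div.set(f"{{{XML_NS}}}lang", lang)
    return tei


if __name__ == "__main__":
    # "oge" is used here as an assumed language code for Old Georgian.
    doc = build_skeleton("Example inscription", "oge")
    print(ET.tostring(doc, encoding="unicode"))
```

Keeping one file per inscription, as the corpus does, means such a skeleton can be filled and validated independently for each of the 91 inscriptions.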
Annotations are applied to the text in the modern transcription. This
transcription, which forms the TITUS base text, already included
expansions of abbreviations, fillers of gaps, most probable readings of unclear
letters, letters the scribe had omitted and so forth (the canon of epigraphic
annotation). The modern transcription thus displays one reconstructed
text version for the inscription (where reconstruction was possible) and is
consequently stored in the text division edition. In addition, each file
provides the original characters (similar to the text in majuscules in Latin),
preserving the original line breaks. A full English translation, newly
compiled and added to the corpus, is provided in a separate text division.
People, titles, places and dates have been annotated in order to
enable semantic technologies at later stages. Named entity annotation is
encoded through the tag named term, specified by its attributes type and
subtype.
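
How such annotations can be consumed is sketched below. The fragment is an invented Latin placeholder, not a text from the corpus, and it omits the TEI namespace for brevity. The elements expan, abbr, ex and supplied follow common EpiDoc usage; the attribute values "person" and "name" on term are hypothetical, as are the three function names.

```python
import xml.etree.ElementTree as ET

# Invented placeholder fragment (not corpus data), without the TEI namespace.
SAMPLE = (
    '<ab><term type="person" subtype="name">Marcus</term> '
    "<expan><abbr>imp</abbr><ex>erator</ex></expan> "
    '<supplied reason="lost">fecit</supplied></ab>'
)


def reconstructed(elem):
    """Full reading: expansions and editorial supplements are included."""
    return "".join(elem.itertext())


def diplomatic(elem):
    """Reading closer to the stone: expansions dropped, lost text shown as [...]."""
    if elem.tag == "ex":
        return ""
    if elem.tag == "supplied":
        return "[...]"
    parts = [elem.text or ""]
    for child in elem:
        parts.append(diplomatic(child))
        parts.append(child.tail or "")
    return "".join(parts)


def entities(elem):
    """Collect (type, subtype, text) for every annotated term."""
    return [
        (t.get("type"), t.get("subtype"), "".join(t.itertext()))
        for t in elem.iter("term")
    ]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(reconstructed(root))
    print(diplomatic(root))
    print(entities(root))
```

Because the reconstructed and the abbreviated readings are derived from the same annotated text, a single edition division can serve both philological and computational uses.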
Figure 1 illustrates some of the mentioned encodings. The Georgian
abbreviation tradition is especially complex and features many models, see
Boeder (1987). Contraction, the mode of abbreviating by first and last letter,
which gained prominence in the Christian era (compare for instance Driscoll
(2009)), was very prominent in Old Georgian (abbreviation 1). According to
Danelia and Sarzhveladze (2012, p. 312), the following types of abbreviation
are available in Old Georgian: the abbreviation of a word to its initial letter,
suspension, contraction and elision of vowels. Suspension is very rare and
only found on epigraphic monuments (it is not attested in manuscripts).
Unlike in manuscripts, uncommon and unfamiliar abbreviations, which are
difficult to decipher, often occur in epigraphy. Affix chains are quite
common in Old Georgian. In order not to lose the meaning,
the suffixes had to be encoded in the abbreviation, and scribes may have
had different opinions (apart from different spatial considerations) on how
to extend the contraction principle consistently in this case (abbreviation
[3] http://titus.uni-frankfurt.de/indexe.htm: last accessed on 10.02.2017, http://gnc.gov.ge/gnc/static/portal/gnc.html, http://museum.ge