Gepi
Figure 1: Examples of XML encodings in the corpus.
2 and 3). While contraction was especially important for named entities
and in particular biblical individuals and places (nomina sacra), for other
word classes other ways of abbreviation are found (abbreviation 4). We
annotated titles such as king of kings (named entities) in order to relate
inscription type to state organization and to better distinguish individuals
of the same name.
Throughout the corpus one sees that inscriptions are fragmentary, some
to such an extent that their texts cannot be fully reconstructed. On average,
an inscription had roughly 33 words, 4 gaps and 11 abbreviations.[4] Experts
on language and inscriptions have been able to provide hypotheses about
the full text of many inscriptions. However, many gaps remain. Computer-aided
assistance in reconstruction could be welcome not only for the inscriptions
already encoded, but also for a planned extension of the corpus. Since
transcription and other epigraphic work is already largely done in
digital environments, this paper asks: Can there be a tool assisting
in reconstructing the complete texts of inscriptions? What will distinguish
[4] Not all abbreviations are counted here since some cannot be read or are concealed in undeciphered gaps.
JLCL 2016 – Band 31 (2)
Hoenen, Samushia
such a tool from the traditional methods and resources such as lexica of
abbreviations, lists of historical named entities and so forth?
2 Towards a Tool: Necessities
For reconstruction, a tool in the digital medium could be designed which
assists in two important exercises of the epigrapher: expanding abbreviations
and filling gaps. For this purpose, the text of the inscription could be
represented digitally, with abbreviations and gaps marked and
filled with precomputed guesses. However, machine learning and related
techniques have seemingly not yet been applied much to epigraphy; compare
Bodel (2012). Some studies on abbreviations and word prediction in
psycholinguistics may provide interesting and relevant insights even though
they do not replicate the epigraphic context; see for instance Yang et al.
(2009); McWilliam et al. (2009); Slattery et al. (2011); Taylor (1953).
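As a rough illustration of such a digital representation, the following Python sketch marks each token as a preserved word, an abbreviation or a gap and attaches a list of precomputed guesses. It is a minimal sketch under stated assumptions, not the encoding used in the corpus: the `Token` class, the `render` function, the bracket conventions and the English placeholder tokens are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    kind: str                 # "word", "abbr" or "gap"
    surface: str = ""         # letters preserved on the stone ("" for a gap)
    guesses: List[str] = field(default_factory=list)  # precomputed, best first


def render(tokens: List[Token], fill: bool = True) -> str:
    """Return a reading text; optionally insert the top guess per slot."""
    out = []
    for t in tokens:
        if t.kind == "word":
            out.append(t.surface)
        elif fill and t.guesses:
            best = t.guesses[0]
            # Hypothetical display choice: (..) for expansions, [..] for filled gaps.
            out.append(f"({best})" if t.kind == "abbr" else f"[{best}]")
        else:
            # No guess available or filling disabled: show what survives.
            out.append(t.surface if t.kind == "abbr" else "[...]")
    return " ".join(out)


# Invented English placeholder, not corpus data.
tokens = [
    Token("abbr", "kg", ["king"]),
    Token("word", "of"),
    Token("gap", "", ["kings", "lords"]),
]
print(render(tokens))              # (king) of [kings]
print(render(tokens, fill=False))  # kg of [...]
```

Keeping the guesses as a ranked list rather than a single value leaves the final decision with the epigrapher, who can inspect the alternatives for each slot.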
In computation, feasible solutions have been found for tasks similar to
epigraphic reconstruction, such as abbreviation generation, sequence
prediction and spelling correction. However, these often rely on pretrained
statistical models which need large amounts of input data. An example is the
application of n-gram language models to sequence prediction, where Manning
and Schütze (1999) note that n-gram models usually need large amounts of
training data to be effective.[5] Even the full amount of Old Georgian
data digitally available[6] is still not large enough to perform and thoroughly
evaluate the majority of such approaches. Methodologically, there is an
additional factor complicating assessment: any gap can hold any number
of abbreviations, making gap filler generation (GFG) a more complex task
than simple sequence prediction or abbreviation generation.
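The added complexity can be made concrete with a small Python sketch that enumerates every word sequence able to fill a gap of a given letter width when each word may be written in full or abbreviated. This is an illustration under simplifying assumptions, not the method of this paper: the function `gap_fillers`, the table `forms` and its English placeholder entries are hypothetical, word separators are ignored, and no language model filters implausible sequences.

```python
from typing import Dict, List, Tuple

Filler = Tuple[Tuple[str, str], ...]  # sequence of (word, written form) pairs


def gap_fillers(gap_len: int, forms: Dict[str, List[str]],
                max_words: int = 3) -> List[Filler]:
    """Enumerate word sequences whose written forms take exactly gap_len letters."""
    results: List[Filler] = []

    def extend(prefix: List[Tuple[str, str]], remaining: int) -> None:
        if remaining == 0 and prefix:
            results.append(tuple(prefix))  # the gap is filled exactly
            return
        if len(prefix) == max_words:
            return                         # too many words, give up on this branch
        for word, written_forms in forms.items():
            for written in written_forms:
                if 0 < len(written) <= remaining:
                    extend(prefix + [(word, written)], remaining - len(written))

    extend([], gap_len)
    return results


# Invented placeholder lexicon: each word with its full and abbreviated spellings.
forms = {"king": ["king", "kg"], "of": ["of"], "kings": ["kings", "kgs"]}

short_gap = gap_fillers(4, forms)
long_gap = gap_fillers(7, forms)
print(len(short_gap), len(long_gap))
```

Even with three words and two abbreviations, a four-letter gap already admits one full word or several pairs of short forms, and the number of candidates grows quickly with the width of the gap.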
Additionally, the epigraphic record is very heterogeneous, with the easier
cases often already solved manually. In order to exemplify this heterogeneity
and thus the range a tool aiding in reconstruction
has to be able to address, we give some examples from the Georgian
inscriptions; images come from the Corpus of Old Georgian Inscriptions.[7]
At one end there are inscriptions with evidence so fragmentary that probably
no supercomputer can ever help to decipher the message; at the
other there are reconstructions so trivial that even laymen can perform
them correctly without much effort.
[5] See chapter 6 for discussion (Manning and Schütze, 1999, p. 201): "In general, four gram models do not become usable until one is training on several tens of millions of words of data."
[6] A subcorpus on Old Georgian not containing different redactions of the same texts from the TITUS server comprises roughly 4 million words.
[7] Collection under: http://titus.fkidg1.uni-frankfurt.de/texte/etcg/cauc/ageo/inscr/carcera/carce.htm
Although some letters survived, the extent and the placement of the gaps make a
complete reconstruction almost impossible.
Here, the broken-off part to the left can be reconstructed with a good level of
confidence, since each line has more surviving than missing letters and since the
number of missing letters is small. Also, there are few abbreviations.
Finally, in this example, only abbreviations of moderate difficulty have to be
expanded, which could be done by a beginner in epigraphy who knows Old Georgian.
Facing such variety and difficulties, this paper does not provide a complete,
feasible solution for GFG, which would seem unrealistic given the scope of this
article and the current landscape of computational epigraphic reconstruction.
Instead, it primarily aims at making computational scholars aware of the
inseparable interplay of abbreviation and gap which so characterizes the
epigraphic record in many epochs, regions and languages and which may
represent a new computational challenge. As a step towards a proof of concept,
however, a basic method for GFG is formulated and tested.
In order to demonstrate the utility of such a tool, we concentrate on
examples that promise to yield useful results. This is why we restrict
ourselves for the time being to gaps of single words, if possible not located
on broken-off edges, where the extent of the gap is unclear. We argue that if
we are able to provide useful guesses for these, then larger units might be
within reach for future research.
3 Method
We look for lexical matches to the gap context, comparing two methods:
purely frequency-based cues and word-embedding-based cues. Given the
formulas and the very standardized language of inscriptions, pure
frequencies and conditional frequencies (of a word given a predecessor or
follower) may be a sufficiently strong cue and could serve as a baseline.
Word embeddings (Mikolov et al., 2013a), on the other hand, can be
used for sequence prediction since their training includes an optimization
of immediate contextual similarity. Word embeddings thus capture semantic
and syntactic similarity, which makes them a possible
cue for a gap filler. Furthermore, Mikolov et al. (2013b) state that "neural
network based language models significantly outperform N-gram models";
compare Bengio et al. (2003); Mikolov et al. (2011); Schwenk (2007). In
fact, in a pre-experiment, we found an ordinary n-gram language model to
perform well only for the prediction of the content of very short gaps. In
order to generate word embeddings that we can later use for GFG, we compiled
yet another corpus of Old Georgian texts from the TITUS archive.
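The frequency-based baseline can be sketched in a few lines of Python. The sketch below counts how often a word follows a given predecessor and precedes a given follower, then ranks candidate fillers for a single-word gap. It is a minimal sketch under stated assumptions: the functions `train` and `rank_fillers`, the toy English corpus, and the choice to simply add the two counts are hypothetical and not taken from the experiments reported here; the embedding-based cue is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Counts = Dict[str, Counter]


def train(sentences: List[List[str]]) -> Tuple[Counts, Counts]:
    """Collect counts of words given their predecessor and given their follower."""
    after: Counts = defaultdict(Counter)   # after[prev][w]: w directly follows prev
    before: Counts = defaultdict(Counter)  # before[nxt][w]: w directly precedes nxt
    for sentence in sentences:
        for left, right in zip(sentence, sentence[1:]):
            after[left][right] += 1
            before[right][left] += 1
    return after, before


def rank_fillers(prev: str, nxt: str, after: Counts, before: Counts,
                 k: int = 3) -> List[str]:
    """Rank candidates for a one-word gap between prev and nxt."""
    scores: Counter = Counter()
    # Assumption: both context cues are combined by adding raw counts.
    scores.update(after.get(prev, Counter()))
    scores.update(before.get(nxt, Counter()))
    return [word for word, _ in scores.most_common(k)]


# Invented toy corpus standing in for formulaic inscription text.
corpus = [
    ["king", "of", "kings"],
    ["king", "of", "kings"],
    ["lord", "of", "lords"],
    ["king", "and", "queen"],
]
after, before = train(corpus)
print(rank_fillers("king", "kings", after, before))  # ['of', 'and']
```

With formulaic text the counts concentrate on a few fillers, which is why such a simple cue may already be a reasonable baseline; with sparse data most contexts are unseen and the ranking is empty.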
3.1 Corpus
The TITUS website provides texts for many ancient languages and is (one
of) the most comprehensive archive(s) (in close collaboration with the GNC)
for Old Georgian texts. Among the texts present are the Bible, lectionaries,
hagiographical, theological and apocryphal texts, psalms and odes, songs,
historical texts, homiletic and exegetic texts, liturgical texts, canonical law
texts, philosophical texts and for instance an astrological and a grammatical