Journal for Language Technology and Computational Linguistics


Gepi

Figure 1: Examples of XML encodings in the corpus.

2 and 3). While contraction was especially important for named entities

and in particular biblical individuals and places (nomina sacra), for other

word classes other ways of abbreviation are found (abbreviation 4). We

annotated titles such as king of kings (named entities) in order to relate

inscription type to state organization and to better distinguish individuals

of the same name.

Throughout the corpus one sees that inscriptions are fragmentary, some
to such an extent that their texts cannot be fully reconstructed. On average,
an inscription had roughly 33 words, 4 gaps and 11 abbreviations. Experts
on language and inscriptions have been able to provide hypotheses about
the full text of many inscriptions. However, many gaps remain. Some
computer-aided assistance in the reconstruction could be welcome, not only
for the already encoded inscriptions but also for a planned extension to
the corpus. Since transcription and other epigraphic work is already largely
done in digital environments, this paper asks: Can there be a tool assisting
in reconstructing the complete texts of inscriptions? What will distinguish

Not all abbreviations are counted here since some cannot be read or are concealed in undeciphered

gaps.

JLCL 2016 – Band 31 (2)


Hoenen, Samushia

such a tool from traditional methods and resources such as lexica of
abbreviations, lists of historical named entities and so forth?

2 Towards a Tool: Necessities

For reconstruction, a digital tool could be designed which assists in two
important exercises of the epigrapher: expanding abbreviations and filling
gaps. For this purpose, the text of the inscription could be represented
digitally, with abbreviations and gaps marked and filled with precomputed
guesses. However, machine learning and related techniques have seemingly
not yet been applied much to epigraphy, compare Bodel (2012). Some studies
on abbreviations and word prediction in psycholinguistics may provide
interesting and relevant insights even though they do not replicate the
epigraphic context, see for instance Yang et al. (2009); McWilliam et al.
(2009); Slattery et al. (2011); Taylor (1953).
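As an illustration of such a digital representation, the following minimal sketch models an inscription as a sequence of legible words, abbreviations and gaps, each of which can carry ranked guesses. All class and field names are hypothetical, and the English tokens are placeholders rather than corpus data; the corpus itself uses XML encodings.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Word:
    text: str  # fully legible word


@dataclass
class Abbreviation:
    written: str  # letters as they appear on the stone
    expansions: List[str] = field(default_factory=list)  # ranked guesses


@dataclass
class Gap:
    approx_letters: int  # estimated number of lost letters
    fillers: List[str] = field(default_factory=list)  # ranked guesses


Token = Union[Word, Abbreviation, Gap]


def render(tokens: List[Token], use_guesses: bool = True) -> str:
    """Print the text, optionally substituting the top-ranked guess."""
    parts = []
    for tok in tokens:
        if isinstance(tok, Word):
            parts.append(tok.text)
        elif isinstance(tok, Abbreviation):
            if use_guesses and tok.expansions:
                parts.append(tok.expansions[0])
            else:
                parts.append(tok.written + ".")
        else:  # Gap
            if use_guesses and tok.fillers:
                parts.append("[" + tok.fillers[0] + "]")
            else:
                parts.append("[" + "." * tok.approx_letters + "]")
    return " ".join(parts)


# Placeholder inscription (invented for illustration only).
inscription = [
    Word("christ"),
    Abbreviation("hv", ["have"]),
    Gap(5, ["mercy"]),
    Word("on"),
    Gap(4),  # no guess available yet
]

print(render(inscription))
print(render(inscription, use_guesses=False))
```

Keeping the guesses separate from the attested letters means that the diplomatic reading can always be recovered, whatever a tool suggests.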

In computation, feasible solutions have been found for tasks similar to
epigraphic reconstruction, such as abbreviation generation, sequence
prediction and spelling correction. However, these often rely on pretrained
statistical models which need large amounts of input data. An example is
the application of n-gram language models to sequence prediction: Manning
and Schütze (1999) note that n-gram models usually need large amounts of
training data to be effective. Even the full amount of Old Georgian data
digitally available is still not large enough to perform and thoroughly
evaluate the majority of such approaches. Methodologically, there is an
additional factor complicating assessment: any gap can hold any number
of abbreviations, making gap filler generation (GFG) a more complex task
than simple sequence prediction or abbreviation generation.
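The effect of abbreviations on the candidate space can be illustrated with a small sketch. It assumes a deliberately simplified, hypothetical contraction rule (keep only the first and last letter) and a toy English lexicon; neither is taken from the corpus. Even for a single-word gap, allowing abbreviated forms enlarges the set of words that fit a given number of missing letters.

```python
from typing import List, Tuple


def contraction(word: str) -> str:
    """Hypothetical simplification: keep only the first and last letter."""
    return word if len(word) <= 2 else word[0] + word[-1]


def candidates(lexicon: List[str], gap_len: int) -> List[Tuple[str, str]]:
    """Return (word, form) pairs whose written form fits the gap length."""
    out = []
    for word in lexicon:
        if len(word) == gap_len:
            out.append((word, "full"))
        short = contraction(word)
        if short != word and len(short) == gap_len:
            out.append((word, "contracted"))
    return out


# Toy lexicon, invented for illustration only.
lexicon = ["god", "king", "lord", "of", "to"]

fits = candidates(lexicon, gap_len=2)
full_only = [w for w, form in fits if form == "full"]
print(len(full_only), "candidates without abbreviations")
print(len(fits), "candidates once contractions are allowed")
```

With this toy lexicon, a two-letter gap admits two full words but five candidates once contracted forms are considered. A gap that may hold several words, each possibly abbreviated, grows the space further.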

Additionally, the epigraphic record is very heterogeneous, and the easier
cases have often already been solved manually. In order to exemplify this
heterogeneity, and thus the range a tool aiding in reconstruction has to be
able to address, we give some examples from the Georgian inscriptions; the
images come from the Corpus of Old Georgian Inscriptions. At one end there
are inscriptions whose evidence is so fragmentary that probably no
supercomputer can ever help to decipher the message; at the other there are
reconstructions so trivial that even laymen can perform them correctly
without much effort.

See chapter 6 for discussion (Manning and Schütze, 1999, p. 201): "In general, four gram models
do not become usable until one is training on several tens of millions of words of data."

A subcorpus on Old Georgian not containing diﬀerent redactions of the same texts from the TITUS

server comprises roughly 4 million words.

Collection under: http://titus.fkidg1.uni-frankfurt.de/texte/etcg/cauc/ageo/inscr/carcera/carce.htm


Although some letters survived, the extent and the placement of the gaps make a

complete reconstruction almost impossible.

Here, the broken-off part to the left can be reconstructed with a good level of
confidence, since each line has more surviving than missing letters and since the
number of missing letters is small. Also, there are few abbreviations.

Finally, in this example, only abbreviations of moderate difficulty have to be
expanded, which could be done by a beginner in epigraphy who knows Old Georgian.


Facing such variety and difficulty, this paper does not attempt to provide
a complete, feasible solution for GFG, which would seem unrealistic given
the scope of this article and the current landscape of computational
epigraphic reconstruction. Instead, it primarily aims at making
computational scholars aware of the inseparable interplay of abbreviation
and gap, which characterizes the epigraphic record in many epochs, regions
and languages and which may represent a new computational challenge. As a
step towards a proof of concept, however, a basic method for GFG is
formulated and tested.

In order to demonstrate the utility of such a tool, we concentrate on
examples that promise to yield useful results. This is why we restrict
ourselves for the time being to gaps of a single word, where possible not
located on broken-off edges, the extent of which is unclear. We argue that
if we are able to provide useful guesses for these, then larger units might
be within reach for future research.

3 Method

We look for lexical matches of the gap context and compare two methods:
purely frequency-based cues and word-embedding-based cues. Given the
formulaic and highly standardized language of inscriptions, pure
frequencies and conditional frequencies (of a word given its predecessor or
follower) may be a sufficiently strong cue and could serve as a baseline.
Word embeddings (Mikolov et al., 2013a), on the other hand, can be used for
sequence prediction since their training includes an optimization of
immediate contextual similarity. Because they capture semantic and
syntactic similarity, they are a possible cue for a gap filler. Furthermore,
Mikolov et al. (2013b) state that "neural network based language models
significantly outperform N-gram models", compare Bengio et al. (2003);
Mikolov et al. (2011); Schwenk (2007). In fact, in a pre-experiment, we
found an ordinary n-gram language model to perform well only for the
prediction of the content of very short gaps. In order to generate word
embeddings that we can later use for GFG, we compiled yet another corpus of
Old Georgian texts from the TITUS archive.
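The frequency-based cue can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the additive combination of predecessor and follower counts, the tie-breaking order and the toy English corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Collect word frequencies and counts conditioned on each neighbour."""
    unigrams = Counter()
    after = defaultdict(Counter)   # after[prev][word]: word follows prev
    before = defaultdict(Counter)  # before[nxt][word]: word precedes nxt
    for sent in sentences:
        unigrams.update(sent)
        for left, right in zip(sent, sent[1:]):
            after[left][right] += 1
            before[right][left] += 1
    return unigrams, after, before


def rank_fillers(model, prev=None, nxt=None, top=3):
    """Rank single-word gap fillers by counts given the known neighbours."""
    unigrams, after, before = model
    scores = Counter()
    if prev is not None:
        scores.update(after[prev])
    if nxt is not None:
        scores.update(before[nxt])
    if not scores:
        scores = unigrams  # no usable context: fall back to pure frequency
    # Sort by score, then overall frequency, then alphabetically (assumption).
    ranked = sorted(scores, key=lambda w: (-scores[w], -unigrams[w], w))
    return ranked[:top]


# Toy formulaic corpus, invented for illustration only.
corpus = [
    ["christ", "have", "mercy", "on", "george"],
    ["christ", "have", "mercy", "on", "david"],
    ["christ", "exalt", "king", "david"],
]
model = train(corpus)

print(rank_fillers(model, prev="have", nxt="on"))
print(rank_fillers(model, prev="christ"))
```

In formulaic text, the neighbours of a gap often narrow the candidates to very few words, which is why such a simple count-based ranking is a plausible baseline against which embedding-based cues can be compared.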

3.1 Corpus

The TITUS website provides texts for many ancient languages and is (one

of) the most comprehensive archive(s) (in close collaboration with the GNC)

for Old Georgian text. Among the texts present are the Bible, lectionaries,

hagiographical, theological and apocryphal texts, psalms and odes, song,

historical texts, homiletic and exegetic texts, liturgical texts, canonical law

texts, philosophical texts and for instance an astrological and a grammatical
