text. Texts are often translated (from ancient Greek, Syriac and Armenian).
For details, see the website.
These texts are thus of different genres than inscriptions, but the language
stage is essentially the same. We extracted a subcorpus of roughly 4 million
words, omitting for instance critical apparatuses and differing redactions and
retaining only the critical text. Punctuation, where present, has been
separated from the tokens, and the tokens were arranged so that each line
holds approximately one sentence, as is usual for word-embedding training.
We do not entirely exclude the presence of noise. From the corpus, word2vec
generated roughly 230,000 vectors for the wordforms in the corpus.
3.2 Approach
Each inscription was processed: an inscription-internal gap was detected,
and the following mechanism tried to generate a filler.[8] First, the context
of the gap was extracted: ignoring any space within the gap(s), the
continuous context to the left and right of the current gap up to the
next/previous space character was taken. If a subsequent gap was directly
adjacent, there would be more than one gap in such a "word". For instance,
same[bis]a[j] was captured this way. Square brackets mark gaps; the letters
within are reconstructed; samebisaj means 'from Trinity'. This was then
converted to a regex by simply substituting each letter of the gap with a
placeholder: (same...a.).[9]
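The conversion just described can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original implementation; the function name and the toy lexicon are ours:

```python
import re

def gap_to_regex(gappy_word):
    # Substitute every letter inside a square-bracket gap with a
    # single-character wildcard; text outside the gaps is kept verbatim.
    pattern = re.sub(r"\[([^\]]*)\]",
                     lambda m: "." * len(m.group(1)),
                     gappy_word)
    return re.compile("^" + pattern + "$")

rx = gap_to_regex("same[bis]a[j]")   # pattern: ^same...a.$
# Match the pattern against a (toy) lexicon of corpus wordforms.
lexicon = ["samebisaj", "samisaj", "samebisajta"]
candidates = [w for w in lexicon if rx.match(w)]  # ['samebisaj']
```

Anchoring the pattern with `^` and `$` ensures a candidate has exactly the attested number of letters, mirroring the fixed breadth of the gap.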
The regex was then used to match all candidates conforming to this
pattern in the database of words of the Old Georgian corpus, from which
the word embeddings have also been generated.[10][11]
The outcome was a
list of candidate fillers. However, depending on the extent and position
of the gap, the number of fillers could easily become large. For an aid to
reconstruction, confronting the reconstructor with a large number of tokens,
half of which are probably quite unlikely, would not be satisfactory.
Therefore, we tried different cues for ranking candidates.
Each candidate receives three values: firstly, the cosine vector similarity to
the word vector of the previous word, if this word is in the lexicon (in the
[8] For the time being, gaps at the beginning or end of lines were left aside, since their extent may
be hard to estimate and validate, while the mechanism elaborated here is moreover based on gap
breadth information.
[9] A more sophisticated approach would be to use the true breadth of the gap if annotated in absolute
numbers. One could then assign a typical breadth to each letter and check whether fillers are suitable
for the gap at hand: a possible filler in its most condensed form should not be longer than the
gap, and its fully spelled-out form not shorter. Generating the most condensed or gap-matching
form would pertain to abbreviation generation; one could for instance take the first letter of each
word.
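The breadth check suggested in this footnote could look as follows. This is only a sketch under the idealizing assumption that every letter occupies one unit of width; the function names are ours:

```python
def condensed_form(filler):
    # Most condensed form: abbreviate each word to its first letter.
    return "".join(word[0] for word in filler.split())

def fits_gap(filler, gap_breadth):
    # A filler is suitable if its most condensed form is not longer
    # than the gap and its fully spelled-out form is not shorter.
    return len(condensed_form(filler)) <= gap_breadth <= len(filler)

fits_gap("publio", 4)   # True: 1 <= 4 <= 6
fits_gap("pub", 6)      # False: the full form is shorter than the gap
```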
[10] For training, we used the default settings apart from the minCount parameter, which we set to 1:
since the corpus is not huge, this lets us capture hapax legomena and significantly enlarges the
embedding vocabulary.
[11] Neo4j was our database system, accessed via Java.
JLCL 2016 – Band 31 (2)
Hoenen, Samushia
Old Georgian corpus) and not gappy; secondly, the same for the following
word; and thirdly, the filler's frequency in the Old Georgian corpus (in
which it must occur, since it has been extracted from there). From these
values, we generate a weight for the candidates. This enables us to sort
the fillers and so limit the number of candidates offered to the
reconstructor to a number he/she may deem useful, such as the top 10.
However, since the weight may be the same for several candidates, we allow
the limit to be exceeded and include all candidates with a weight greater
than or equal to that of the tenth candidate.
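The thresholding step just described (top k, extended to cover ties) can be sketched as follows; the function name is ours and the weights here are simply corpus frequencies:

```python
def top_fillers(weights, k=10):
    # Sort candidates by descending weight, then extend the top-k list
    # with every candidate whose weight ties with the k-th one.
    ranked = sorted(weights, key=weights.get, reverse=True)
    if len(ranked) <= k:
        return ranked
    cutoff = weights[ranked[k - 1]]
    return [w for w in ranked if weights[w] >= cutoff]

weights = {"a": 5, "b": 3, "c": 3, "d": 1}
top_fillers(weights, k=2)  # ['a', 'b', 'c'] -- 'c' ties with 'b'
```

Allowing ties to exceed the limit avoids an arbitrary choice among equally weighted candidates.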
3.3 Results
In the Old Georgian corpus, out of 65 gappy "words" overall, no fillers were
retrieved for 25, whereas 26 of the 40 filler sets contained the correct filler.
Results are encouraging: the correct filler was generated at a ratio of 0.65,
decreasing to 0.6 when limiting the output to the top weights as described
above, using frequency as the weighting cue. Recall was 0.62.[12]
The average number of
top fillers generated was roughly 7, which is not too confusing in terms of
overview. Limiting to the top ranks had another effect: the Damerau-Levenshtein
distance (Damerau, 1964) of the fillers to the correct solution decreased for
more than half of them to 4.21, which shows that even if the correct filler is
not included, a moderately similar or similar word is likely to be among the
top fillers. Using the word-embedding cues (falling back to frequency only
when neither the previous nor the next word was present in the embedding
lexicon) deteriorated results. Taking the similarity to
the last word if present (otherwise frequency) resulted in a precision of 0.525;
taking the similarity to the next word if present (otherwise frequency) resulted
in a precision of 0.5; and combinations, such as the average of the similarities
of the last and next words if both were present, that value if only one was
present, and frequency only if neither was present, were still worse at 0.475.
The fillers correctly captured via the embeddings, however, largely coincided
with, but were not a subset of, those captured by frequency.
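The combined cue (neighbour similarity with a frequency fallback) can be sketched as follows. The names and the toy two-dimensional vectors are illustrative stand-ins for the real word2vec embeddings:

```python
from math import sqrt

def cosine(u, v):
    # Plain cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def weight(cand, prev_word, next_word, vectors, freq):
    # Average the cosine similarity to whichever neighbours are in the
    # embedding lexicon; fall back to corpus frequency if neither is.
    sims = [cosine(vectors[cand], vectors[w])
            for w in (prev_word, next_word)
            if w in vectors and cand in vectors]
    return sum(sims) / len(sims) if sims else freq.get(cand, 0)

vectors = {"rex": [1.0, 0.0], "regina": [1.0, 0.2]}
weight("rex", "regina", "ignotum", vectors, {})           # similarity to 'regina' only
weight("ignotum", "nec", "nec", vectors, {"ignotum": 7})  # falls back to frequency
```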
3.4 Discussion and Post Experiment
Frequency is plainly connected with probability through bare counts, while
word embeddings capture syntagmatic and paradigmatic similarity. Similarities
to previous and following words performed at an almost equal level. One
reason for the reduced performance with respect to frequency when using only
the immediately adjacent neighbours (the larger the context, the more probable
the occurrence of a gap or abbreviation within it) could be the
[12] Using fewer dimensions (10) improved the result only marginally, lowering the average rank at
which the correct filler was to be found.
nature of language, namely the dichotomy between high-frequency function
words and content words. For the former, naturally many more neighbours
exist in a training text, which may make their vectors less specific and in
turn less reliable as ranking cues.
However, the amount of data tested on is not sufficient to conclude
anything. Consequently, we tested the same method on 1,000 inscriptions
from a Latin database for inscriptions, the Epigraphic Database
Heidelberg.[13] The text databases we used for computing word embeddings
and extracting the frequency lexicon were the Latin Wikipedia[14] and the
classical texts of the Packard Humanities Institute.[15] We found the same
pattern as in Old Georgian, albeit with lower recall and precision.
Frequency alone was
the best cue. More research may shed light on the true reasons behind this
pattern.
For the Latin dataset, another approach is feasible. A preliminary attempt
is described and first results are given in what follows, as an outlook on
future elaboration. Since there are more than 70,000 inscriptions, it
makes sense to produce for instance 10 chunks of equal size (in terms
of numbers of inscriptions). Then, for gaps in any one chunk (representing
the unreconstructed inscriptions), one can extract context and use pattern
search in the 9 training chunks (representing the already reconstructed
inscriptions). Since inscriptions are highly stereotypical, this may lead to
good results. To test this assumption, in a small follow-up on Latin, we
extracted the context, this time regardless of spaces, until the next/previous
gap, and then matched the resulting pattern left_context.+right_context
against the inscriptions in the 9 held-out chunks. The matches were checked
for suitable length given the gap breadth. As described above, the most
condensed form (each word abbreviated by its first letter) should not be
significantly longer, and the fully spelled-out form not significantly shorter,
than the space the gap offers. For each gap, we decreased the context size by
one character on each side and repeated the matching until the context
consisted of one character only. The matches (or fillers) were weighted by the
length of the context at which they had been matched and by the frequency of
the match: Σ_{i=1}^{n}(|left_context_i| + |right_context_i|) for n matches.
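The shrinking-context search and the weighting just described can be sketched as follows. This is a simplification: the gap-breadth length check is omitted here, and the function name is ours:

```python
import re
from collections import defaultdict

def shrinking_context_fillers(left, right, held_out_texts):
    # Shrink the context by one character on each side per round and
    # accumulate, per filler, the total context length at which it matched.
    weights = defaultdict(int)
    l, r = left, right
    while l or r:
        pattern = re.compile(re.escape(l) + r"(.+?)" + re.escape(r))
        for text in held_out_texts:
            for m in pattern.finditer(text):
                weights[m.group(1)] += len(l) + len(r)
        l, r = l[1:], r[:-1]  # drop the outermost character on each side
    return dict(weights)

fillers = shrinking_context_fillers("abc", "de", ["abcXYde"])
# 'XY' accumulates weight 5 (full context) + 3 (shrunk context) = 8
```

Note that once one side of the context is empty, the lazy `(.+?)` can also pick up spurious very short fillers; in the method described above, the gap-breadth check filters such candidates out.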
Here, we found a recall of 0.33, with the correct filler being present at a
rate of 0.46 in the filler sets, whilst at a rate of 0.2 the correct filler was
among the top 10 fillers. The average DL distance of the top fillers was 3.96
for those filler sets where the correct match was not present. The highest
ratios of correct matches per context length were achieved with the longest
and most balanced contexts, but length was a better cue than balance. To
exemplify, a context of 5 characters to the left and 5 characters to the right
is in total
[13] http://edh-www.adw.uni-heidelberg.de/
[14] https://la.wikipedia.org, last accessed on 16.12.2015
[15] http://latin.packhum.org, last accessed on 09.12.2015
Figure 2:
Simple front-end example. The slightly transformed original transcription is visible in the first line.
For each word, the user is either provided with a drop-down list restricted to the most probable
automatically generated fillers or can choose to edit the gap filler manually. Abbreviations can be
collapsed or expanded to support the imagination of an original in the reconstructive process. A
sortable table at the bottom informs him/her of all possibilities, which can be considerably more than
in the thresholded drop-down menu and which contains additional information.
a 10-character context, but these contexts captured relatively fewer correct
fillers than contexts of 0 characters to the left but 9 to the right. It seems
that the longer a match in a continuous context, the better the cue.
4 User Interface
For the development of an "EpigraphyHelper", a user front end would have
to be set up. A sketch of this has been done using a platform-independent
HTML/Javascript solution which provides the most probable fillers in a
drop-down container, see Figures 2 and 3. Future design and usability of this
rendering should be made the subject of an online survey among domain experts.
The front end, once finalized, is completely independent of the technical
backend, which is to say that the current method of generating gap fillers
can be exchanged as soon as more effective methods are available.

The front end has several features. Firstly, the original transcription
is presented on top, giving the epigrapher the context he/she habitually
encounters. Then, per line, each word is rendered either as non-editable
text if readable as such on the inscription ('ex votu posuit' in the example), or
Figure 3:
More complex example. A separate line per word is assumed. Gaps filled by previous scholars as most
probable reconstructions are editable. Visible and reconstructed abbreviations can be collapsed and
expanded; they are marked differently.
with an expanded abbreviation, where the expansion is rendered in red
italics ('Valeria' in the example), or, for each word reconstructed within
a gap, an editable textfield appears with a yellow background, where
abbreviations are marked by slashes (P/ublio/ in the example). Abbreviations
can be collapsed and expanded via a button. Finally, for gaps which have not
been reconstructed, the algorithm computes candidates as described above
and displays them in a drop-down list (Argivo in the example). Following
Shneiderman's principle (Shneiderman, 1996), only on demand can the user
obtain a sortable table with many more possibilities and additional
annotations for the words. If none of the proposed fillers is deemed correct,
the user can activate a 'Customize Input' button and transform the drop-down
into an editable textfield.
5 Future Work and Experimentation
Of course, epigraphers have tried hard and succeeded well at reconstructing
inscriptions, internalizing both abbreviation and text completion and
connecting this with typical functional epigraphic formulae, historical
events, and individuals. The frustration of not being able to decipher the
message of certain inscriptions is probably a well-known feeling for
epigraphers, and each one may have found his/her own way to deal with this
issue. An application of AI to epigraphy should therefore not pretend to be
a remedy for this frustration, since it is clear that a too fragmentary
inscription cannot be reasonably reconstructed. Yet, since the capacity of
the human brain to keep in mind all relevant words, names and orthographic
variants (and in consequence all possible reconstructions) is limited in
comparison with a computer, a reconstruction aid may in the best case find
reasonable fillers for some of the not yet reconstructed gaps which had
escaped the attention of previous reconstructors. Especially in the case of
Named Entities, a vast array of possibilities exists.
Furthermore, unreasonable candidates produced by such a system can be
discarded by a human expert in a matter of seconds, leaving the
technologically open user with a positive net outcome. One crucial question
for an application of AI to epigraphy will be the rate at which good guesses
can be produced. For assessing this question, databases such as the
Epigraphic Database Heidelberg or the database Clauss/Slaby[16] may be seen
as benchmark datasets which will enable computer scientists to evaluate their
approaches against the reconstructions already conducted.
[16] http://www.manfredclauss.de/
6 Conclusion
A corpus of Old Georgian inscriptions has been compiled. Additionally,
a tool for epigraphic reconstruction has been sketched in order to raise
awareness in the computer science community that such a task exists,
that data sets for its evaluation exist, and that the task is an interesting
computational challenge involving both abbreviation resolution or generation
and sequence prediction. To this end, we have only been able to show that
in the case of Old Georgian, thanks to a large resource of Old Georgian texts
from the internet, a reconstruction aid can produce on average 7 fillers for
roughly 60% of gaps, with 60% of filler sets containing the correct solution.
We hope for more general results and solutions in the future.
References
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language
model. Journal of machine learning research, 3(Feb):1137–1155.
Bodel, J. (2012). Latin Epigraphy and the IT Revolution. Proceedings of the British Academy,
177:275 – 296.
Boeder, W. (1987). Versuch einer sprachwissenschaftlichen Interpretation der altgeorgischen
Abkürzungen. Revue des études géorgiennes et caucasiennes, 3:33 – 81.
Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors.
Communications of the ACM, 7:171–176.
Danelia, K. and Sarzhveladze, Z. (2012). Kartuli p’aleograpia [Georgian Paleography]. Nekeri.
Driscoll, M. (2009). Marking up abbreviations in old norse-icelandic manuscripts. In Medieval
Texts–Contemporary Media. Ibis.
Gippert, J. (1995). Titus. das projekt eines indogermanistischen thesaurus ("titus. the project
of an indo-european thesaurus"). LDV-Forum, 12(2):35–47.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge, MA, USA.
McWilliam, L., Schepman, A., and Rodway, P. (2009). The linguistic status of text message
abbreviations: An exploration using a stroop task. Computers in Human Behavior, 25(4):970–
974.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Černocký, J. (2011). Empirical
evaluation and combination of advanced language modeling techniques. In Twelfth Annual
Conference of the International Speech Communication Association.
Schwenk, H. (2007). Continuous space language models. Computer Speech & Language,
21(3):492–518.
Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information
visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, VL ’96,
pages 336–, Washington, DC, USA. IEEE Computer Society.
Slattery, T. J., Schotter, E. R., Berry, R. W., and Rayner, K. (2011). Parafoveal and foveal
processing of abbreviations during eye fixations in reading: making a case for case. Journal
of Experimental Psychology: Learning, Memory, and Cognition, 37(4):1022.
Taylor, W. (1953). Cloze procedure: A new tool for measuring readability. Journalism
Quarterly, 30:415–433.
Yang, D., Pan, Y.-c., and Furui, S. (2009). Automatic chinese abbreviation generation using
conditional random field. In Proceedings of Human Language Technologies: The 2009 Annual
Conference of the North American Chapter of the Association for Computational Linguistics,
Companion Volume: Short Papers, pages 273–276. Association for Computational Linguistics.
Author Index
Marcel Bollmann
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Universitätsstr. 150, 44801 Bochum, Germany
https://marcel.bollmann.me/
bollmann@linguistics.rub.de
Stefanie Dipper
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Universitätsstr. 150, 44801 Bochum, Germany
https://www.linguistics.rub.de/~dipper
dipper@linguistics.rub.de
Armin Hoenen
CEDIFOR
Institut für Empirische Sprachwissenschaft
Johann Wolfgang Goethe-Universität Frankfurt
Senckenberganlage 31 (Juridicum), 60325 Frankfurt am Main,
Germany
https://hucompute.org/team/armin-hoenen/
hoenen@em.uni-frankfurt.de
Thomas Klein
Institut für Germanistik und Vergleichende Literaturwissenschaft
Universität Bonn
Am Hofgarten 22, 53113 Bonn, Germany
https://www.germanistik.uni-bonn.de/institut/abteilungen/
germanistische-linguistik/abteilung/personal/klein_thomas
thomas.klein@uni-bonn.de
Roland Mittmann
Institut für Empirische Sprachwissenschaft
Johann Wolfgang Goethe-Universität Frankfurt
Senckenberganlage 31 (Juridicum), 60325 Frankfurt am Main,
Germany
http://titus.uni-frankfurt.de/personal/mittmann.htm
mittmann@em.uni-frankfurt.de
Florian Petran
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Universitätsstr. 150, 44801 Bochum, Germany
https://www.linguistics.rub.de/~petran/
florian.petran@gmail.com
Lela Samushia
Institut für Empirische Sprachwissenschaft
Johann Wolfgang Goethe-Universität Frankfurt
Senckenberganlage 31 (Juridicum), 60325 Frankfurt am Main,
Germany
samushia@em.uni-frankfurt.de