Journal for Language Technology and Computational Linguistics


Gepi

text. Texts are often translated (from ancient Greek, Syriac and Armenian). For details, see the website. These texts are thus of different genres than the inscriptions, but the language stage is essentially the same. We extracted a subcorpus of roughly 4 million words, omitting, for instance, critical apparatuses and differing redactions and keeping only the critical text. Punctuation, where present, was separated from the tokens, and the tokens were arranged so that each line approximately corresponds to one sentence, as is usual for word embedding training. We cannot entirely exclude the presence of noise. From this corpus, word2vec generated roughly 230,000 vectors for the wordforms in the corpus.

3.2 Approach

Each inscription was processed in turn. Whenever an inscription-internal gap was detected, the following mechanism tried to generate a filler. First, the context of the gap was extracted: ignoring any space within the gap(s), we extracted the continuous context to the left and right of the current gap up to the previous/next space character. If a subsequent gap was directly adjacent, such a "word" could contain more than one gap. For instance, same[bis]a[j] was captured in this way. Square brackets mark gaps and the letters within them are reconstructed; samebisaj means 'from Trinity'. This string was then converted to a regex by simply substituting each letter of a gap by a placeholder: (same...a.).
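The conversion from a gappy transcription to a search pattern can be illustrated with a minimal Python sketch. The function name `gap_to_regex` is hypothetical, and the sketch assumes that every reconstructed letter inside square brackets corresponds to exactly one character of the transliteration; it is not the implementation used for the experiments (which accessed Neo4j via Java).

```python
import re


def gap_to_regex(gappy_word: str) -> str:
    """Turn a transcription such as 'same[bis]a[j]' into the pattern 'same...a.'."""
    parts = []
    # Split the string into bracketed gaps and the readable text between them.
    for token in re.findall(r"\[[^\]]*\]|[^\[\]]+", gappy_word):
        if token.startswith("["):
            # One placeholder per letter of the gap; spaces inside a gap are ignored.
            letters = token[1:-1].replace(" ", "")
            parts.append("." * len(letters))
        else:
            # Readable letters must match literally.
            parts.append(re.escape(token))
    return "".join(parts)


pattern = gap_to_regex("same[bis]a[j]")
print(pattern)
```

The resulting pattern can then be applied with `re.fullmatch` so that only wordforms of exactly the right length are accepted.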

The regex was then used to match all candidates conforming to this pattern in the database of words of the Old Georgian corpus, from which the word embeddings had also been generated. The outcome was a list of candidate fillers. However, depending on the extent and position of the gap, the number of fillers could easily become large. For a reconstruction aid, confronting the reconstructor with a large number of tokens, half of which are probably quite unlikely, is not satisfactory. Therefore, we tried different cues for ranking candidates. Each candidate receives three values: first, the cosine similarity to the word vector of the previous word, if this word is in the lexicon (in the Old Georgian corpus) and not gappy; second, the same for the following word; and third, the filler's frequency in the Old Georgian corpus (in which it must occur since it has been extracted from there). From these values, we generate a weight for each candidate. This allows us to sort the fillers and thus limit the number of candidates offered to the reconstructor to a number he/she may deem useful, for instance the top 10. However, since several candidates may have the same weight, we allow the limit to be exceeded and include all candidates with a weight larger than or equal to that of the tenth candidate.

For the time being, gaps at the beginning or end of lines were left aside since their extent may be hard to estimate and validate, while the mechanism elaborated is based, among other things, on gap breadth information.

A more sophisticated approach would be to use the true breadth of the gap if annotated in absolute numbers. One could then assign a typical breadth to each letter and check if fillers are suitable for the gap at hand. A possible filler in its most condensed form should not be longer than the gap and its fully spelled out form not shorter. The way in which to generate the most condensed or gap matching form would pertain to abbreviation generation. One could for instance take the first letter of each word.

For training, we used the default settings apart from the minCount feature, which we set to 1 since the corpus is not huge; in this way, we capture hapax legomena and significantly enlarge the embedding vocabulary.

Neo4j was our database system, accessed via Java.

JLCL 2016 – Band 31 (2)

Hoenen, Samushia
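The candidate retrieval and the tie-aware cutoff described in Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon is a plain Python dictionary rather than the Neo4j database, the function names `match_candidates` and `top_fillers` are hypothetical, the weight is simply the corpus frequency, and the toy wordforms other than samebisaj, as well as all counts, are invented.

```python
from __future__ import annotations

import re


def match_candidates(pattern: str, lexicon: dict[str, int]) -> dict[str, int]:
    """Return all lexicon wordforms that fully match the gap pattern, with their frequency."""
    compiled = re.compile(pattern)
    return {word: freq for word, freq in lexicon.items() if compiled.fullmatch(word)}


def top_fillers(weights: dict[str, float], limit: int = 10) -> list[tuple[str, float]]:
    """Sort by weight and keep the top `limit`, plus everything tied with the last kept one."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= limit:
        return ranked
    cutoff = ranked[limit - 1][1]
    # The limit may be exceeded: all candidates with weight >= cutoff are kept.
    return [(word, weight) for word, weight in ranked if weight >= cutoff]


# Toy lexicon (invented strings and counts, for illustration only).
lexicon = {"samebisaj": 12, "samezzzaz": 3, "sameba": 40}
candidates = match_candidates("same...a.", lexicon)
print(top_fillers(candidates, limit=1))
```

Keeping ties means the reconstructor never sees an arbitrary cut between equally weighted candidates, at the cost of occasionally longer lists.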

3.3 Results

In the Old Georgian corpus, out of 65 gappy "words" overall, no fillers were retrieved for 25, whereas 26 of the 40 filler sets contained the correct filler. The results are encouraging: the correct filler was generated at a ratio of 0.65 (26 of 40), decreasing to 0.6 when limiting the output to the top weights as described above, using frequency as the weighting cue. Recall was 0.62. The average number of top fillers generated was roughly 7, which is still easy to survey. Limiting to the top ranks had another effect: the Damerau-Levenshtein distance (Damerau, 1964) of the fillers to the correct solution decreased by more than half to 4.21, which shows that even if the correct filler is not included, the top fillers are not unlikely to contain a similar or moderately similar word. Using the word embedding cues, with frequency only as a fallback in case neither the previous nor the next word was present in the word embedding lexicon, deteriorated results. Taking the similarity to the previous word if present (otherwise frequency) resulted in a precision of 0.525, taking the similarity to the next word if present (otherwise frequency) resulted in a precision of 0.5, and a combination (the average of the similarities to the previous and next words if both were present, the single available value if only one of them was present, and frequency only if neither was present) was still worse at 0.475. The fillers correctly captured with the embeddings largely coincided with, but were not a subset of, the ones captured by frequency.
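The combined cue evaluated last in Section 3.3 (average of both similarities, the single available similarity, or frequency as a fallback) can be sketched in Python. The names `cosine` and `combined_weight`, the dictionary-based vector store, and the toy vectors are assumptions made for illustration; the actual vectors came from word2vec.

```python
from __future__ import annotations

import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equally long vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_weight(candidate: str, prev_word: str | None, next_word: str | None,
                    vectors: dict[str, list[float]], frequency: dict[str, int]) -> float:
    """Average the available neighbour similarities; fall back to frequency if none exists."""
    sims = []
    for neighbour in (prev_word, next_word):
        # A neighbour is unusable if it is gappy (passed as None) or not in the lexicon.
        if neighbour in vectors and candidate in vectors:
            sims.append(cosine(vectors[candidate], vectors[neighbour]))
    if sims:
        return sum(sims) / len(sims)
    return float(frequency[candidate])


# Toy data, invented for illustration.
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
frequency = {"a": 5}
print(combined_weight("a", "b", "c", vectors, frequency))
```

Since all candidates for one gap share the same neighbours, either every candidate of that gap is weighted by similarity or every candidate is weighted by frequency, so the two scales are never mixed within one ranking.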

3.4 Discussion and Post Experiment

Frequency is directly connected with probability through bare counts, while word embeddings capture syntagmatic and paradigmatic similarity. Similarities to the previous and to the following word performed at an almost equal level. We used only the immediately adjacent neighbours, since the larger the context, the more probable the occurrence of a gap or abbreviation within it. One reason for the reduced performance of these cues with respect to frequency could be the nature of language, namely the dichotomy between high-frequency function words and content words. For the former, naturally many more neighbours exist in a training text, which may make their vectors less specific and in turn less reliable ranking cues.

Using fewer dimensions (10) improved the result only marginally, by lowering the average rank at which the correct filler was to be found.

However, the amount of data tested on is not sufficient to draw any conclusions. Consequently, we tested the same method on 1,000 inscriptions from a Latin inscription database, the Epigraphic Database Heidelberg. The text sources we used for computing word embeddings and extracting the frequency lexicon were the Latin Wikipedia and the classical texts of the Packard Humanities Institute. We found the same pattern as in Old Georgian, albeit with lower recall and precision. Frequency alone was the best cue. More research may shed light on the true reasons behind this pattern.

For the Latin dataset, another approach is feasible. A preliminary attempt and first results are described in what follows as an outlook on future elaboration. Since there are more than 70,000 inscriptions, it makes sense to produce, for instance, 10 chunks of equal size (in terms of number of inscriptions). Then, for gaps in any one chunk, representing the unreconstructed inscriptions, one can extract the context and use pattern search in the 9 training chunks, representing the inscriptions reconstructed so far. Since inscriptions are highly stereotypical, this may lead to good results. To test this assumption, in a small follow-up on Latin, we extracted the context, this time regardless of spaces, up to the next/previous gap and then matched the resulting pattern left_context.+right_context against the inscriptions in the 9 training chunks. The matches were checked for suitable length given the gap breadth. As described above, the most condensed form (each word abbreviated by its first letter) should not be significantly longer, and the fully spelled out form not significantly shorter, than the space the gap offers. For each gap, we decreased the context size by one character on each side and repeated the matching until the context consisted of one character only. The matches (or fillers) were weighted for the length of the context at which they had been matched and for the frequency of the match ($\sum_{i}^{n} |left\_context| + |right\_context|$ for $n$ matches).

Here, we found a recall of 0.33, with the correct filler being present in the filler sets at a rate of 0.46, whilst at a rate of 0.2 the correct filler was among the top 10 fillers. The average Damerau-Levenshtein (DL) distance of the top fillers was 3.96 for those filler sets in which the correct match was not present. The highest ratios of correct matches per context length were achieved with the longest contexts and with balanced contexts, but length was a better cue than balance. To exemplify, a context of 5 characters to the left and 5 characters to the right is in total a 10-character context, but such contexts captured relatively fewer correct fillers than contexts of 0 characters to the left but 9 to the right. It seems that the longer a match in a continuous context, the better the cue.

http://edh-www.adw.uni-heidelberg.de/

https://la.wikipedia.org: last accessed on 16.12.2015

http://latin.packhum.org: last accessed on 09.12.2015

Figure 2: Simple front end example. The slightly transformed original transcription is visible in the first line. For each word, the user is either provided with a dropdown list restricted to the most probable automatically generated fillers or can choose to edit the gap filler manually. Abbreviations can be collapsed or expanded to help imagine the original during the reconstructive process. A sortable table at the bottom informs him/her of all possibilities, which can be considerably more numerous than in the thresholded dropdown menu, and contains additional information.
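The Damerau-Levenshtein (DL) distance reported in Sections 3.3 and 3.4 can be computed along the following lines. The text does not state which variant was used, so this minimal Python sketch assumes the common restricted variant (optimal string alignment), in which insertions, deletions, substitutions and transpositions of adjacent characters each cost 1; `mean_distance` is a hypothetical helper for averaging over a filler set.

```python
from __future__ import annotations


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def mean_distance(fillers: list[str], solution: str) -> float:
    """Average distance of a non-empty filler set to the correct solution."""
    return sum(damerau_levenshtein(f, solution) for f in fillers) / len(fillers)


print(damerau_levenshtein("samebisaj", "samebisaj"))
```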

4 User Interface

For the development of an "EpigraphyHelper", a user front end would have to be set up. A sketch of this has been made using a platform-independent HTML/JavaScript solution which provides the most probable fillers in a drop-down container, see Figures 2 and 3. The future design and usability of this rendering should be made the subject of an online survey among domain experts. The front end, once finalized, is completely independent of the technical backend, which is to say that the current method of generating gap fillers can be exchanged as soon as more effective methods are available.

The front end has several features. First, the original transcription is presented on top, giving the epigrapher the context he/she habitually encounters. Then, line by line, each word is rendered in one of the following ways: as non-changeable text if it is readable as such on the inscription ('ex votu posuit' in the example); with an expanded abbreviation, where the expansion is rendered in red and italics (Valeria in the example); or, for each word which was reconstructed within a gap, as an editable text field with yellow background, where abbreviations are marked by slashes (P/ublio/ in the example). Abbreviations can be collapsed and expanded via a button. Finally, for gaps which have not been reconstructed, the algorithm computes candidates as described above and displays them in a drop-down list (Argivo in the example). Following Shneiderman's principle (Shneiderman, 1996), the user obtains a sortable table with many more possibilities and additional annotations for the words only on demand. If none of the proposed fillers is deemed correct, the user can activate a 'Customize Input' button and transform the drop-down into an editable text field.

Figure 3: More complex example. A separate line is assumed per word. Gaps filled by previous scholars with their most probable reconstructions are editable. Visible and reconstructed abbreviations can be collapsed and expanded. They are marked differently.

5 Future Work and Experimentation

Of course, epigraphers have tried hard and succeeded well in reconstructing inscriptions, internalizing both abbreviation and text completion and connecting this with typical functional epigraphic formulae and with historical events and individuals. The frustration of not being able to decipher the message of certain inscriptions is probably a well-known feeling for epigraphers, and each one may have found his/her own way to deal with this issue. An application of AI to epigraphy should therefore not pretend to be a remedy for this frustration, since it is clear that a too fragmentary inscription cannot be reasonably reconstructed. Yet, since the capacity of the human brain to keep in mind all relevant words, names and orthographic variants (and in consequence all possible reconstructions) is limited in comparison with a computer, a reconstruction aid may, in the best case, find reasonable fillers for some of the not yet reconstructed gaps which had escaped the attention of previous reconstructors. Especially in the case of Named Entities, a vast array of possibilities exists.

Furthermore, unreasonable candidates which such a system produces can be discarded by a human expert in a matter of seconds, leaving the user who is open to technology with a positive net outcome. One crucial question for an application of AI to epigraphy will be the rate at which good guesses can be produced. To assess this question, databases such as the Epigraphic Database Heidelberg or the database Clauss/Slaby may be seen as benchmark datasets which will enable computer scientists to evaluate their approaches against the reconstructions already conducted.

http://www.manfredclauss.de/

6 Conclusion

A corpus of Old Georgian inscriptions has been compiled. Additionally, a tool for epigraphic reconstruction has been sketched in order to raise awareness in the computer science community that such a task exists, that data sets for its evaluation exist and that the task is an interesting computational challenge involving both abbreviation resolution or generation and sequence prediction. To this end, we have only been able to show that in the case of Old Georgian, thanks to a large resource of Old Georgian texts from the internet, a reconstruction aid can produce on average 7 fillers for roughly 60% of gaps, with 60% of filler sets containing the correct solution. We hope for more general results and solutions in the future.

References

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Bodel, J. (2012). Latin Epigraphy and the IT Revolution. Proceedings of the British Academy, 177:275–296.

Boeder, W. (1987). Versuch einer sprachwissenschaftlichen Interpretation der altgeorgischen Abkürzungen. Revue des études géorgiennes et caucasiennes, 3:33–81.

Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7:171–176.

Danelia, K. and Sarzhveladze, Z. (2012). Kartuli p'aleograpia [Georgian Paleography]. Nekeri.

Driscoll, M. (2009). Marking up abbreviations in Old Norse-Icelandic manuscripts. In Medieval Texts–Contemporary Media. Ibis.

Gippert, J. (1995). TITUS. Das Projekt eines indogermanistischen Thesaurus ("TITUS. The project of an Indo-European thesaurus"). LDV-Forum, 12(2):35–47.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

McWilliam, L., Schepman, A., and Rodway, P. (2009). The linguistic status of text message abbreviations: An exploration using a Stroop task. Computers in Human Behavior, 25(4):970–974.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013b). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Černocký, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association.

Schwenk, H. (2007). Continuous space language models. Computer Speech & Language, 21(3):492–518.

Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, VL '96, pages 336–, Washington, DC, USA. IEEE Computer Society.

Slattery, T. J., Schotter, E. R., Berry, R. W., and Rayner, K. (2011). Parafoveal and foveal processing of abbreviations during eye fixations in reading: making a case for case. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(4):1022.

Taylor, W. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30:415–433.

Yang, D., Pan, Y.-c., and Furui, S. (2009). Automatic Chinese abbreviation generation using conditional random field. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 273–276. Association for Computational Linguistics.


Author Index

Marcel Bollmann
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Universitätsstr. 150, 44801 Bochum, Germany
https://marcel.bollmann.me/
bollmann@linguistics.rub.de

Stefanie Dipper
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Universitätsstr. 150, 44801 Bochum, Germany
https://www.linguistics.rub.de/~dipper
dipper@linguistics.rub.de

Armin Hoenen
CEDIFOR
Institut für Empirische Sprachwissenschaft
Johann Wolfgang Goethe-Universität Frankfurt
Senckenberganlage 31 (Juridicum), 60325 Frankfurt am Main, Germany
https://hucompute.org/team/armin-hoenen/
hoenen@em.uni-frankfurt.de

Thomas Klein
Institut für Germanistik und Vergleichende Literaturwissenschaft
Universität Bonn
Am Hofgarten 22, 53113 Bonn, Germany
https://www.germanistik.uni-bonn.de/institut/abteilungen/germanistische-linguistik/abteilung/personal/klein_thomas
thomas.klein@uni-bonn.de

Roland Mittmann
Institut für Empirische Sprachwissenschaft
Johann Wolfgang Goethe-Universität Frankfurt
Senckenberganlage 31 (Juridicum), 60325 Frankfurt am Main, Germany
http://titus.uni-frankfurt.de/personal/mittmann.htm
mittmann@em.uni-frankfurt.de

Florian Petran
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Universitätsstr. 150, 44801 Bochum, Germany
https://www.linguistics.rub.de/~petran/
florian.petran@gmail.com

Lela Samushia
Institut für Empirische Sprachwissenschaft
Johann Wolfgang Goethe-Universität Frankfurt
Senckenberganlage 31 (Juridicum), 60325 Frankfurt am Main, Germany
samushia@em.uni-frankfurt.de
