Journal for Language Technology and Computational Linguistics



JLCL 2016 – Band 31 (2) 9

Petran, Bollmann, Dipper, Klein

Token: Wise, dine, burste, dinen, lif
pos (token): VVIMP, DPOSA, KON, DPOSA
pos (type): DPOS
infl: Sg.2, Fem.Akk.Pl.st, Akk.Pl, –, Masc.Akk.Sg.st, Akk.Sg
inflClass: –, st(u).Fem, –, st.Masc

Table 3: Final annotations for this fragment. Lemma and other annotations are omitted here, but are visible in the final corpus. The tokens are shown in simplified spelling.

(VVIMP), a pos (type) tag for a full verb, infl showing only 2nd person singular form,
and an inflClass tag showing the weak inflection class. The second token is annotated
as a possessive determinative that precedes its noun phrase (DPOSA). This is not
explicitly annotated in the internal tagset, but it can easily be inferred, since
preceding the noun phrase is the default case for determinatives.

Difficulties arise in cases where HiTS makes distinctions that are not made in the
internal tagset. For example, the noun ‘burſte’ (‘breast’) is annotated as belonging
to the strong inflection class in HiTS, but the internal tagset does not capture this
information. This had to be solved by a combination of the analysis of the lemma form
and list lookup: if the lemma ends in a consonant, the noun has a strong inflection
class. Lemmas ending in ‘-e’ have to be looked up for weak or strong inflection classes.
Lemmas ending in other vowels are always weakly inflected. Similar lists had to be
built for other parts of speech that lacked distinctions, such as pronouns, articles and
numerals, as well as verbs, auxiliary verbs, and modal verbs.
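The lemma-based rule for nouns described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the vowel inventory, and the lookup entries are assumptions made for the example, and the two listed lemmas are illustrative only and are not taken from the project's actual lists.

```python
# Sketch of the noun inflection-class heuristic: consonant-final lemmas are
# strong, lemmas in '-e' need a list lookup, other vowel-final lemmas are weak.

# Assumed vowel inventory (hypothetical; the paper does not specify one).
VOWELS = set("aeiouäöüâêîôûæœ")

# Hypothetical lookup list for lemmas ending in '-e' (illustrative entries only).
E_LEMMA_CLASSES = {
    "bote": "weak",
    "gebe": "strong",
}


def noun_inflection_class(lemma, e_lookup=E_LEMMA_CLASSES):
    """Return 'strong', 'weak', or None if an '-e' lemma is not listed."""
    if not lemma:
        raise ValueError("lemma must be a non-empty string")
    last = lemma[-1].lower()
    if last not in VOWELS:
        return "strong"            # consonant-final lemma
    if last == "e":
        return e_lookup.get(lemma)  # must be looked up; None if unknown
    return "weak"                  # any other final vowel


if __name__ == "__main__":
    for lemma in ["brust", "bote", "gebe", "unlisted-e"]:
        print(lemma, noun_inflection_class(lemma))
```

Returning `None` for an unlisted ‘-e’ lemma keeps missing list entries visible instead of silently guessing a class.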

Some distinctions could not be reconstructed by looking at the token alone. One
example of this is the annotation of pronominal adverbs that introduce a relative clause
(as opposed to interrogative usage) as PAVREL. Reconstructing these distinctions would
have required access to the syntactic context, which the tools cannot provide. In that
sense, the tagset used here represents a subset of the entire HiTS tagset.

The final output of this annotation process is a flat XML file based on the modernized
tokenization only; the historical tokenization has to be inferred using the transcription
standards (see Sec. 3). It is converted into CorA-XML format (see Sec. 5) to regain
flexibility with regard to the tokenization layers.

5 CorA-XML document format

For further processing of the annotated data, we choose to convert it into the CorA-XML
document format. This XML-based format was originally developed for the web-based
annotation tool CorA (https://www.linguistics.rub.de/comphist/resources/cora/;
Bollmann et al., 2014), and is specifically designed for the
needs of historical documents. CorA is actively used to annotate historical texts for
the reference corpora of Early New High German (ReF) and Middle Low German/Low
Rhenish (ReN), as well as the Anselm corpus of Early New High German (Dipper
and Schultz-Balluff, 2013). Converting ReM to the same format therefore significantly
increases reusability of tools and facilitates further processing of the data. Furthermore,
we are actively working on an automatic conversion from CorA-XML to a TEI-compatible
format, which will open up the data for use with an even wider range of existing tools.

ReM: A reference corpus of Middle High German

Figure 1: Simplified CorA-XML representation of “ſo biſtu” with annotations

Figure 2: Simplified example of the layout hierarchy in CorA-XML

CorA-XML distinguishes between two different tokenization layers, whose elements
are represented by <dipl> and <mod> tags respectively, corresponding to the distinction
between diplomatic and annotated tokens in ReM (cf. Sec. 3). Since there can be a
one-to-many (or even many-to-many) relationship between elements of these layers (as in the
‘biſtu’ example from Fig. 4 below), they are always wrapped by a virtual <token> element,
which establishes this correspondence. Within each layer, different representations of the
wordforms can be included, e.g., a UTF-8 representation conserving special characters
(such as ‘ſ’), or a pure ASCII representation (mapping ‘ſ’ to ‘s’). On the annotated
tokenization layer, arbitrary annotations can be added to each token, encoding the
linguistic layer and punctuation layer described in Sec. 3. Figure 1 gives a simplified
example of the CorA-XML representation for the sequence ‘ſo biſtu’ from Figure 4.
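To make the two-layer structure more concrete, the following sketch parses a small hand-written fragment and lists which annotated tokens belong to which diplomatic tokens. The fragment is an assumption made for illustration: the attribute names (`id`, `utf`, `ascii`), the `<pos>` child, the tag values, and the split of ‘biſtu’ into two annotated tokens are not taken from the corpus or from the CorA-XML specification and may differ from the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: each <token> wraps the diplomatic (<dipl>)
# and annotated (<mod>) tokens that correspond to each other.
SAMPLE = """
<text>
  <token id="t1">
    <dipl id="t1_d1" utf="ſo"/>
    <mod id="t1_m1" utf="ſo" ascii="so"/>
  </token>
  <token id="t2">
    <dipl id="t2_d1" utf="biſtu"/>
    <mod id="t2_m1" utf="biſt" ascii="bist"><pos tag="VAFIN"/></mod>
    <mod id="t2_m2" utf="u" ascii="u"><pos tag="PPER"/></mod>
  </token>
</text>
""".strip()


def token_alignment(root):
    """Return one (diplomatic forms, annotated ASCII forms) pair per <token>."""
    rows = []
    for token in root.iter("token"):
        dipl_forms = [d.get("utf") for d in token.findall("dipl")]
        mod_forms = [m.get("ascii") for m in token.findall("mod")]
        rows.append((dipl_forms, mod_forms))
    return rows


ALIGNMENT = token_alignment(ET.fromstring(SAMPLE))

if __name__ == "__main__":
    for dipl_forms, mod_forms in ALIGNMENT:
        print(dipl_forms, "->", mod_forms)
```

Because the correspondence is carried by the wrapping element, a one-to-many case such as the second token needs no explicit alignment table.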

Layout information is encoded via a hierarchy of layout elements, namely ‘pages’,
‘columns’, and ‘lines’. Each instance of an element contains a pointer to one or more
elements of the next lower type in the hierarchy; i.e., pages refer to columns, which in
turn refer to lines. Each ‘line’ element finally refers to one or more diplomatic tokens.
Figure 2 provides an example visualization of this hierarchy. A valid layout specification
in a CorA-XML document requires that each diplomatic token is contained in the span
of exactly one ‘line’ element, which makes it possible to derive an exact page, column,
and line specification for each diplomatic token.
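The validity condition on the layout hierarchy can be sketched as a small check. The in-memory representation below (dictionaries mapping each element id to the ids of its children) and all ids are assumptions made for this example; it does not reproduce how CorA-XML actually serializes the pointers.

```python
# Hypothetical layout hierarchy: pages -> columns -> lines -> diplomatic tokens.
PAGES = {"p1": ["c1"]}
COLUMNS = {"c1": ["l1", "l2"]}
LINES = {"l1": ["t1_d1", "t2_d1"], "l2": ["t3_d1"]}
DIPL_IDS = ["t1_d1", "t2_d1", "t3_d1"]


def locate_tokens(pages, columns, lines, dipl_ids):
    """Map each diplomatic token id to its (page, column, line).

    Raises ValueError unless every token lies in exactly one line that is
    reachable from a page via a column.
    """
    seen = {}
    for page, cols in pages.items():
        for col in cols:
            for line in columns[col]:
                for tok in lines[line]:
                    seen.setdefault(tok, []).append((page, col, line))
    invalid = [t for t in dipl_ids if len(seen.get(t, [])) != 1]
    if invalid:
        raise ValueError(f"tokens not in exactly one line: {invalid}")
    return {t: seen[t][0] for t in dipl_ids}


LOCATIONS = locate_tokens(PAGES, COLUMNS, LINES, DIPL_IDS)

if __name__ == "__main__":
    for tok, (page, col, line) in LOCATIONS.items():
        print(tok, page, col, line)
```

A token that appears in no line, or in more than one, is reported instead of being assigned an arbitrary position.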
