English-Hindi Translation in 21 Days Ondřej Bojar, Pavel Straňák, Daniel Zeman

English-Hindi Translation in 21 Days

Data

Impact of additional data

Impact of additional data

Moses setup

Alignment heuristics

Alignment heuristics: CS-EN

Moses settings

Rule-based reordering

Reordering example

Unsupervised stem-suffix segmentation

Core Idea

Paradigms get filtered

Paradigm Examples (en)

Paradigm Examples (hi)

Learning Phase Outcomes

Morphemic Segmentation

Impact of our preprocessing

English-Hindi Translation in 21 Days

Ondřej Bojar, Pavel Straňák, Daniel Zeman

ÚFAL MFF, Univerzita Karlova, Praha

Data

Parallel (en-hi)

Monolingual (hi)

Impact of additional data

Larger parallel data helps

Impact of additional data

Larger Hindi LM data does not help

Moses setup

Alignment heuristics: grow-diag-final-and (GDFA)
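As an illustration of what GDFA does, here is a minimal sketch of symmetrizing two directional word alignments. It follows the commonly described formulation of grow-diag-final-and; the implementation shipped with Moses may differ in iteration order and other details. The function name and the representation of alignments as sets of (English index, Hindi index) pairs are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[int, int]  # (english_index, hindi_index); representation assumed here

# Horizontal, vertical and diagonal neighbours of an alignment point.
NEIGHBORS = [(-1, 0), (0, -1), (1, 0), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]


def grow_diag_final_and(e2f: Iterable[Point], f2e: Iterable[Point]) -> Set[Point]:
    """Symmetrize two directional alignments (illustrative sketch)."""
    e2f, f2e = set(e2f), set(f2e)
    union = e2f | f2e
    alignment = e2f & f2e  # start from the high-precision intersection

    def e_aligned(e: int) -> bool:
        return any(p[0] == e for p in alignment)

    def f_aligned(f: int) -> bool:
        return any(p[1] == f for p in alignment)

    # grow-diag: repeatedly add neighbouring union points that cover
    # at least one word that is still unaligned.
    added = True
    while added:
        added = False
        for e, f in sorted(alignment):
            for de, df in NEIGHBORS:
                cand = (e + de, f + df)
                if (cand in union and cand not in alignment
                        and (not e_aligned(cand[0]) or not f_aligned(cand[1]))):
                    alignment.add(cand)
                    added = True

    # final-and: add remaining directional points only if BOTH words
    # are still unaligned.
    for directional in (e2f, f2e):
        for e, f in sorted(directional):
            if not e_aligned(e) and not f_aligned(f):
                alignment.add((e, f))
    return alignment
```

The "and" in the final step is the conservative variant: a leftover point is accepted only when neither of its words has any link yet.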

Alignment heuristics

Alignment heuristics: CS-EN

Moses settings

Alignment using first four characters (“light stemming”)

MERT tuning of feature weights
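The "light stemming" used for alignment can be sketched as simple prefix truncation applied to both sides of the corpus before the aligner runs. The function names are hypothetical, and the sketch assumes that "characters" means Unicode code points; for Devanagari, where vowel signs are separate code points, the original setup may have counted differently.

```python
def light_stem(token: str, prefix_len: int = 4) -> str:
    """Keep only the first `prefix_len` code points of a token."""
    return token[:prefix_len]


def light_stem_sentence(sentence: str, prefix_len: int = 4) -> str:
    """Apply light stemming to every whitespace-separated token."""
    return " ".join(light_stem(tok, prefix_len) for tok in sentence.split())


if __name__ == "__main__":
    print(light_stem_sentence("translation systems are improving"))
```

The truncated text would be used only to compute the word alignment; the alignment points are then applied to the original full word forms, since token positions are unchanged.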

Rule-based reordering

Move finite verb forms to the end of the sentence (not crossing punctuation, “that”, WH-words).

Transform prepositions to postpositions

TectoMT, Morče tagger (perceptron), McDonald’s MST parser
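The verb-movement rule can be illustrated with a simplified sketch that works on part-of-speech tags only. This is not the TectoMT pipeline named above: it assumes Penn Treebank tags, treats VBD/VBP/VBZ/MD as "finite", and approximates clause boundaries by punctuation, the word "that", and tags starting with "W". All names are hypothetical. The preposition-to-postposition rule is left out because it needs noun phrase boundaries from a parser.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); Penn Treebank tags assumed

FINITE_VERB_TAGS = {"VBD", "VBP", "VBZ", "MD"}  # assumption for this sketch
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def is_barrier(tok: Token) -> bool:
    """Tokens a moved verb must not cross: punctuation, 'that', WH-words."""
    word, tag = tok
    return word in PUNCTUATION or word.lower() == "that" or tag.startswith("W")


def reorder_segment(segment: List[Token]) -> List[Token]:
    """Move finite verbs to the end of one barrier-free segment."""
    verbs = [t for t in segment if t[1] in FINITE_VERB_TAGS]
    others = [t for t in segment if t[1] not in FINITE_VERB_TAGS]
    return others + verbs


def move_finite_verbs(tokens: List[Token]) -> List[Token]:
    """Apply verb-final reordering inside each segment between barriers."""
    out: List[Token] = []
    segment: List[Token] = []
    for tok in tokens:
        if is_barrier(tok):
            out.extend(reorder_segment(segment))
            out.append(tok)
            segment = []
        else:
            segment.append(tok)
    out.extend(reorder_segment(segment))
    return out


if __name__ == "__main__":
    sent = [("I", "PRP"), ("think", "VBP"), ("that", "IN"),
            ("she", "PRP"), ("likes", "VBZ"), ("tea", "NN"), (".", ".")]
    print(" ".join(w for w, _ in move_finite_verbs(sent)))
```

Each verb stays inside its own segment, so the main-clause verb does not jump over "that" into the subordinate clause.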

Reordering example

Unsupervised stem-suffix segmentation

Factors in Moses

Core Idea

Assumption: 2 morphemes: stem+suffix

All splits of all words

Set of suffixes seen with the same stem is a paradigm
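The three points above can be sketched as follows. The function names are hypothetical, the empty suffix is written as "0" to match the paradigm examples, and the sketch assumes stems are non-empty. It covers only the collection step, with no filtering or merging of paradigms.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Iterator, Set, Tuple


def all_splits(word: str) -> Iterator[Tuple[str, str]]:
    """Yield every (stem, suffix) split; the empty suffix is written '0'."""
    for i in range(1, len(word) + 1):
        yield word[:i], word[i:] or "0"


def collect_paradigms(words: Iterable[str]) -> Dict[FrozenSet[str], Set[str]]:
    """Map each paradigm (a set of suffixes) to the stems that exhibit it."""
    suffixes_by_stem: Dict[str, Set[str]] = defaultdict(set)
    for word in set(words):
        for stem, suffix in all_splits(word):
            suffixes_by_stem[stem].add(suffix)

    stems_by_paradigm: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for stem, suffixes in suffixes_by_stem.items():
        stems_by_paradigm[frozenset(suffixes)].add(stem)
    return dict(stems_by_paradigm)


if __name__ == "__main__":
    vocab = ["walk", "walked", "walks", "talk", "talked", "talks"]
    for paradigm, stems in collect_paradigms(vocab).items():
        print(sorted(paradigm), sorted(stems))
```

On a real vocabulary this produces a very large number of candidate paradigms, most of them supported by a single stem, which is why a filtering step is needed afterwards.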

Paradigms get filtered

Remove the paradigm if:

Merge paradigms A and B if:

Paradigm Examples (en)

Suffixes: e, ed, es, ing, ion, ions, or

Stems: calibrat, decimat, equivocat, …

Suffixes: e, ed, es, ing, ion, or, ors

Stems: aerat, authenticat, disseminat, …

Suffixes: 0, d, r, r’s, rs, s

Stems: analyze, chain-smoke, collide, …

Paradigm Examples (hi)

Suffixes: 0, ा, े, ों

Stems: अहात, खांच, घुटन, चढ़ाव, …

Suffixes: 0, ं, ंगे, गा

Stems: कराए, दर्शाए, फेंके, बदले, …

Suffixes: 0, ि, ियां, ियों

Stems: अनुभूत, अभिव्यक्त, …

Learning Phase Outcomes

List of paradigms

List of known stems

List of known suffixes

List of stem-suffix pairs seen together

How can we use that to segment a word?

Morphemic Segmentation

Consider all possible splits of the word

We use 4 (longest known suffix)
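A possible way to combine the learned lists when segmenting a word is sketched below. The ranking of candidate splits is an illustrative assumption, not the scoring used in the original system: a stem-suffix pair seen together beats a known stem with a known suffix, which beats a known suffix alone; ties go to the longer suffix. The sketch also reads "4" as a cap on suffix length, which is a guess. All names are hypothetical.

```python
from typing import Set, Tuple


def segment(word: str,
            stems: Set[str],
            suffixes: Set[str],
            pairs: Set[Tuple[str, str]],
            max_suffix_len: int = 4) -> Tuple[str, str]:
    """Return the preferred (stem, suffix) split; '0' means no suffix."""
    best_rank = (0, 0)          # (score, suffix length) of the current best
    best_split = (word, "0")    # fall back to leaving the word unsplit
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if len(suffix) > max_suffix_len:
            continue
        if (stem, suffix) in pairs:
            score = 3           # pair observed together during learning
        elif stem in stems and suffix in suffixes:
            score = 2           # both known, but never seen together
        elif suffix in suffixes:
            score = 1           # only the suffix is known
        else:
            continue
        rank = (score, len(suffix))
        if rank > best_rank:
            best_rank, best_split = rank, (stem, suffix)
    return best_split


if __name__ == "__main__":
    stems = {"walk", "talk"}
    suffixes = {"0", "ed", "s", "ing"}
    pairs = {("walk", "ed"), ("walk", "s"), ("talk", "s")}
    for w in ["walked", "talking", "jumps", "the"]:
        print(w, segment(w, stems, suffixes, pairs))
```

The fallback to an unsplit word keeps unknown forms intact, so the segmenter never invents a suffix that was not learned.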

Impact of our preprocessing