Evaluation of Hindi→English, Marathi→English and English→Hindi CLIR at FIRE 2008
Nilesh Padariya, Manoj Chinnakotla, Ajay Nagesh and Om P. Damani







CLIR System Architecture



System Flow Example



First Participation in CLEF 2007

  • Developed a basic query translation system for Hindi to English and Marathi to English

  • Transliteration Algorithm

    • Simple rule-based system
    • Edit-distance based index-lookup to retrieve index tokens
    • Accuracy: ~65% at top 20
  • Translation Disambiguation

  • Performance at CLEF 2007

    • Hindi to English: 67.06% of monolingual
    • Marathi to English: 56.09% of monolingual
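
The edit-distance based index lookup mentioned above can be illustrated with a minimal sketch. It assumes plain Levenshtein distance and a brute-force scan over the index vocabulary; the slides do not say which distance variant or lookup structure the system used, and the names edit_distance, lookup_index_tokens and the sample tokens are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def lookup_index_tokens(candidate: str,
                        index_tokens: List[str],
                        top_k: int = 20) -> List[Tuple[str, int]]:
    """Return the top_k index tokens closest to a transliteration candidate.

    Assumption: a linear scan over the vocabulary; a real system would
    likely prune the search instead of comparing against every token.
    """
    scored = [(tok, edit_distance(candidate, tok)) for tok in index_tokens]
    # Sort by distance, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical vocabulary, for illustration only.
vocabulary = ["mumbai", "bombay", "delhi", "mumbra"]
closest = lookup_index_tokens("mumbay", vocabulary, top_k=2)
```

The rule-based transliterator only has to get close to the target spelling; the lookup then snaps its output to tokens that actually occur in the document index.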


Failure Analysis for CLEF 2007



Transliteration

  • Collection of parallel list of names for evaluation

    • Available datasets too small
    • Do not contain a good mix of native and loan words
    • Our current dataset: around 25K words
  • Algorithmic Improvements

  • Current accuracy figures

    • Hindi to English: 80% accuracy at rank 5
    • English to Hindi: Evaluation to be done
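
A figure such as "80% accuracy at rank 5" can be computed as in the sketch below. It assumes each source word has exactly one gold transliteration and counts a hit when that gold form appears among the top k system candidates; the function name and the toy data are hypothetical and are not taken from the 25K-word dataset.

```python
from typing import Dict, List


def accuracy_at_rank(ranked: Dict[str, List[str]],
                     gold: Dict[str, str],
                     k: int) -> float:
    """Fraction of source words whose gold transliteration is in the top k."""
    if not gold:
        return 0.0
    hits = sum(1 for src, ref in gold.items()
               if ref in ranked.get(src, [])[:k])
    return hits / len(gold)


# Hypothetical toy data: source-word ids mapped to ranked candidates.
gold = {"w1": "delhi", "w2": "mumbai"}
ranked = {
    "w1": ["dilli", "delhi"],
    "w2": ["mumbay", "mumbae", "mumbai"],
}
acc_at_2 = accuracy_at_rank(ranked, gold, k=2)  # only w1 is a hit at rank 2
```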


Translation Disambiguation

  • Empirical study on translation disambiguation strategies and parameter choices

  • Choice of disambiguation strategy

    • Best Pair
    • Best cohesion
    • Best sequence
    • Iterative
  • Various parameters to the iterative disambiguation algorithm

    • Number of final candidates to choose
    • Use of weights?
    • Similarity measure
  • Datasets used: TREC AP, CLEF 2007

  • Best choice: iterative strategy with the Dice coefficient and 1 translation candidate; weights do not improve results much
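
The reported best configuration (iterative strategy, Dice coefficient, one final candidate) can be illustrated with a simplified sketch. It assumes the Dice coefficient is computed from document co-occurrence and uses an additive update followed by normalisation; the update rule, the iteration count and all names and data below are assumptions made for illustration, not the system's actual implementation.

```python
from typing import Dict, List, Set


def dice(docs_a: Set[int], docs_b: Set[int]) -> float:
    """Dice coefficient of two terms, given the documents containing each."""
    total = len(docs_a) + len(docs_b)
    return 2.0 * len(docs_a & docs_b) / total if total else 0.0


def iterative_disambiguation(candidates: List[List[str]],
                             postings: Dict[str, Set[int]],
                             iterations: int = 10) -> List[str]:
    """Pick one translation per query term.

    Each candidate is reinforced by its association with the candidates of
    the *other* query terms, weighted by their current scores.
    Assumption: every query term has at least one candidate.
    """
    # Start from a uniform distribution over each term's candidates.
    weights = [{c: 1.0 / len(cands) for c in cands} for cands in candidates]
    for _ in range(iterations):
        updated = []
        for i, cands in enumerate(candidates):
            new = {}
            for c in cands:
                support = sum(
                    dice(postings.get(c, set()), postings.get(o, set())) * w
                    for j, other in enumerate(weights) if j != i
                    for o, w in other.items()
                )
                new[c] = weights[i][c] + support
            norm = sum(new.values())
            updated.append({c: v / norm for c, v in new.items()})
        weights = updated
    # Keep a single final candidate per term.
    return [max(w, key=w.get) for w in weights]


# Hypothetical two-term query with made-up document ids.
postings = {"bank": {1, 2, 3}, "shore": {4}, "money": {1, 2}, "currency": {3}}
chosen = iterative_disambiguation([["bank", "shore"], ["money", "currency"]],
                                  postings)
```

In this toy query "shore" never co-occurs with either candidate of the second term, so its normalised weight shrinks on every pass while "bank" and "money" reinforce each other.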



Only Transliteration on Query?

  • Motivation

    • English documents in Indian domains quite commonly use the actual Hindi word
    • Examples:
    • NEs crucial for fetching relevant documents
  • Experiments

    • Transliterate whole query
    • Transliterate only NEs, no translation


Overall Results (Title Only)



P-R Curves for English Target



P-R Curves for Hindi Target



Results of Transliteration Experiment



P-R Curves for Transliteration Experiment



Conclusion

  • Improved transliteration and translation disambiguation modules based on CLEF 2007 analysis

  • Hindi to English CLIR performance is 75% of monolingual and Marathi to English is 64% of monolingual

  • Results need further investigation, especially the monolingual baselines (Hindi, Marathi and English)

  • Transliteration alone achieves around 35% of monolingual performance in Hindi and 25% in Marathi
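
The "% of monolingual" figures quoted above are ratios of a cross-lingual run's score to the score of a monolingual run on the same collection. The sketch below assumes a single effectiveness score per run, such as mean average precision; the slides do not name the measure, and the scores used here are hypothetical placeholders, not results from FIRE 2008.

```python
def percent_of_monolingual(clir_score: float, mono_score: float) -> float:
    """Express a cross-lingual run's score as a percentage of the monolingual score."""
    if mono_score <= 0:
        raise ValueError("monolingual score must be positive")
    return 100.0 * clir_score / mono_score


# Hypothetical scores, chosen only to show the arithmetic.
relative = percent_of_monolingual(clir_score=0.30, mono_score=0.40)
```

Because the monolingual score is the denominator, a weak monolingual baseline inflates the percentage, which is why the baselines themselves need checking.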



Acknowledgements

  • The second author is supported by the Infosys Fellowship Award

  • Project linguists at CFILT, IIT Bombay





Thanks!



