First Year

In the first year of the project (which mostly matches calendar year 2012), the CASMACAT project created a solid foundation for collaboration and innovation, carried out its first field trial, and developed a number of novel technologies for computer-aided translation.

Review Presentations

Results of the first year of the project were presented in Luxembourg on November 28, 2012. The following slides give details about the progress.

Overview

Recently, there have been significant improvements to machine translation technology, but the vast majority of this work has been targeted towards bulk translation that is "good enough" or "fit for use". High quality translation for publication is still almost exclusively provided by human translators.

Development of computer-aided translation (CAT) tools lags behind. While translation memories are in common use among professional translators, machine translation is being adopted only slowly. Currently, its use is simplistic, typically limited to post-editing, and based on machine translation systems that are not sufficiently adapted to the task at hand. Hence, human translators often resist the adoption of machine translation.

There is a clear need for better CAT technology, which CASMACAT attempts to provide.

Scope and Objectives

The CASMACAT project carries out cognitive studies of translator behavior. The resulting insights inform the interface design of a workbench with novel types of assistance for human translators, such as interactive translation prediction, interactive editing and reviewing, and advanced translation models. The effectiveness of the workbench is demonstrated in field tests with professional translators, with the main goal of increased translator productivity.

Scientific Overview: Year 1

In Year 1 of the project, two prototypes were developed. A first field trial was staged (in month 8), and its results analyzed.

A number of advanced methods were implemented, such as new paradigms for interactive machine translation (e-pen interaction, stochastic error-correction models, prediction for tree-based models), online adaptation of machine translation, new machine translation models (phrase-based hidden semi-Markov models, finite-state approach), and new types of assistance for human translators (word alignment visualization and confidence measures).

Workbench

The workbench consists of a web-based editor that connects to a server, which in turn communicates with the MT engine and a database. Additional components are a management tool, a handwriting recognition server, and the integration of the eye tracker.

The modular design allows integration of CASMACAT functionality into existing CAT tools.

The workbench was jointly developed with the Matecat project.

Prototype I

The initial prototype was developed by Copenhagen Business School in consultation with Celer Solutions. It allows for post-editing of machine translation, presenting source and target side by side. The tool is web-based and uses PHP, MySQL, and AJAX.

Prototype I was used in the field trial.

Eye Tracking

The workbench integrates support for an eye tracker, which allows detailed logging of the translation process.

Logging of eye gaze is implemented in the browser with JavaScript and a plug-in. The log allows analysis and replay, but some manual error correction is required to overcome mistakes of the tracker.

Prototype II

The second prototype was developed mainly by the Matecat project, with advances from the CASMACAT project integrated into it.

Field Trial

The field trial tackled news stories, translated from English into Spanish by 5 professional translators. We compared translation from scratch with post-editing machine translation. The machine translation system was a competitive Moses system, originally built by the University of Edinburgh for the WMT 2012 evaluation campaign.

Many valuable data resources were generated from the field trial. One of the translators used an eye tracker, which enables cognitive studies into translator behavior. Knowing which words were corrected by post-editors provides supervised training data for research into better word confidence measures. Post-editing time per sentence similarly fosters research into better sentence confidence measures. The translations generated by post-editing are also closer to the original output of the machine translation system than arbitrarily produced reference translations, and thus constitute better test data for interactive machine translation research.

Data and models from the field trial are publicly available.

Interactive Translation Prediction

In the CASMACAT project, we explore the extension of the post-editing paradigm by means of interactive machine translation, also referred to as interactive translation prediction (ITP). With this technology, the machine translation system proposes a new and improved translation hypothesis every time the human translator fixes some part of the sentence. Thus, a closer interaction between the machine translation system and the human expert is obtained, with the purpose of taking advantage of both the efficiency provided by the machine translation system and the correctness ensured by the human translator. The following example illustrates a typical ITP session, in which the human translator only needs to type three characters in order to obtain the complete and correct translation. In a post-editing scenario, she would have needed to erase three words and type two.

Prediction from Parse Forest

Interactive machine translation is a computer-aided translation method that predicts the completion of a sentence, given the beginning of a translation from a human translator. Previously, it has been based on the search graph of phrase-based models. In the CASMACAT project, we extend this method to syntax-based models, which have demonstrated better translation quality for some language pairs.
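The core idea of completing a translator's prefix can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the project's method: it searches a flat n-best list rather than a search graph or parse forest, it requires an exact prefix match, and the function name `complete_prefix`, the hypotheses, and the scores are all hypothetical.

```python
from typing import List, Optional, Tuple


def complete_prefix(nbest: List[Tuple[str, float]], prefix: str) -> Optional[str]:
    """Return the remainder of the best-scoring hypothesis that starts with prefix.

    nbest holds (hypothesis, model score) pairs; a higher score is better.
    Returns None when no hypothesis is compatible with the typed prefix.
    """
    matching = [(hyp, score) for hyp, score in nbest if hyp.startswith(prefix)]
    if not matching:
        return None
    best_hyp, _ = max(matching, key=lambda pair: pair[1])
    return best_hyp[len(prefix):]


# Hypothetical n-best list for one source sentence (scores are made up).
nbest = [
    ("the house is small", -1.2),
    ("the home is small", -1.5),
    ("the house is little", -2.0),
]

# The translator has typed "the hom"; only the second hypothesis is compatible.
suggestion = complete_prefix(nbest, "the hom")
```

A practical system would also need a fallback for prefixes that match no hypothesis, for example approximate matching against the search graph, which this sketch omits.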

Study of E-pen Gestures

The CASMACAT project explores a new input device, an e-pen, which is better suited to correction than the traditional mix of keyboard and mouse, which requires frequent switching between input devices. We defined a number of gestures that can be used for correcting machine translation output. Furthermore, we allow the translator to introduce and correct text by handwriting words.

Interactive Editing

The CASMACAT project investigates a number of technologies to assist the editing of translations. Sentence-level confidence estimates provide guidance on which translations should be edited by the user. For example, if editing a translation would require more effort than translating the sentence from scratch, that translation is not shown to the post-editor. Word-level confidence estimates inform the user of the parts of the automatic translation that are more likely to be wrong.
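How such estimates might drive the interface can be sketched as follows. The effort values, the confidence scores, the threshold of 0.5, and both function names are assumptions made for illustration; the sketch says nothing about how the estimates themselves are computed.

```python
from typing import List, Optional, Tuple


def select_suggestion(translation: str,
                      predicted_edit_effort: float,
                      scratch_effort: float) -> Optional[str]:
    """Show the MT output only if editing it is expected to cost no more
    than translating the sentence from scratch."""
    if predicted_edit_effort <= scratch_effort:
        return translation
    return None


def flag_low_confidence(words: List[str],
                        confidences: List[float],
                        threshold: float = 0.5) -> List[Tuple[str, bool]]:
    """Pair each word with a flag that is True when its confidence is below threshold."""
    return [(word, conf < threshold) for word, conf in zip(words, confidences)]


# Hypothetical values: the first suggestion is shown, the second is withheld.
shown = select_suggestion("la casa es pequeña", 0.3, 0.5)
hidden = select_suggestion("casa la pequeña es", 0.9, 0.5)
flags = flag_low_confidence(["la", "casa"], [0.9, 0.2])
```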

Using online word alignment, the CASMACAT workbench highlights the source words that are aligned to the target word at the current cursor position in the edit box.
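The lookup from cursor position to aligned source words can be illustrated with a small sketch. It assumes the target text is a list of tokens joined by single spaces and that the alignment is already available as (source index, target index) pairs; the function names and the example sentence are hypothetical, and computing the alignment itself is out of scope here.

```python
from typing import List, Set, Tuple


def token_index_at(tokens: List[str], cursor: int) -> int:
    """Map a character offset in the space-joined sentence to a token index.

    Assumes tokens is non-empty; offsets past the end map to the last token.
    """
    position = 0
    for index, token in enumerate(tokens):
        end = position + len(token)
        if cursor <= end:
            return index
        position = end + 1  # skip the separating space
    return len(tokens) - 1


def aligned_source_indices(alignment: List[Tuple[int, int]],
                           target_index: int) -> Set[int]:
    """Collect all source positions aligned to the given target position."""
    return {src for src, tgt in alignment if tgt == target_index}


# Hypothetical example with a reordered adjective.
source = ["the", "small", "house"]
target = ["la", "casa", "pequeña"]
alignment = [(0, 0), (1, 2), (2, 1)]

cursor = 4  # inside "casa" in "la casa pequeña"
target_index = token_index_at(target, cursor)
highlight = [source[i] for i in sorted(aligned_source_indices(alignment, target_index))]
```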

Rules from Translation Memory

Translation memories are an established technology for human translators. By retrieving a sentence pair with a similar source side to the current input sentence, translators are able to work from a target side translation with a small number of edits. But such fuzzy matches can be further improved by translating the mismatched source words with a statistical machine translation model. We integrated this approach into the Moses decoder.
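The first half of this idea, retrieving a fuzzy match and locating the mismatched source words, can be sketched with Python's standard `difflib` module. This is an illustration under stated assumptions and not the Moses integration: the similarity measure is `SequenceMatcher.ratio()` rather than whatever fuzzy-match score a real translation memory uses, and the memory contents and function names are hypothetical. Translating the mismatched spans with a statistical model is left out.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Entry = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def best_fuzzy_match(memory: List[Entry],
                     source_tokens: List[str]) -> Tuple[float, Entry]:
    """Return the similarity score and the TM entry whose source side is most similar."""
    def similarity(entry: Entry) -> float:
        return SequenceMatcher(None, entry[0], source_tokens, autojunk=False).ratio()

    best = max(memory, key=similarity)
    return similarity(best), best


def mismatched_spans(tm_source: List[str], source_tokens: List[str]):
    """List the spans where the TM source and the input differ.

    Each item is (operation, TM tokens, input tokens); these are the spans
    that a translation model would have to translate anew.
    """
    matcher = SequenceMatcher(None, tm_source, source_tokens, autojunk=False)
    return [(tag, tm_source[i1:i2], source_tokens[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


# Hypothetical translation memory with two entries.
memory = [
    (["the", "red", "house", "is", "small"], ["la", "casa", "roja", "es", "pequeña"]),
    (["i", "like", "tea"], ["me", "gusta", "el", "té"]),
]
sentence = ["the", "blue", "house", "is", "small"]

score, (tm_source, tm_target) = best_fuzzy_match(memory, sentence)
spans = mismatched_spans(tm_source, sentence)
```

Here the only mismatch is "red" versus "blue", so only that word would need a fresh translation while the rest of the retrieved target side could be reused.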

Online Updating

A common complaint of post-editors of machine translation output is that they have to correct the same errors over and over again. This is because conventional MT systems do not modify the translation models after the initial batch learning stage. In the CASMACAT project, we work on online model updating methods that instantly feed the translations validated by the user back into the translation model, thus providing evidence for the correct translation of the same words or phrases the next time around. The proposed online updating methods can be applied in both the post-editing (PE) and the interactive translation prediction (ITP) scenarios.
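The effect of feeding validated translations back into the model can be illustrated with a toy count-based phrase table. This is a sketch under strong assumptions, not the project's updating method: phrase pairs are handed in directly instead of being extracted through word alignment, probabilities are plain relative frequencies, and the class name `OnlinePhraseTable` and the example data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional


class OnlinePhraseTable:
    """Toy translation table whose relative frequencies change with every update."""

    def __init__(self) -> None:
        self.counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def update(self, source_phrase: str, target_phrase: str) -> None:
        """Record one validated phrase pair."""
        self.counts[source_phrase][target_phrase] += 1

    def probability(self, source_phrase: str, target_phrase: str) -> float:
        """Relative frequency of target_phrase among translations of source_phrase."""
        options = self.counts.get(source_phrase)
        if not options:
            return 0.0
        return options.get(target_phrase, 0) / sum(options.values())

    def best_translation(self, source_phrase: str) -> Optional[str]:
        """Most frequent translation seen so far, or None for an unseen phrase."""
        options = self.counts.get(source_phrase)
        if not options:
            return None
        return max(options, key=options.get)


table = OnlinePhraseTable()
table.update("bank", "banco")        # evidence from initial batch training
before = table.best_translation("bank")

# The post-editor corrects the same error twice; each correction is fed back.
table.update("bank", "orilla")
table.update("bank", "orilla")
after = table.best_translation("bank")
prob = table.probability("bank", "orilla")
```

After the two corrections the table prefers the translator's choice, which is the behavior post-editors ask for when they complain about repeated errors.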

Dissemination

Partners of the CASMACAT project organized the following events:

Results of the project were published in 19 publications, one Ph.D. thesis and one MSc thesis. The workbench was demonstrated at the AMTA conference.