Home Edition

User Guide

The CASMACAT Home Edition is a version of the CASMACAT Workbench that can be installed on your own home computer. It has been tested on Windows, MacOS, and Linux. You can train your own machine translation systems with your translation memories or publicly available data.

This user guide assumes that you have already installed the CASMACAT Home Edition. Once it is installed, you access it with your browser (currently only Chrome is supported), while the back end runs in a virtual machine on your computer.

The web interface offers this main menu:

First Steps: Using the Toy Model

The initial installation of the CASMACAT Home Edition ships with a very simple French-English machine translation system, which exists purely for demonstration purposes. We expect that a typical user of the CASMACAT Home Edition will want to build a customized machine translation system optimized for a given translation task.

The "Translate" section of the administrative interface may not be available; in its place you will find links to start the MT and CAT servers. If that is the case, click these links to make the Translate section appear.

To try out the toy model, click on Translate New Document. The next page asks you to upload a document to be translated. Use this example file. After you click Start Translating, the CASMACAT Workbench will open in a new tab of your browser. You may notice that the workbench runs on port 8000, while the administrative interface runs on the default port 80. If you do not notice this, never mind; it does not matter.

To learn more about how to use the workbench, refer to its user guide. Be aware that the machine translation system does very well on the given example file, but very badly otherwise. The reason is that the example file is part of the training data for the system.

Training a Customized Machine Translation System

The main purpose of the CASMACAT Home Edition is to let you use your own machine translation engine, one that best fits your purposes. The Home Edition allows you to build such an engine from existing translation memories. You can also download engines and share them with other users of the CASMACAT Home Edition.

You start the process of building your own machine translation system by clicking the link Build new prototype in the main menu. Select the language pair you are working on, and specify a few more settings:

Training Data

The essential ingredient for statistical machine translation is training data - specifically texts alongside their translations in segment-aligned (or "sentence-aligned") format. This type of data is the same as a translation memory. If you can extract the translation memory in XLIFF format from your existing translation tool, you are good to go.
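
To make "segment-aligned" concrete, the following minimal sketch reads source/target pairs from an XLIFF file. It assumes XLIFF 1.2 (other versions use a different namespace and element layout), and the function name and sample content are made up for illustration; the Home Edition does its own XLIFF import, so this is only meant for inspecting a translation memory before uploading it.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace (assumption: your tool exports version 1.2).
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_segment_pairs(xliff_text):
    """Return a list of (source, target) pairs from an XLIFF 1.2 string."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        if source is None or target is None:
            continue  # a unit without a translation is not usable as training data
        # itertext() also collects text nested inside inline markup elements.
        src = "".join(source.itertext()).strip()
        tgt = "".join(target.itertext()).strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Hypothetical two-unit sample; the second unit has no target and is skipped.
SAMPLE = """
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="fr" target-language="en" datatype="plaintext" original="sample.txt">
    <body>
      <trans-unit id="1">
        <source>Bonjour le monde.</source>
        <target>Hello world.</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Pas encore traduit.</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

pairs = read_segment_pairs(SAMPLE)
```

Each resulting pair is one aligned segment: a source sentence together with its translation.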

The more training data you have, the higher the quality of your machine translation system will be. How much training data is needed to get reasonable results depends on the language pair, the narrowness of the domain, and other factors. One million words is an often-cited number, but you may get good enough results with 100,000 words in a very repetitive domain.
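
To compare your own data against these figures, a rough word count is enough. The sketch below counts whitespace-separated tokens, which is an assumption made here for simplicity and only approximates how a translation tool counts words; the function names are hypothetical.

```python
def count_words(segments):
    """Count whitespace-separated tokens across all segments."""
    return sum(len(segment.split()) for segment in segments)


def describe_corpus_size(source_segments):
    """Relate a corpus size to the two figures mentioned in this guide."""
    words = count_words(source_segments)
    if words >= 1_000_000:
        return words, "at or above the often-cited one million words"
    if words >= 100_000:
        return words, "may be enough for a very repetitive domain"
    return words, "below 100,000 words"


size, note = describe_corpus_size(["Hello world.", "A third segment here."])
```

The thresholds are the rules of thumb quoted above, not hard limits.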

The CASMACAT Home Edition allows you to use additional training data from public sources. After selecting the language pair, the link Public corpora will appear. Click it, and you will see a list of publicly available corpora for your language pair. We integrated into the workbench a web service that queries public repositories such as the OPUS project and a repository hosted at the CASMACAT web site. For many European languages a diversity of training data is thus accessible.

For instance, if you download the KDE4 corpus (an open source software documentation corpus), and then select it as part of your training data, the engine building page will look as follows:

Tuning and Test Data

The translation model is trained from the selected training data, but two additional sets of translated segments are needed: the tuning set that is used to find optimal weights for different model components (for instance to balance literalness and fluency), and the test set that is used to give an assessment of the translation quality of the trained system.

If you plan to experiment with multiple selections of training data, it is better to set aside dedicated tuning and test sets and keep them constant. These sets should consist of about 1000 segments, maybe more.

If you only want to build a basic system, stick to the defaults: the tuning and test sets will be automatically sampled from the training data.
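
If you prefer to prepare dedicated tuning and test sets yourself, the idea can be sketched as follows. This is an illustration under stated assumptions, not the sampling code used by the Home Edition: the function name is hypothetical, the held-out segments are removed from the training portion, and a fixed random seed is used so that the same sets are produced every time.

```python
import random


def split_corpus(pairs, held_out=1000, seed=0):
    """Split segment pairs into (train, tune, test) with no overlap."""
    if len(pairs) <= 2 * held_out:
        raise ValueError("corpus too small for the requested held-out sets")
    shuffled = list(pairs)                     # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed gives a repeatable split
    tune = shuffled[:held_out]
    test = shuffled[held_out:2 * held_out]
    train = shuffled[2 * held_out:]
    return train, tune, test


# Hypothetical corpus of 5000 placeholder segment pairs.
corpus = [(f"source {i}", f"target {i}") for i in range(5000)]
train, tune, test = split_corpus(corpus)
```

Keeping the seed fixed is what makes results comparable when you experiment with different selections of training data.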

Training Process

Training will take several hours, and possibly more than a day. It uses the Experimental Management System developed as part of the Moses statistical machine translation toolkit. If you train multiple systems, the management system is smart enough to figure out when processing steps from prior runs can be re-used.

Progress is reported in the status window of the administrative interface at the bottom of the page.

Managing Machine Translation Engines

In the terminology of the CASMACAT Home Edition, machine translation model training results in a prototype. Technically, this is a training run based on a given configuration specification, resulting in a collection of machine translation model components that was tuned and tested on the specified tuning and test sets. Its performance is measured on the test set. You may choose at any time to change the training conditions and build another prototype, which may share components with a prior one. For instance, if a different tuning set is chosen, then the language model and translation model will be re-used, but the model weights will be changed.
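
The re-use of components between prototypes can be pictured with a small sketch. It is not how the Experimental Management System is implemented; it only illustrates the general idea that a step can be skipped when neither its settings nor the steps it depends on have changed. The step names, setting names, and functions are all hypothetical.

```python
import hashlib
import json


def step_key(step, settings, upstream_keys=()):
    """Fingerprint a step from its name, its settings, and the steps it depends on."""
    payload = json.dumps([step, settings, list(upstream_keys)], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_run(config, cache):
    """Return {step: "reuse" or "run"} and record new fingerprints in cache."""
    lm = step_key("language-model", {"corpus": config["training"]})
    tm = step_key("translation-model", {"corpus": config["training"]})
    # Tuning depends on both models as well as on the tuning set.
    tuning = step_key("tuning", {"set": config["tuning"]}, [lm, tm])
    plan = {}
    for name, key in [("language-model", lm), ("translation-model", tm), ("tuning", tuning)]:
        plan[name] = "reuse" if key in cache else "run"
        cache.add(key)
    return plan


cache = set()
first = plan_run({"training": "corpus-a", "tuning": "tune-a"}, cache)
second = plan_run({"training": "corpus-a", "tuning": "tune-b"}, cache)
```

In this sketch, the second prototype changes only the tuning set, so the language model and translation model are marked for re-use while tuning runs again.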

Once you are satisfied with the test performance of a built prototype, it can be converted into a machine translation engine. An engine - according to our definition - is the set of all relevant model files and settings in a self-contained package. Engines can be downloaded from the CASMACAT Home Edition and shared with other translators, who can upload them into their own CASMACAT Home Edition installation.

Below is a screenshot of the administrative interface view that allows the management of machine translation engines. Here, several prototypes have been built for various language pairs (English-French, English-Spanish, French-English, Spanish-English). Some of the prototypes have been converted into engines. One of the engines, here the French-English "(x1) Toy" engine, has been selected for deployment, meaning that the back end of the CASMACAT Workbench currently uses it.

The administrative interface allows the deployment of any available engine, the creation of engines from prototypes, and the deletion of engines or prototypes. Ongoing training runs can be interrupted or resumed.

A link to Inspect Details in Prototype Factory connects to the web interface of the Experimental Management System, which gives more details of the training process. The information shown there is highly technical, so do not worry if it is somewhat opaque.

Once an engine is built, it can be downloaded. It takes a few seconds (or minutes) to package it into a compressed file collection (with the extension tgz). These packaged engines can be shared with other users of the CASMACAT Home Edition. The main menu has the link Upload Engine, which allows for just that.
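
A tgz file is a gzip-compressed tar archive, so you can look inside a downloaded engine with standard tools before sharing it. The sketch below lists the files in such an archive using Python's tarfile module. The internal layout of an engine package is not described in this guide, so the demo archive and its file names are placeholders built in memory purely for illustration.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return (name, size in bytes) for each regular file in a .tgz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]


def build_demo_archive():
    """Build a tiny in-memory .tgz with placeholder files (not a real engine)."""
    files = [
        ("demo-engine/settings.txt", b"placeholder\n"),
        ("demo-engine/model.bin", b"demo\n"),
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)  # rewind so the archive can be read from the start
    return buffer


members = list_archive_members(build_demo_archive())
```

For a real download, open the file in binary mode, for example with open(path, "rb"), and pass the file object to list_archive_members.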

Settings

You may set the specific functionality of the workbench under CAT Settings.