
Statistical Machine Translation using Moses

PCQ Bureau

-Nirav Shah and Sumit Goswami


Machine translation is a growing research area due to its application in
providing fast and meaningful translation of text and speech from one language
to another. It can be done either with rules, where hand-crafted rules convert
one language into another, or through statistical machine translation (SMT).
SMT uses statistical methods, together with a parallel corpus, to learn a
translation model, and translation can be word based or phrase based. The
success of a machine translation system depends on how well the words of one
language are aligned with the words of the other. An SMT system allows a
translation model to be trained for any language pair; the main requirement is
a bilingual parallel corpus. Various translation methods are used, such as
factored, beam-search and phrase-based translation.

Direct Hit!
Applies To: Language Translators, Computational Linguists, NLP Researchers
Price: Free (GNU GPL)
USP: Create your own language translator
Primary Link: www.statmt.org/moses
Google Keywords: Statistical Machine Translation, Moses

Various machine translation tools are available, such as Apertium (GNU
license), OpenLogos (the open source version of the Logos Machine Translation
System), SYSTRAN (one of the oldest machine translation companies) and Moses
(GNU General Public License).


Moses is a phrase-based machine translation tool for converting text from one
language to another. Technical details are available on the Moses website,
www.statmt.org/moses. In this article, we give a short step-by-step process
for translating text from one language to another, for example English to
Hindi, using the Moses machine translation tool.

Parallel Corpus

Prepare a parallel corpus for the source (English) and target (Hindi)
languages, which will be used for training the translation model. This corpus
can be prepared from your existing translated data, or obtained from the
Internet free of cost or for a price, e.g., the EMILLE corpus, a free version
of which is available for research purposes. Similarly, smaller parallel
corpora are required for tuning and for testing the model.
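A parallel corpus is simply a pair of plain-text files with one sentence per
line, where line N of the source file is the translation of line N of the
target file. A minimal sketch of this layout (the file names and the second
sentence pair are our own illustration, not taken from EMILLE):

```shell
# Create a tiny sentence-aligned parallel corpus: one sentence per line,
# line N of corpus.en corresponds to line N of corpus.hi (WX transliteration).
cat > corpus.en <<'EOF'
this is a small house .
the house is big .
EOF

cat > corpus.hi <<'EOF'
yaha eka CotA AvAsagqha Hai .
AvAsagqha badZA Hai .
EOF

# Sanity check: both sides must have the same number of lines.
wc -l corpus.en corpus.hi
```

Keeping the two files line-aligned is essential, since GIZA++ aligns words
within each sentence pair by line number.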

Data preparation

The parallel corpus is converted to a format that is suitable for GIZA++.
GIZA++ is an open source tool based on the IBM models and is used for word
alignment. Before training Moses, the following software should be downloaded:

  • SRILM <http://tinyurl.com/dx8m5m> - This toolkit, developed by SRI
    International, is used for building statistical language models.
  • GIZA++ <http://tinyurl.com/cdem45> or <http://giza-pp.googlecode.com/> -
    This tool, developed by Franz Josef Och, implements the IBM models and the
    HMM model, and performs word alignment.
  • MKCLS <http://tinyurl.com/c83mpx> or <http://www.fjoch.com/mkcls.html> -
    This tool, also developed by Franz Josef Och, is used for training the
    word classes used in the SMT model. MKCLS and GIZA++ require a recent GNU
    compiler.
  • Moses <http://sourceforge.net/projects/mosesdecoder>
  • Additional scripts <http://tinyurl.com/cp8xz7> - Additional scripts for
    Moses training and tuning.

For this article we are keeping /usr/home/PCQ/demo as the root directory for

installation. The steps given below are relative to this root directory.

Getting started

Create a directory 'srilm' in the root directory, move the downloaded SRILM
tar file to this directory, extract it and run make.

Then move the GIZA++ tar file to the root directory and extract it. This
creates a directory GIZA++-v2. Run make inside GIZA++-v2 and then run make
again with the target snt2cooc.out from the same directory. This produces the
GIZA++ and snt2cooc.out binaries. Create a directory 'bin' in the root
directory and copy these two files to 'bin'.


Now move the mkcls tar file to the root directory and extract it. This creates
an mkcls-v2 directory, in which you should run make. This produces the mkcls
binary, which should also be copied to 'bin'.

Create a directory named 'moses' under the root, copy the Moses tar file to
this directory and extract it. Now change to the moses directory and execute
regenerate-makefiles.sh, then run the configure script as
./configure --with-srilm=/usr/home/PCQ/demo/srilm, and finally run make -j 4.

Now move to the 'bin' directory under the root and create a 'moses-scripts'
directory in it. Then move to the moses/scripts directory under the root and
run make release. Finally, move the scripts tar file to the root directory and
extract it. This completes the setup of Moses.
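The build steps above can be recapped as one shell session. This is a sketch
only: the tar-file names are illustrative placeholders, and the paths assume
the /usr/home/PCQ/demo root directory used throughout this article.

```shell
# Root directory used throughout this article
ROOT=/usr/home/PCQ/demo
cd "$ROOT"

# SRILM: language-model toolkit
mkdir -p srilm && cd srilm
tar xzf ../srilm.tar.gz            # tar-file name is illustrative
make
cd "$ROOT"

# GIZA++: word alignment
tar xzf GIZA++-v2.tar.gz           # creates GIZA++-v2/
cd GIZA++-v2 && make && make snt2cooc.out
mkdir -p "$ROOT/bin"
cp GIZA++ snt2cooc.out "$ROOT/bin"
cd "$ROOT"

# mkcls: word-class training
tar xzf mkcls-v2.tar.gz            # creates mkcls-v2/
cd mkcls-v2 && make
cp mkcls "$ROOT/bin"
cd "$ROOT"

# Moses decoder, linked against SRILM
mkdir -p moses && cd moses
tar xzf ../moses.tar.gz            # tar-file name is illustrative
./regenerate-makefiles.sh
./configure --with-srilm="$ROOT/srilm"
make -j 4

# Moses support scripts
mkdir -p "$ROOT/bin/moses-scripts"
cd "$ROOT/moses/scripts" && make release
```

If any make step fails, check that a recent GNU compiler is installed, since
MKCLS and GIZA++ require one.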

Advertisment

Training the Translator

Now we can process our parallel corpus. Create the directories
working-dir/corpus in the root directory and copy the English and Hindi corpus
files into them. Filter out long sentences and lowercase the English data; the
Hindi data does not need lowercasing because it is in WX format. Now create a
directory 'lm' inside the 'working-dir' directory to build the language model,
lowercasing the English data again for this step. These two steps create the
English language model data, from which the language model can be built using
SRILM. After this, the language model is ready and we can train the
translation model. After training, the model can also be tuned for better
performance; however, tuning is not mandatory.
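The lowercasing and long-sentence filtering described above can be sketched
with standard tools. The file names and the 40-word limit here are our own
illustrative choices (Moses also ships a corpus-cleaning script for this):

```shell
# Sample English corpus line, already tokenized.
printf 'This is a small House .\n' > corpus.en

# Lowercase the English text and drop sentences longer than 40 words.
tr '[:upper:]' '[:lower:]' < corpus.en \
  | awk 'NF <= 40' > corpus.lowercased.en

cat corpus.lowercased.en
# → this is a small house .
```

Lowercasing keeps 'House' and 'house' from being treated as different words,
which matters when the training corpus is small.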

For translating, an English sentence can be given as input:

echo 'This is a small House.' | /usr/home/PCQ/demo/moses/moses-cmd/src/moses -f moses.ini > out.txt

We can find its translated Hindi sentence with 'cat out.txt', which will
contain: yaha eka CotA AvAsagqha Hai.

Conclusion

If the corpus is large, building the translation model requires a lot of
memory, at least 2 GB. Besides, a few steps in the process can take from a few
minutes to a few hours depending on the processing power, memory and size of
the training corpus. The model described here is a baseline model, and
research is ongoing to improve translation results. If the corpus is large
enough, the trained model will achieve higher translation accuracy.
