Tech Explained

Many Languages on the Net

PCQ Bureau

12 Sep 2002 12:26 IST

The Internet today has to deal with multilinguality. People speak different languages and the number of natural languages along with their dialects is estimated to be close to 4,000. Of the top 100 languages in the world, English occupies the top position, with Hindi coming fifth and Marathi fourteenth.

This is where UNL (Universal Networking Language) comes in. It is a digital meta-language for describing, summarizing, refining, storing and disseminating information in a machine-independent and human-language-neutral form. UNL represents information (i.e., meaning) sentence by sentence. The information in a sentence is represented as a hyper-graph with concepts as nodes and relations as arcs. This hyper-graph is also represented as a set of directed binary relations, each between two of the concepts present in the sentence. Concepts are represented as character strings called UWs (Universal Words).

The encoded UNL is used not only for machine translation, but also for other document-processing activities. The encoding process can be looked upon as a process of knowledge extraction. The extracted knowledge is used for automatic hyperlinking, summarizing and categorizing of documents.

UNL can describe and disseminate information over the net irrespective of the language used by different people

The UNL vocabulary consists of the following.

UWs (Universal Words): Labels that represent word meaning

Relation Labels: Tags that represent the relationship between UWs

Attribute Labels: Express additional information about the UWs that appear in a sentence

A UNL expression can be seen as a UNL graph. For example,

John, who is the chairman of the company, has arranged a meeting at his residence.

The UNL for the sentence is

mod(chairman(icl>post), company)

aoj(chairman, John)

agt(arrange.@complete, John)

pos(residence, John)

obj(arrange, meeting)

plc(arrange, residence)

You can see the UNL graph for the sentence in the accompanying picture.

In the above, agt means the agent, obj the object, plc the place, pos the possessor, aoj the attributed object and mod the modifier. The detailed list of such relations can be found in the reference cited in the box on the next page. The icl construct restricts the meaning of a word. The example above shows only one such restriction, viz., chairman(icl>post).
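The set of binary relations above can be read into a small graph structure. What follows is only a minimal sketch in Python, not the tooling used in the UNL project: the function names (split_args, parse_uw, parse_relation, build_graph) are our own, and the parser assumes exactly the notation of the six lines printed above, namely label(UW1, UW2), with an optional restriction in parentheses and optional attributes introduced by ".@".

```python
import re
from collections import defaultdict

# The six relations of the example sentence, copied from the article.
UNL_LINES = [
    "mod(chairman(icl>post), company)",
    "aoj(chairman, John)",
    "agt(arrange.@complete, John)",
    "pos(residence, John)",
    "obj(arrange, meeting)",
    "plc(arrange, residence)",
]


def split_args(body):
    """Split 'uw1, uw2' on the top-level comma, ignoring commas inside parentheses."""
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"expected two arguments in {body!r}")


def parse_uw(text):
    """Return (headword, restriction, attributes) for one UW string."""
    head_and_restriction, *attrs = text.split(".@")
    match = re.fullmatch(r"([^()]+)(?:\((.*)\))?", head_and_restriction)
    if match is None:
        raise ValueError(f"cannot parse UW {text!r}")
    return match.group(1), match.group(2), ["@" + a for a in attrs]


def parse_relation(line):
    """Return (label, source UW, target UW) for one relation line."""
    label, body = line.split("(", 1)
    body = body.rstrip()[:-1]  # drop the closing parenthesis of the relation
    left, right = split_args(body)
    return label.strip(), parse_uw(left), parse_uw(right)


def build_graph(lines):
    """Map each source headword to its outgoing (label, target headword) arcs."""
    graph = defaultdict(list)
    for line in lines:
        label, (src, _, _), (dst, _, _) = parse_relation(line)
        graph[src].append((label, dst))
    return dict(graph)


graph = build_graph(UNL_LINES)
for source, arcs in graph.items():
    for label, target in arcs:
        print(f"{source} --{label}--> {target}")
```

In this sketch the nodes are keyed by headword only; the restriction (icl>post) and the attribute (@complete) are parsed but not stored on the nodes, which a fuller representation would need to do.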

Conversion to and from UNL expressions

Encoding into UNL is first of all a parsing problem. The analysis process uses a framework that performs morphological, syntactic and semantic analysis synchronously. It analyses sentences by accessing a knowledge-rich lexicon and interpreting the Analysis Rules, which essentially capture the language phenomena. Formulating the rules amounts to programming a sophisticated symbol-processing machine. Thus, the process of converting natural-language sentences into UNL involves constructing analysis rules and building a knowledge-rich lexicon linking the language words with UWs covering the extremely varied language phenomena and concepts.

An example of a UNL graph

Some examples of dictionary entries for Hindi are given below.

The attributes in the lexicon, both semantic and syntactic, are collectively called Lexical Attributes. The syntactic attributes include the word category (noun, verb, adjective, etc.) and attributes such as person and number for nouns and tense for verbs.

Decoding UNL expressions into a sentence of any target language is done using a word dictionary and the generation rules of the target language. First, syntax planning of the target words is done, after which the morphology is generated to produce a natural sentence.

Some statistics

We have constructed analysers for Hindi and English and a generator for Hindi. Work on a generator for Marathi has also been started. This required linking English, Hindi and Marathi language strings with the UWs, and writing the Analysis and Generation rules for these languages. Below is some quantitative information for English and Hindi.

Number of Entries in the Hindi-UW dictionary: 70,000

Number of Analysis Rules for English: ~5,000

Number of Analysis Rules for Hindi: ~6,000

Number of Generation Rules for Hindi: ~6,500

Other applications

Since UNL expressions can be looked upon as the knowledge extracted from documents, we have carried out research on how to use them for various document-processing tasks. Notable among these are automatic hyperlinking and text clustering. In the former, the keywords (the candidates from which links are set up) are obtained from the UNL graphs. Heavily linked words are possible candidates for keywords. Similarly, the linkage and relation-label information in the UNL graphs is used for constructing document vectors in the semantic dimension. These vectors are then processed with clustering algorithms. The experimental results are promising.
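Both ideas can be illustrated on the relations of the example sentence. The Python sketch below rests on our own simplifying assumptions: "heavily linked" is taken to mean the number of relations a concept takes part in, and the document vector is simply a count of each relation label. The article does not specify the actual features or weighting used in the experiments, and the names keyword_candidates and relation_vector are hypothetical.

```python
from collections import Counter

# (relation label, source concept, target concept), from the article's example sentence.
RELATIONS = [
    ("mod", "chairman", "company"),
    ("aoj", "chairman", "John"),
    ("agt", "arrange", "John"),
    ("pos", "residence", "John"),
    ("obj", "arrange", "meeting"),
    ("plc", "arrange", "residence"),
]


def keyword_candidates(relations, top_n=3):
    """Rank concepts by how many relations they take part in (assumed linkage measure)."""
    degree = Counter()
    for _, src, dst in relations:
        degree[src] += 1
        degree[dst] += 1
    return degree.most_common(top_n)


def relation_vector(relations, labels):
    """Count each relation label, in a fixed label order, to form a document vector."""
    counts = Counter(label for label, _, _ in relations)
    return [counts[label] for label in labels]


LABELS = ["agt", "obj", "plc", "aoj", "mod", "pos", "tim"]

print(keyword_candidates(RELATIONS, top_n=2))
print(relation_vector(RELATIONS, LABELS))
```

For this single sentence, John and arrange each take part in three relations and so rank highest. Vectors built this way for whole documents could then be passed to any standard clustering algorithm.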

UNL in India
In India, UNL work is being carried out at the Computer Science and Engineering Department, IIT Bombay. Here, we do sentence-level encoding of English, Hindi and Marathi into the UNL form and decode this information into Hindi and Marathi, thus creating a means of semi-automated translation from English to Hindi and Marathi, and also between Hindi and Marathi. For more on UNL, visit www.unl.ias.unu.edu

Present and future

UNL has been found to be very useful for various multilingual information tasks as well as document processing applications. The UNL graph is looked upon as the extracted knowledge from the documents.

The countries participating in this project are Japan, China, Indonesia, India, Jordan, Russia, Italy, France, Spain and Brazil. The United Nations Headquarters in Geneva is developing multilingual information access systems using the UNL.

At IIT Bombay, the following high-impact projects are making use of the UNL representation for various text-processing and language-technology tasks.

Multi-lingual Web

UNL can be a very effective vehicle for developing multilingual Web-based applications. The UNL expressions provide the meaning content of the text and search can be carried out on this meaning base instead of the text. This, of course, means developing a novel kind of search-engine technology. The merit of such a system is that the information in one language need not be stored in multiple languages.
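A search over the meaning base rather than the text can be pictured as a lookup over stored UNL relations. The Python sketch below is purely illustrative and is not the search engine the projects envisage: the document identifiers, the second document's relations, and the functions build_index and search are all hypothetical. It assumes that each document has already been encoded into (label, source, target) triples, whatever its original language.

```python
from collections import defaultdict

# Hypothetical meaning base: each document is stored once, as UNL relation triples.
DOCS = {
    "doc1": [  # relations from the article's example sentence
        ("agt", "arrange", "John"),
        ("obj", "arrange", "meeting"),
        ("plc", "arrange", "residence"),
    ],
    "doc2": [  # invented for illustration only
        ("agt", "attend", "Mary"),
        ("obj", "attend", "meeting"),
    ],
}


def build_index(docs):
    """Inverted index from a relation triple to the documents containing it."""
    index = defaultdict(set)
    for doc_id, relations in docs.items():
        for relation in relations:
            index[relation].add(doc_id)
    return index


def search(index, label=None, src=None, dst=None):
    """Return documents with a relation matching the pattern; None matches anything."""
    hits = set()
    for (rel_label, rel_src, rel_dst), doc_ids in index.items():
        if label in (None, rel_label) and src in (None, rel_src) and dst in (None, rel_dst):
            hits |= doc_ids
    return sorted(hits)


index = build_index(DOCS)
print(search(index, label="obj", dst="meeting"))  # documents where a meeting is the object
print(search(index, src="arrange"))               # documents about arranging something
```

Because the query is expressed in concepts and relations, the same lookup would serve users of any language once their query has been encoded into UNL.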

The Center for Indian Language Technology Solutions (www.cse.iitb.ac.in/tukaram) funded by the Ministry of Information Technology, India.

The Center for Intelligent Internet Research (www.cse.iitb.ac.in/laiir) funded by Tata Consultancy Services.

Media Lab Asia (www.ircc.iitb.ac.in/~MLAsia), funded by the Ministry of Information Technology, India and with participation from the Massachusetts Institute of Technology, USA.

Commercial-level exploitation of UNL technology for Internet-scale multilingual access is expected to happen in a couple of years' time.

Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay
