The Internet today has to deal with multilinguality. People speak different languages and the number of natural languages along with their dialects is estimated to be close to 4,000. Of the top 100 languages in the world, English occupies the top position, with Hindi coming fifth and Marathi fourteenth.
This is where UNL (Universal Networking Language) comes in. It is a digital meta-language for describing, summarizing, refining, storing and disseminating information in a machine-independent and human-language-neutral form. UNL represents information (i.e., meaning) sentence by sentence. Sentence information is represented as a hyper-graph having concepts as nodes and relations as arcs. This hyper-graph is also represented as a set of directed binary relations, each between two of the concepts present in the sentence. Concepts are represented as character strings called UWs (Universal Words).
The encoded UNL is used not only for machine translation, but also for other document-processing activities. The encoding process can be looked upon as a process of knowledge extraction. The extracted knowledge is used for automatic hyperlinking, summarizing and categorizing of documents.
UNL can describe and disseminate information over the net irrespective of the language used by different people
The UNL vocabulary consists of the following.
UWs (Universal Words): Labels that represent word meaning
Relation Labels: Tags that represent the relationship between UWs
Attribute Labels: Express additional information about the UWs that appear in a sentence
A UNL expression can be seen as a UNL graph. For example,
John, who is the chairman of the company, has arranged a meeting at his residence.
The UNL for the sentence is
mod(chairman(icl>post), company)
aoj(chairman, John)
agt(arrange.@complete, John)
pos(residence, John)
obj(arrange, meeting)
plc(arrange, residence)
You can see the UNL graph for the sentence in the accompanying picture.
In the above, agt means the agent, obj the object, plc the place, aoj the attributed object and mod the modifier. The detailed list of such relations can be found in the reference cited in the inbox on the next page. The icl construct helps restrict the meaning of a word. The example above shows only one such restriction, viz., chairman(icl>post).
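The relation list above can be loaded into a small graph structure to make the "concepts as nodes, relations as arcs" idea concrete. The following is a minimal sketch, not the parser used in the project: the function names parse_unl and split_term are hypothetical, the regular expressions cover only the notation that appears in this example, and treating chairman(icl>post) and chairman as the same node is an assumption made for illustration.

```python
import re

# Relation lines copied from the example sentence above.
UNL_TEXT = """\
mod(chairman(icl>post), company)
aoj(chairman, John)
agt(arrange.@complete, John)
pos(residence, John)
obj(arrange, meeting)
plc(arrange, residence)
"""

# label(source, target); assumes a single top-level comma per relation.
RELATION = re.compile(r"^(\w+)\((.+),\s*(.+)\)$")
# headword, optional (restriction), optional .@attribute suffixes.
TERM = re.compile(r"^([^(.]+)(?:\(([^)]*)\))?((?:\.@\w+)*)$")


def split_term(term):
    """Split one relation argument into (headword, restriction, attributes)."""
    m = TERM.match(term)
    if m is None:
        raise ValueError(f"unrecognised term: {term!r}")
    attributes = [a for a in m.group(3).split(".@") if a]
    return m.group(1), m.group(2), attributes


def parse_unl(text):
    """Return (nodes, edges) for a block of binary relations.

    nodes maps a headword to its restriction and attribute set;
    edges is a list of (label, source headword, target headword).
    """
    nodes, edges = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = RELATION.match(line)
        if m is None:
            raise ValueError(f"not a binary relation: {line!r}")
        ends = []
        for term in (m.group(2), m.group(3)):
            head, restriction, attrs = split_term(term.strip())
            # Assumption: terms with the same headword are the same node.
            node = nodes.setdefault(head, {"restriction": None, "attributes": set()})
            if restriction:
                node["restriction"] = restriction
            node["attributes"].update(attrs)
            ends.append(head)
        edges.append((m.group(1), ends[0], ends[1]))
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = parse_unl(UNL_TEXT)
    for label, src, tgt in edges:
        print(f"{src} --{label}--> {tgt}")
```

Each printed line is one arc of the graph, directed from the first argument of the relation to the second.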
Conversion to and from UNL expressions
Encoding into UNL is first of all a parsing problem. The analysis process uses a framework that performs morphological, syntactic and semantic analysis synchronously. It analyses sentences by accessing a knowledge-rich lexicon and interpreting the Analysis Rules, which essentially capture the language phenomena. Formulating the rules amounts to programming a sophisticated symbol-processing machine. Thus, converting natural-language sentences into UNL involves constructing analysis rules and building a knowledge-rich lexicon that links the words of the language with UWs, covering extremely varied language phenomena and concepts.
Some examples of dictionary entries for Hindi are given below.
The attributes in the lexicon, both semantic and syntactic, are collectively called Lexical Attributes. The syntactic attributes include the word category (noun, verb, adjective, etc.) and attributes such as person and number for nouns and tense for verbs.
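One way to picture a lexicon that links language words to UWs and carries lexical attributes is as a list of small records. The sketch below is only an illustration under stated assumptions: the LexiconEntry class, the lookup function and the attribute names are hypothetical, the entries use English words from the example sentence rather than real Hindi dictionary entries, and this is not the dictionary format used by the UNL tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    headword: str                       # word in the natural language
    uw: str                             # Universal Word it is linked to
    category: str                       # syntactic category: noun, verb, ...
    attributes: frozenset = frozenset() # other lexical attributes (hypothetical names)


# Illustrative entries only; the UWs are taken from the example sentence.
LEXICON = [
    LexiconEntry("chairman", "chairman(icl>post)", "noun",
                 frozenset({"singular", "third_person"})),
    LexiconEntry("meeting", "meeting", "noun", frozenset({"singular"})),
    LexiconEntry("arrange", "arrange", "verb"),
]


def lookup(word, category=None):
    """Return all entries for a word, optionally filtered by category."""
    return [
        entry for entry in LEXICON
        if entry.headword == word and (category is None or entry.category == category)
    ]


if __name__ == "__main__":
    for entry in lookup("chairman"):
        print(entry.headword, "->", entry.uw, sorted(entry.attributes))
```

An analyser would consult such entries while applying its rules; filtering by category is one simple way the syntactic attributes can narrow down the candidate UWs for a word.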
Decoding UNL expressions into a sentence of any target language is done using the word dictionary and the generation rules of the target language. First, the syntax of the target words is planned, after which the morphology is generated to produce a natural sentence.
Some statistics
We have constructed analysers for Hindi and English and a generator for Hindi. Work on a generator for Marathi has also been started. This required linking English, Hindi and Marathi language strings with the UWs. The Analysis and Generation rules for these languages also had to be written. Below is some quantitative information for English and Hindi.
Number of Entries in the Hindi-UW dictionary: 70,000
Number of Analysis Rules for English: ~5,000
Number of Analysis Rules for Hindi: ~6,000
Number of Generation Rules for Hindi: ~6,500
Other applications
Since UNL expressions can be looked upon as the knowledge extracted from documents, we have carried out research on how to use them for various document-processing tasks. Notable among these are automatic hyperlinking and text clustering. In the former, the keywords, which are the candidates for setting up links from, are obtained from the UNL graphs. Heavily linked words are possible candidates for keywords. Similarly, the linkage and relation-label information in the UNL graphs is used for constructing document vectors in the semantic dimension. These vectors are then processed with clustering algorithms. The experimental results are promising.
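Both ideas can be sketched on the example graph from earlier in the article. The code below is a rough illustration under stated assumptions, not the method used in our experiments: counting links per node stands in for "heavily linked", the (relation label, UW) pairs used as vector features are an assumed choice, cosine similarity is just one measure a clustering algorithm could use, and the function names are hypothetical.

```python
import math
from collections import Counter

# Edges of the example sentence as (relation label, source, target).
EDGES = [
    ("mod", "chairman", "company"),
    ("aoj", "chairman", "John"),
    ("agt", "arrange", "John"),
    ("pos", "residence", "John"),
    ("obj", "arrange", "meeting"),
    ("plc", "arrange", "residence"),
]


def keyword_candidates(edges, top_n=3):
    """Rank nodes by how many relations they take part in."""
    degree = Counter()
    for _label, src, tgt in edges:
        degree[src] += 1
        degree[tgt] += 1
    return degree.most_common(top_n)


def document_vector(edges):
    """Count (relation label, UW) pairs; an assumed feature choice."""
    vector = Counter()
    for label, src, tgt in edges:
        vector[(label, src)] += 1
        vector[(label, tgt)] += 1
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(keyword_candidates(EDGES, top_n=2))
    # A second, made-up one-relation document, used only for comparison.
    other = [("agt", "arrange", "Mary")]
    print(cosine(document_vector(EDGES), document_vector(other)))
```

In this graph, John and arrange each take part in three relations, so they come out as the strongest keyword candidates. With many documents, the pairwise similarities would be passed to a clustering algorithm.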
Present and future
UNL has been found to be very useful for various multilingual information tasks as well as document processing applications. The UNL graph is looked upon as the extracted knowledge from the documents.
The countries participating in this project are Japan, China, Indonesia, India, Jordan, Russia, Italy, France, Spain and Brazil. The United Nations Headquarters in Geneva is developing multilingual information access systems using UNL.
At IIT Bombay, the following high-impact projects are making use of the UNL representation for various text-processing and language-technology tasks.
The Center for Indian Language Technology Solutions (www.cse.iitb.ac.in/tukaram) funded by the Ministry of Information Technology, India.
The Center for Intelligent Internet Research (www.cse.iitb.ac.in/laiir) funded by Tata Consultancy Services.
Media Lab Asia (www.ircc.iitb.ac.in/~MLAsia), funded by the Ministry of Information Technology, India and with participation from the Massachusetts Institute of Technology, USA.
Commercial-level exploitation of UNL technology for Internet-scale multilingual access is expected to happen in a couple of years' time.
Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay