11. LANGUAGE TECHNOLOGY

11.0 Introduction

Given India's complex linguistic situation and the need to build software systems for Indian languages, the government has, over the last couple of decades, launched IT initiatives in Indian languages and knowledge-based systems at many technology centres, universities and institutes. These were funded by the Technology Development for Indian Languages (TDIL) programme of the Ministry of Information Technology (MIT) and also by the UNDP. The University Grants Commission (UGC) also started supporting minor and major research projects involving the development of linguistic parsers and machine translation. Mention may be made of the Indian Institutes of Technology (IITs), Indian Institutes of Information Technology (IIITs), the Centre for Development of Advanced Computing (C-DAC), the Indian Institute of Science (IISc), the Indian Statistical Institute (ISI), Jawaharlal Nehru University (JNU), Mahatma Gandhi International Hindi University (MGIHU), the major Sanskrit universities and other institutes for their significant contributions to this field. Private enterprises such as the Tata Institute of Fundamental Research (TIFR) and Tata Consultancy Services (TCS) have also funded Indian language technology R&D.

11.1 The Indian English Corpora

As one will notice, very little work has been done on Indian English (IE) per se in terms of modern technology, though the Indian enterprise in computational linguistics has addressed some specific problems in the areas of Indian health, officialese and journalism. The Kolhapur Corpus of Indian English (Shastri 1985; 1988) contains 1 million words of written Indian English from the year 1978. Its texts were selected from the same text categories as those of the Brown Corpus, and the corpus is available from ICAME.

11.2 TDIL Programme

Under the Technology Development for Indian Languages (TDIL) Programme, a number of development activities have been initiated since 1991. These are in the areas of text corpora for various languages, Machine Aided Translation among various languages, Speech Synthesis Systems, Optical Character Recognition Systems, etc.

1. Corpora Development:

Machine-readable corpora of Indian English have been developed and are centrally maintained at the Central Institute of Indian Languages (CIIL), Mysore. A variety of software has also been developed for grammatical tagging of corpora, word counts and frequency counts. Spell checkers for some of these languages have been developed simultaneously.
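The word count and frequency count mentioned above can be illustrated with a minimal sketch. It is not the software developed under the TDIL Programme; the function name and the sample sentence are assumptions made for illustration, and the tokenizer handles plain English text only (Indic scripts would need a script-aware tokenizer).

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lower-cased English word to its frequency."""
    # Assumption: a "word" is a run of the letters a-z after lower-casing,
    # so digits and punctuation are ignored.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, not taken from any actual corpus.
sample = "The corpus is a corpus of written Indian English. The texts date from 1978."
freq = word_frequencies(sample)

total_words = sum(freq.values())  # word count (tokens)
distinct_words = len(freq)        # number of distinct word forms (types)

# Print the words in descending order of frequency.
for word, count in freq.most_common():
    print(word, count)
```

A real corpus tool would add grammatical tags to each token before counting, so that frequencies could be reported per word class as well as per word form.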

2. Machine Aided Translation:

According to Bharati, Chaitanya and Sangal (Internet), computers are increasingly being used for machine translation and lexical resources not only across Indian languages but also between English and Indian languages. There are at least three systems available for translation from English to Hindi. In the Anusaaraka systems, the idea is to use the machine for linguistic analysis and word-for-word translation, leaving the task of interpretation to the reader.

(a) A prototype Machine Aided Translation System for Human Aided Translation of English news stories to Hindi has been developed, and it is proposed to operationalise it for deployment as a usable product for press agencies.

(b) A Machine Aided Translation System for translation from English to Hindi in the specific domain of Public Health Campaign has been developed and is being installed at user sites for field-testing.

The September 2001 issue of Vishvabharat lists some of the current projects:

§ MaTra - Human Aided Machine Translation System for English-Hindi, primarily for translation of official letters of appointment; it specializes in officialese.

§ MANTRA - Machine Assisted Translation Tool from English to Hindi; this system caters largely to the newspaper industry, having a specific lexicon for each domain, e.g. sports, judiciary, politics etc. For example, the word 'party' in the context of a news story about elections would mean a 'political entity'.

§ ANUSAARAKA - This seems to be a more versatile system. It is based on a division of labour between machine and man: in its simplest and most robust form, it assigns the role of interpretation to man and that of syntax to machine. It uses the University of Pennsylvania's XTAG-based super-tagger and light dependency analyzer.
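The domain-specific lexicon described for MANTRA above can be sketched as a lookup that prefers the sense recorded for the current domain and falls back to a general lexicon. This is only an illustration of the idea, not MANTRA's actual data structure; the table contents, the 'social gathering' sense and the function name are hypothetical.

```python
# Hypothetical lexicon: each domain maps a source word to the sense to be used
# in that domain. Only the 'political entity' sense comes from the text above.
DOMAIN_LEXICON = {
    "politics": {"party": "political entity"},
    "general": {"party": "social gathering"},
}


def lookup_sense(word, domain):
    """Return the sense of `word` for `domain`, falling back to the general lexicon."""
    domain_entries = DOMAIN_LEXICON.get(domain, {})
    if word in domain_entries:
        return domain_entries[word]
    # No domain-specific entry: use the general sense, or None if the word is unknown.
    return DOMAIN_LEXICON["general"].get(word)


print(lookup_sense("party", "politics"))  # sense used in an election story
print(lookup_sense("party", "sports"))    # no sports entry, general sense is used
```

In a working system the selected sense would then be mapped to the appropriate Hindi equivalent; keeping the domain tables separate lets each be extended without disturbing the others.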

11.3 Language Technology Research at I.I.T., Kanpur

In 1990-92, Professor R.M.K. Sinha conceptualized the design of a Machine Aided Translation system for translation from English to Indian languages. This system was named ANGLABHARTI, and the underlying methodology was named the ANGLABHARTI Technology or ANGLABHARTI Approach.

In 1992-94, IITK implemented the Anglabharti system in a SunOS environment for translation from English to Hindi. All the modules of the system were implemented, tested and demonstrated. During 1995-97, the Department of Electronics, Govt. of India, sanctioned a grant-in-aid for a project titled "Machine Aided Translation from English to Hindi for standard documents (domain of Public Health Campaign) based on ANGLABHARTI approach". ERDC (with its office at Lucknow, since moved to NOIDA) was associated with the project for the implementation and commercialization of this software on a PC platform in the domain of public health campaigns. The ANGLABHARTI software already developed by IITK on the Sun system was used in this project and was re-engineered for the PC under Linux jointly by IITK and ERDC, under the supervision of Prof. R.M.K. Sinha.

In 1995-96, IITK also designed and developed an example-based approach to Machine Aided Translation, both between similar languages (the Indian languages) and between dissimilar languages (English and the Indian languages), under the leadership of Professor R.M.K. Sinha. This approach has been named the ANUBHARTI approach. Currently, AnglaHindi, the English to Hindi MAT system based on the Anglabharti methodology, which accepts unconstrained text, has been made available to users and has been very well received. HindiAngla, the Hindi to English MAT system based on the Anubharti methodology, has been demonstrated for simple sentences, and further work is going on to handle compound and complex sentences.

IIT Kanpur, in association with the Technology Development for Indian Languages (TDIL) Programme of the Govt. of India, has recently taken an initiative to make the AnglaBharti Technology available to all thirteen Resource Centres in the country. These Resource Centres have been established across the country to develop Indian language technology solutions in their regional languages. The centres will develop MAT systems from English to their assigned languages using the AnglaBharti technology.

11.4 C-DAC Programme

Under the knowledge-based computer systems project of the DOE, C-DAC developed VYAKARTA, which could parse English and many Indian languages. As indicated above, C-DAC used the same parser to develop MANTRA (a machine assisted translation tool for translating official language sentences from English to Hindi). MANTRA was demonstrated to the Department of Official Languages, which financed the project entitled `English to Hindi Computer assisted Translation System' for administrative purposes. The aim of the project was to design, develop and implement a computer assisted translation system for personnel administration. The system is now able to translate letters and circulars, such as appointment letters and transfer orders, and is also capable of taking input from standard word processing and DTP packages. The system so developed will be available as open source software. It is connected to a server which can be accessed by anyone on the Internet using a browser. All that the user has to do is send the English text, and the server sends back the translated text in the language requested. C-DAC is also working on a domain-specific translated chat application. Here, each user can select a language, and all communication will reach that user in the selected language. This means that if one user selects Hindi and the other selects English, the first user will receive all messages in Hindi even though the other person types in English.

Creating Enterprise Machine Translation Systems

Winfield Scott Bennett has laid down certain guidelines to be met by an Enterprise Machine Translation (EMT) system.

1. The first step is to determine the user group and find out from its representatives just what they need and want from an MT system. An EMT system must meet the needs of the user group; otherwise it will not be accepted at all, since MT is not an essential technology. It is better if the user group is made a part of the process of developing such a system.

2. If an EMT system is to be used for a particular domain, such as banking, insurance or agriculture, the time needed to create the system is considerably reduced, because the system dictionary does not have to cover the terminology of other domains and the grammar does not have to parse structures found only in other domains. This is particularly attractive for fields which have developed or adopted a controlled language in their documentation, such as the medical or legal fields.

3. All MT systems are fueled by dictionary data. Even in a system with a large dictionary, there will still be an immediate need to add to the dictionary. A customized or customizable dictionary is essential to the success of the application. Unless the developer is prepared to upgrade the dictionary for each client constantly on demand, the system must include a way to build and maintain the dictionary. This type of work is well suited to computational lexicographers. A development effort should include one or more computational lexicographers, whose voice in matters concerning the dictionary should be fully heeded.

4. MT developers too often think entirely in terms of the technology of the system. The result may be great technology which simply does not sell. In the world of EMT systems, no one buys technology as such. Therefore, the system must be a PRODUCT. The hard fact is that people will choose "easy to use and less than perfect" systems over "hard to use and outstanding" systems. EMT systems must provide the users with all the functionality that they need.

The plan for developing an EMT system must include plans for the best possible interfaces and tools to meet the needs of the user. Such systems are not 'plug and play'. One has to be trained to use them, and for that customer support is necessary. It can take two forms: documentation and human interaction. The documentation should start very early in the development process of the system and be continually updated throughout. EMT systems will not succeed if the customers cannot use them to their satisfaction, and their satisfaction rests with the customer support.

If these points are fully addressed in the planning stage of the development effort, the resulting EMT system should have a reasonable chance of success. Judged by their objectives, the Machine Aided Translation or Machine Translation systems from English to Hindi do fall under the category of Enterprise Machine Translation systems, but they have not tried to meet the above requirements. Nor has the funding agency intended, on completion of a project, to evaluate whether the systems meet their stipulated objectives. In order to keep pace with international developments, systems originally planned as domain-specific have been put on the Web, although a system on the Web does not have to meet the requirements of a 'translation tool'; rather, such systems are used as a 'communication tool'. The output of these systems is far from satisfactory and still needs a lot of work.

A Multi-disciplinary Enterprise

Machine Translation is a multi-disciplinary activity, and an EMT system even more so. There are established roles for Computer Programmers, Computational Linguists, Computational Lexicographers and expert Translators. Moreover, this enterprise is primarily a linguistic enterprise. The Linguist has to analyse the source and target languages, and on the basis of that analysis the parser and generator are to be developed by Computer Programmers. These Parsers and Generators are to be continuously tested, and the problems are to be removed by the developers in consultation with the language expert. There is a need for constant testing, debugging and upgrading by the team of developers.

The bilingual transfer Dictionary has to be developed by language experts, including Lexicographers, while the Computer Scientists have to develop a program for the dictionary design and development. The translation equivalents of a particular word are to be hierarchised according to the frequency of their usage. The disambiguation techniques are to be worked out first linguistically and then computationally.
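Ordering translation equivalents by frequency of usage, as described above, can be sketched as a simple sort. The transliterated Hindi equivalents of 'water' and, in particular, their usage counts are hypothetical values chosen for illustration; in practice the counts would come from corpus analysis by the lexicographers.

```python
def rank_equivalents(usage_counts):
    """Return translation equivalents ordered from most to least frequently used."""
    # Sort by descending count; ties are broken alphabetically so the order is stable.
    ordered = sorted(usage_counts.items(), key=lambda item: (-item[1], item[0]))
    return [equivalent for equivalent, _count in ordered]


# Hypothetical usage counts for transliterated Hindi equivalents of "water".
water_equivalents = {"jal": 45, "paani": 120, "neer": 5}

ranked = rank_equivalents(water_equivalents)
default_choice = ranked[0]  # the equivalent offered first by the dictionary

print(ranked)
```

The first entry serves as the default translation, while the lower-ranked entries remain available to the disambiguation rules or to a human post-editor.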

The Phrasal Dictionary is very important for the success of an MT system. It, too, has to be developed with the active support of the Linguist and the Lexicographer. In all, the linguistic part of such an enterprise comes to roughly 70% and the computational part to roughly 30%. The success of a system depends on sound linguistic inputs in the form of language analysis, formulation of rules, testing of the output and reformulation of rules in case clashes occur.

Corpora, especially Parallel Corpora, can first be linguistically analyzed for the development of the Phrasal Lexicon, which is very helpful for the success of an EMT system.

Test Battery based on Health and IT domains

In order to test the efficiency of the MAT system developed by IITK and ERDCI Noida, a test battery of about 50 sentences has been prepared. The sentences were translated using the version of the system available on the Web. The System could not take a paragraph, or even a few sentences, in one go. Each sentence was therefore translated individually, and the resulting translation has been appended below its English sentence. The tested sentences are appended in the Annexure for reference.
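The sentence-by-sentence procedure described above can be sketched as a small test harness. It is an illustration under stated assumptions: the sample sentences are invented (they are not those in the Annexure), the sentence splitter is naive, and `placeholder_translate` merely stands in for a call to the MAT system, whose actual interface is not described here.

```python
import re


def split_sentences(paragraph):
    """Naively split a paragraph after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def run_battery(sentences, translate):
    """Translate one sentence at a time; place each output below its source sentence."""
    report_lines = []
    for sentence in sentences:
        report_lines.append(sentence)
        report_lines.append(translate(sentence))
        report_lines.append("")  # blank line between test items
    return "\n".join(report_lines)


def placeholder_translate(sentence):
    # Hypothetical stand-in for the MAT system; it performs no real translation.
    return "<translation of: " + sentence + ">"


# Hypothetical health-domain sentences for illustration only.
paragraph = "Wash your hands before meals. Boil drinking water. Is the clinic open today?"
sentences = split_sentences(paragraph)
report = run_battery(sentences, placeholder_translate)
print(report)
```

Keeping the translator as a parameter means the same battery can be re-run unchanged after each revision of the system, so that outputs can be compared across versions.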

Some Observations

From the resulting translations it is observed that the System needs a lot of revision and updating. As a general observation, the System cannot handle a simple sentence with more than two prepositional phrases. There appear to be shortcomings at the Parsing level as well as at the Generation level.

Normally a team comprising Computer Programmers, Linguists, Lexicographers and Translators is formed to work on an EMT system, whereas the Indian experience is quite different. Most of the projects have been carried out by computational institutes, which have failed to involve linguists. There is no lexicographer or active translator associated with such an enterprise. The output reflects this shortcoming, i.e. the absence of an inter-disciplinary team. Normally a data entry operator coming from the Hindi-speaking area is expected to perform the function of a linguist.

Suggestions

The funding agency should evaluate each project periodically, and its development should be monitored from the following points of view:

(1) Validation of the Lexicon. (2) Parsing of the sentences. (3) Hierarchy of the dictionary-entries. (4) Evaluation of the Phrasal Dictionary content.

If possible, the funding agency should have an In-house Evaluation Team comprising the Inter-disciplinary Experts mentioned above. Moreover, before the completion of a project is accepted, a thorough appraisal of the project should be carried out, keeping in view its objectives and the targets achieved.

Conclusion

The marriage of languages with advances in IT has ushered in a new information age in which the globe has shrunk and information has become the most valuable commodity, responsible for progress and prosperity.

Owing to liberal funding from the government, IT R&D and applications for Indian languages have multiplied manifold in the last two decades.