XI. Technology

The Department of Information Technology initiated the TDIL (Technology Development for Indian Languages) Mission with the objective of developing Information Processing Tools and Techniques to facilitate human-machine interaction without language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services.

The development of Language Technology for Indian languages, and especially for Hindi, started with the establishment of the Mission for Technology Development in Indian Languages (TDIL) under the Department of Electronics in 1991. Thereafter, many activities were started under the Mission, which can be described under the following heads:

A. CORPUS AND CORPUS MANAGEMENT TOOLS

A corpus of any language is an assorted collection of written texts in that language. A machine-readable corpus is therefore a compilation of such texts which can be stored, manipulated and retrieved as and when required with the help of a computer. The steps involved in building a corpus are selection of texts, data entry, data validation and development of a set of tools for management and retrieval of data.

Corpora can be used in a wide variety of applications since they provide authentic data on contemporary use of Indian languages to the following categories of users:

Linguists working in the area of standardization, pedagogy, lexicography, translation, and linguistic analysis such as morphological analysis, syntactic/semantic analysis, sentence generation etc.

Computer Scientists working in the area of machine translation and utility software development such as building of electronic dictionaries, computational lexicons, sentence analysis and generation, spell checkers etc.

Developers who need a test bed for most ILP applications, tools and solutions.

Considering the richness of Indian languages, it was decided in 1991 to develop a corpus of three million words in each of the fifteen constitutionally accepted languages, including Hindi. Accordingly, the development of the Hindi corpus was entrusted to IIT Delhi.

The sources of the Hindi corpus are printed books, journals, magazines, newspapers and government documents published during 1981-1990. It has been categorized into six main categories, viz. Aesthetics; Social Sciences; Natural, Physical & Professional Sciences; Commerce; Official and Media Languages; and Translated Material. Software tools for word-level tagging, word count, letter count and frequency count have also been developed. The tag set developed by ERDCI, Noida and IIIT, Hyderabad consists of Finite Verb (FV), Non-finite Verb (NV), Noun (NN), Pronoun (PN), Adjective (Adj), Adverb (Adv) and Indeclinable (ID). Corpus Manager and KWIC Concordance software have also been developed. About thirty lakh (three million) words of machine-readable corpora have been developed in Hindi. Statistical frequency counting of various linguistic elements in the corpus has been accomplished. The corpus was cleaned and POS tagged by ERDCI, Noida and KHS, Agra during 2000 with the help of the Morphological Analyzer developed by the NLP group at IIT Kanpur and later at the University of Hyderabad. This corpus is now being maintained by the Central Institute of Indian Languages, Mysore.
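The frequency count and KWIC (keyword-in-context) concordance tools mentioned above can be illustrated with a minimal sketch. This is not the TDIL software; the function names and the romanized sample sentence are assumptions made only for illustration.

```python
from collections import Counter

def word_frequencies(tokens):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokens)

def kwic(tokens, keyword, width=2):
    """Return keyword-in-context lines: `width` words on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical romanized sample text; a real corpus would be read from files.
sample = "ram ghar gaya aur ram ne khana khaya"
tokens = sample.split()
freq = word_frequencies(tokens)
concordance = kwic(tokens, "ram")
```

Here freq holds the count of every word form and concordance lists each occurrence of the keyword with two words of context on either side, which is the kind of view a lexicographer would use to inspect usage.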

KHS, Agra, in collaboration with the Central Institute of Indian Languages, Mysore, has undertaken the development of a Hindi corpus of 20 million words. The work of POS tagging, chunking and syntactic annotation is being carried out at IIIT, Hyderabad with a view to utilizing it for the development of a Machine Translation system from English to Hindi.

Parallel corpora (English-Hindi) are also being collected by various organizations: Web Duniya; IIIT, Hyderabad, under its project on Development of Lexical Resources; and the Central Institute of Indian Languages. CDAC, Noida has also undertaken the development of parallel corpora between Hindi and other Indian languages under its Gyan Nidhi project, on the basis of books published by the National Book Trust. This is only a collection of text; tagging, annotation and text processing tools have not been developed for these corpora.

Text alignment tools are under development for parallel corpora (English-Hindi) at IIIT, Hyderabad.

B. Text Editors and Word Processors

Hindi Word Processors

Hindi word processors have been developed by various organizations, starting from Siddharth (DCM, 1983) and Lipi (Hinditronics, 1983). ISM, iLEAP and Leap Office (CDAC, Pune) have been developed since 1991 under GIST, and there are many other word processors for Hindi such as Shreelipi, Sulipi, APS and Akshar.

GIST (Graphics and Intelligence Based Script Technology)

CDAC, Pune pioneered the GIST technology, which facilitates the use of Indian languages in Information Technology. It uses the Indian Script Code for Information Interchange (ISCII) for encoding, special fonts (ISFOC) for representation on screen and printer, and a common keyboard layout for different scripts (INSCRIPT).

GIST has played an important role in some projects of national importance such as Land Records, Election I-cards, Indian Railway Ticketing etc. Based on this technology, a number of products have been developed, starting from GIST cards and GIST terminals. GIST has also developed a number of utilities such as spell checkers, thesaurus and OCX controls. GIST is now geared for the Internet-enabled world where all activities are gradually going online.

iLEAP is an Internet-ready Indian language word processor on Windows. It has the following features:

Self-explanatory user interface
Multilingual Spellchecker
Choice of keyboard layout
Email facility for Indian languages
Facility to make web pages in Indian languages
Language sensitive multilingual editor

www.cdacindia.com/html/gist/products/ileap/asp

Hindi Fonts

There is a multiplicity of Hindi fonts; important among them are TT Yogesh, TT Surekh, Shusha, Krutidev, Chanakya and Mangal. This multiplicity is the biggest problem, since a document written in one font is not readable by others who have different fonts on their computers. All these fonts are ISCII fonts. Unicode fonts have now been developed but are yet to be adopted by the different developers. DTP tools are also available for Hindi. The publishing industry uses the Krutidev and Chanakya fonts.

UNICODE

Unicode is increasingly being accepted as a standard for information interchange worldwide, as most of the major IT companies have developed support for it. Unicode for Indian languages is based on ISCII-88 and not on ISCII-91, which is the latest standard. It was felt necessary that the Indian Government should be represented in the Unicode Consortium to seek the necessary modifications in the code pertaining to Indian language scripts, and hence the Department of Information Technology became a full member of the Unicode Consortium with voting rights.

The 16-bit (2-byte) Unicode standard is the universal character encoding standard used for representation of text for computer processing. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world, together with information about the characters and their use. It is very useful for computer users who deal with multilingual text, such as business people, linguists, researchers, scientists, mathematicians and technicians. Unicode uses a 16-bit encoding that provides code points for more than 65,000 characters (65,536), and assigns each character a unique numeric value and name. The Unicode Standard and the ISO 10646 standard provide an extension mechanism called UTF-16 that allows for encoding as many as a million additional characters. Presently the Unicode Standard provides codes for 49,194 characters.
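The unique numeric value and name assigned to each character, and the UTF-16 extension mechanism, can be seen in a short sketch using Python's standard unicodedata module. The helper names describe and utf16_units are hypothetical; the surrogate arithmetic follows the published UTF-16 definition.

```python
import unicodedata

def describe(ch):
    """Return (code point in U+XXXX form, official Unicode character name)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

def utf16_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                      # fits in a single 16-bit unit
    cp -= 0x10000                        # 20 bits remain
    return [0xD800 + (cp >> 10),         # high surrogate: top 10 bits
            0xDC00 + (cp & 0x3FF)]       # low surrogate: bottom 10 bits

info = describe("\u0905")                # a Devanagari letter
single = utf16_units("\u0905")           # one 16-bit unit
pair = utf16_units("\U00010000")         # first code point beyond 16 bits
max_code_points = 0x10FFFF + 1           # 1,114,112 positions in total
```

Devanagari characters all fit in a single 16-bit unit; only characters beyond the first 65,536 positions need the two-unit surrogate pair, which is how UTF-16 reaches roughly a million further code points.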

Microsoft Word 2003 has come up with Unicode compliant Hindi font TT Yogesh with Inscript keyboard.

C. Dictionary Tools

English to HINDI Dictionary (Shabdanjali):

This is an English - Hindi Dictionary developed by International Institute of Information Technology, Hyderabad.

	Address: (www.iiit.net)

English to HINDI Dictionary:

Developed by Banasthali Vidyapeeth, Banasthali under AKAMKSHA project.

Morphological analyzers/generators

The morphological analyzer tool takes a derived word as input and separates it into the root and its corresponding suffixes. The function of each suffix is indicated. It has two modules: a noun analyzer and a verb analyzer.

The morphological generator does the opposite of the morphological analyzer: it generates words. Like the morphological analyzer, it has two modules, a noun generator and a verb generator. In the noun generator, a root noun, plural marker, oblique form, case marker and postpositions are given as inputs. In the verb generator, a root verb, tense marker, relative participle suffix, verbal participle suffix, auxiliary verb, and number, person and gender markers are given as inputs.

The Morphological Analyzer and Generator were developed by the Aksharbharati Group while developing Anusarakas from Telugu, Kannada, Bengali, Marathi and Punjabi to Hindi at CALTS, Central University, Hyderabad and later at IIIT, Hyderabad. (Itrc@iiit.net)
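The relationship between the noun analyzer and the noun generator can be sketched with a toy paradigm table. Everything here is an illustrative assumption: the single paradigm (masculine nouns ending in the vowel sign aa), the two-word lexicon and the feature labels are not taken from the actual tools, which cover far more paradigms as well as case markers and postpositions.

```python
# Toy paradigm for Hindi masculine nouns ending in "ा" (an illustrative assumption).
PARADIGM = {
    "direct singular": "ा",
    "oblique singular": "े",
    "direct plural": "े",
    "oblique plural": "ों",
}
LEXICON = {"कमरा", "बेटा"}   # hypothetical root list: "room", "son"

def generate(root, feature):
    """Noun generator: replace the root's final "ा" with the suffix for `feature`."""
    if root not in LEXICON or not root.endswith("ा"):
        raise ValueError(f"unknown root: {root}")
    return root[:-1] + PARADIGM[feature]

def analyze(word):
    """Noun analyzer: return every (root, feature) pair that could yield `word`."""
    analyses = []
    for feature, suffix in PARADIGM.items():
        if word.endswith(suffix):
            root = word[:-len(suffix)] + "ा"
            if root in LEXICON:
                analyses.append((root, feature))
    return analyses

plural = generate("कमरा", "oblique plural")
readings = analyze(generate("कमरा", "direct plural"))
```

The generator maps a root and features to one surface form, while the analyzer may return several readings for the same form (here the oblique singular and the direct plural coincide), which is why later stages such as a parser are needed to choose among them.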

Hindi WordNet

This is a semantic net of synonym sets (synsets) with corresponding roots, etymology, lexical category and inter-synset relationships. The relationships are based on the Princeton WordNet, with augmentation. More than 10,000 synsets have been developed under the project, corresponding to about 30,000 words in Hindi. The project has been underway since 1996. STATUS: underway.

Hindi WordNet is an online lexical database. The synonym sets can serve as an unambiguous differentiator of the different meanings of a word: each sense of a word is mapped to a separate synset in the wordnet. All word senses are linked by a variety of semantic relationships such as hypernymy / hyponymy, meronymy / holonymy etc. The lexical data is stored in a MySQL database. The web interface gives the facility to query the data and browse the semantic relations that exist in the database for the search word. There is a morphological processing module as a front end to the system. (www.cfilt.iitb.ac.in).
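The idea of one synset per sense, linked by hypernymy, can be sketched with a small in-memory structure. This is not the Hindi WordNet schema (which the text says is held in MySQL); the Synset class, the romanized entries and the identifiers are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    synset_id: int
    words: list                  # synonymous words sharing one sense
    gloss: str
    hypernyms: list = field(default_factory=list)   # ids of more general synsets

# Hypothetical miniature database with romanized entries, for illustration only.
SYNSETS = {
    1: Synset(1, ["jeev", "praani"], "living being"),
    2: Synset(2, ["pashu", "jaanvar"], "animal", hypernyms=[1]),
    3: Synset(3, ["gaay", "gau"], "cow", hypernyms=[2]),
}

def senses(word):
    """Return every synset that contains `word`, one per sense."""
    return [s for s in SYNSETS.values() if word in s.words]

def hypernym_chain(synset_id):
    """Follow hypernymy links upward and return the glosses encountered."""
    chain = []
    current = SYNSETS[synset_id]
    while current.hypernyms:
        current = SYNSETS[current.hypernyms[0]]
        chain.append(current.gloss)
    return chain

cow_senses = senses("gaay")
cow_chain = hypernym_chain(3)
```

A query for a word first retrieves its synsets and then browses relations from each one, which mirrors the query-and-browse facility described for the web interface.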

D. Spell Checkers / Grammar Checkers / Style Checkers

The first spell checker for Hindi was developed by CDAC, Pune in 1992. It is a manual spell checker for the Devanagari script. It allows users to search documents created in an Indian language for spelling errors. This spell checker utility points out possible errors, which the user can correct or choose to ignore. The spell check can be done on the entire document or on a selected portion of it. This utility is supported for Hindi language documents with TrueType fonts only.

The spell checker points out possible spelling errors in documents. The utility also provides possible suggestions for correcting the erroneous word under Spelling Suggestion. The user can either choose a suggestion to correct the word or use the Change To box to enter a different replacement word.
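How a spell checker might flag a word and rank suggestions can be sketched with a dictionary lookup plus edit distance. This is a generic technique, not the CDAC implementation; the romanized word list and the distance threshold are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def check(words, dictionary, max_distance=1):
    """Return {misspelled word: sorted suggestions within max_distance}."""
    report = {}
    for w in words:
        if w not in dictionary:
            report[w] = sorted(d for d in dictionary
                               if edit_distance(w, d) <= max_distance)
    return report

DICTIONARY = {"kamal", "kamra", "kalam"}   # hypothetical romanized word list
report = check(["kamal", "kamla"], DICTIONARY)
```

Words found in the dictionary are passed silently; each unknown word is reported with its nearest dictionary entries, leaving the choice (or a manual replacement) to the user as described above.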

No grammar checker or style checker has been developed by any agency so far.

E. Parsing Systems

A Hindi parser was developed by the Aksharbharathi group at IIT Kanpur under the project on Translation among Indian Languages. Under that project, it was envisaged to develop parsers and generators for each of the scheduled Indian languages. Later on, this project was restricted to the development of Anusarakas from five Indian languages to Hindi.

Similarly, work on a Hindi parser has been undertaken by the Anglabharati group at IIT Kanpur for the translation system from Hindi to English, under development at the same Institute.

A Hindi parser has also been developed for ANUVADAK (Hindi to English) by a private company, Super Infosoft, New Delhi. (www.mysmartschool.com)

F. Machine Translation

Machine translation is an important application and has immense potential in the Indian market. There being twenty-two languages in the country, translation from one language to another would yield a large number of pairs (22 x 21 = 462 directed pairs). Keeping in view the maximum correspondence in the language pairs, and the need for translation of official correspondence from English to Hindi, this pair has been identified as the priority area for MAT systems.

Because of the similarity among Indian languages, translation among Indian languages has been considered to be easier than translation from English to Hindi. In view of the above, two areas for MAT, namely MAT systems for translation among Indian languages and MAT for translation from English to Hindi, have been identified as potential areas for research. Because of the complexity of the area, it has been considered feasible to develop only domain-specific systems for narrow domains.

Machine Aided Translation (Indian Languages to Hindi):

A demo system for the language pair Kannada to Hindi was developed initially at IIT Kanpur; this technology was demonstrated at various forums and was termed ANUSARAKA. The technology has now been extended to translation from Telugu, Marathi, Bengali and Punjabi into Hindi and is available for trial through e-mail. This work has been carried out jointly by IITK and the University of Hyderabad, Hyderabad. Anusaraka technology aims at providing access to any other Indian language to a person who knows Hindi. It is particularly important as content in Indian languages becomes available on the web or in digital form.

MAT system for Translation of Standard documents from English to Hindi:

The documents and reports used for public health campaigns are mostly in English. Translating these documents into Hindi will go a long way towards achieving the objectives of the respective campaigns. The system uses the Anglabharati approach developed at IITK. A demo system for translation of public health campaign documents has been developed. Keeping the above in mind, two projects for the E-H pair and one project for other Indian languages to Hindi were initiated for specific domains.

ANGLABHARTI represents a machine-aided translation methodology specifically designed for translating English to Indian languages. English is an SVO language while Indian languages are SOV and have relatively free word order. Instead of designing translators for English to each Indian language, Anglabharti uses a pseudo-interlingua approach. It analyses English only once and creates an intermediate structure with most of the disambiguation performed. The intermediate structure is then converted to each Indian language through a process of text generation. The effort in analyzing the English sentence is about 70% and the text generation accounts for the remaining 30%. Thus, with only an additional 30% effort, a new English to Indian language translator can be built. Some of the major design considerations are: an attempt to get 90% of the task done by the machine, with 10% left to human post-editing;

a system which could grow incrementally to handle more complex situations;
a uniform mechanism by which translation from English to the majority of Indian languages is achieved by attaching appropriate text generator modules; and
a human engineered man-machine interface to facilitate both its usage and augmentation.
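The 70% / 30% split quoted above implies a simple effort comparison, worked through in the sketch below. The percentages come from the text; treating them as additive units of effort, and the function names, are simplifying assumptions.

```python
ANALYSIS_SHARE = 70     # percent of effort spent analysing English, per the text
GENERATION_SHARE = 30   # percent of effort for one target-language generator

def separate_translators(n_languages):
    """Effort if each English-to-X translator is built from scratch."""
    return n_languages * (ANALYSIS_SHARE + GENERATION_SHARE)

def shared_analysis(n_languages):
    """Effort if English is analysed once and only generators are added."""
    return ANALYSIS_SHARE + n_languages * GENERATION_SHARE

effort_separate = separate_translators(5)
effort_shared = shared_analysis(5)
```

Under these assumptions, five target languages cost 500 units when built separately but 220 units with a shared English analysis, since each additional language adds only the 30-unit generator.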

Anglabharati is a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language), which generates a ‘pseudo-target’ applicable to a group of Indian languages (the target languages). A set of rules obtained through corpus analysis is used to identify plausible constituents, with respect to which movement rules for the ‘pseudo-target’ are constructed. The idea of using a ‘pseudo-target’ is primarily to exploit structural similarity to obtain advantages similar to those of an interlingua approach. It also uses an example base to identify noun and phrasal verbs and resolve their ambiguities.

The Paninian framework, based on Sanskrit grammar and using Karaka (similar to ‘case’) relationships, provides a uniform way of designing the Indian language text generators using selectional constraints and preferences. The lexical database is the fuel of the translation engine. A number of ontological/semantic tags are used to resolve most of the intra-sentence anaphora/pronoun references. Alternative meanings for the unresolved ambiguities are retained in the pseudo-target language. A text generator module for each of the target languages transforms the pseudo-target language to the target language. These transformations may lead to sentences which are ill-formed, so a corrector for ill-formed sentences is used for each of the target languages. Finally, a human-engineered post-editing package is used to make the final corrections. The post-editor needs to know only the target language.
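A movement rule of the kind described, which reorders English constituents into a verb-final pseudo-target, can be sketched as follows. This is a deliberately tiny illustration and not Anglabharati's rule base: the role labels, the single rule and the three-word lexicon are all assumptions.

```python
# Each constituent is a (role, text) pair; the role labels are hypothetical.
def to_pseudo_target(constituents):
    """Apply one movement rule: shift the verb group to the end (SVO -> SOV)."""
    verbs = [c for c in constituents if c[0] == "V"]
    others = [c for c in constituents if c[0] != "V"]
    return others + verbs

english = [("S", "Ram"), ("V", "eats"), ("O", "mango")]
pseudo = to_pseudo_target(english)

# Hypothetical word-for-word lexicon standing in for a Hindi text generator.
LEXICON_HI = {"Ram": "राम", "eats": "खाता है", "mango": "आम"}
hindi = " ".join(LEXICON_HI[text] for _, text in pseudo)
```

Because the reordered structure is shared by verb-final languages, only the final lexicon and generation step would differ for another target language, which is the advantage the pseudo-target is meant to provide.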

The ANGLABHARTI methodology was used to design a functional prototype for English to Hindi on a Sun system. The feasibility of extending this to English to Telugu/Tamil was also demonstrated. Thereafter, during 1995-97, the DOE/MIT TDIL programme funded a project for translating English health slogans into Hindi. ER & DCI Lucknow / Noida was associated with the project for field testing and packaging the software. In the year 2000, the project received further funding to make it more comprehensive. The outcome of this project has been the release of the first version of the software, named AnglaHindi (an English to Hindi version based on this methodology). AnglaHindi has also been web-enabled and is available for on-line translation at http://anglahindi.iitk.ac.in

Angla Hindi software technology has been transferred to two organizations and is being made available on both the Linux and Windows platforms. (http://anglahindi.iitk.ac.in)

Anubharati Technology for Machine Translation among Indian languages

Under this project, the Hindi Angala Machine Translation System, from Hindi to English for simple sentences, is under development at IIT Kanpur. The potential beneficiaries of this system could be all Govt./semi-Govt. organizations, Parliament, State Assemblies, the judiciary and all commercial organizations. (www.cse.iitk.ac.om/users/langtech)

Web based translation service for English news stories to Hindi- NCST, Bombay

The domain of news stories is highly context sensitive, hence the standard approaches to translation such as Direct Translation, the Transfer Approach and Interlingua are not adequate. Therefore a hybrid system, Vaakya, has been developed at NCST, Bombay. The input text is simplified using a pre-processor. Using word knowledge and heuristics, the topic of the news story is identified. The processed text is analyzed and parts of speech are tagged. Lengthy sentences are simplified using simplification rules. The text is then transformed into case-frame-like structures using the infitization rules. The target language is then generated by parameterized templates from the case-frame structures and the bilingual lexicon. The prototype Vaakya system is now being enhanced and adapted for providing a web translation service to news agencies. (www.ncst.ernet.in/projects/indix/)

MANTRA Machine Translation system for Officialese Domain – C-DAC, Pune

This project has been funded by the Department of Official Languages for the specific domain of Government of India appointment letters. The system is currently being tested at five ministries. The system uses Tree Adjoining Grammar (TAG), proposed by Shri Aravind Joshi in 1983 at the University of Pennsylvania, USA. TAG is a tree re-writing system. The system uses a TAG-based parser called VYAKARTA. This parser uses the sub-language concept for its definition and is capable of parsing about 250 tree families in English, Hindi, Gujarati and Sanskrit. A comprehensive lexicon is built using TAG for various complex phrases. A transfer lexicon, which contains the lexical structures/trees for both the source language and the target language, is then used to obtain the TAG formalism in the target language that is equivalent to the input sentence.

Anusaraka (IIIT, Hyderabad)

Anusaraka is a language accessor rather than a machine translation system in the true sense. It helps in overcoming the language barrier by assisting the reader to access information from another language. Anusaraka analyses the source language text and presents exactly the same information in a language close to the target language. It tries to preserve information from the input to the output text. It is a domain-free system and has been adapted from Paninian Grammar. It has been developed for translation from Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. The components of the system are:

	Morphological Analyzer
	Local Word Grouper
	Bi-lingual Dictionaries
	Mapper from Source Language to Target Language
	Word Synthesizer
	Post-editing interface

The Anusaraka has been made available in the public domain as an e-mail server for translation from Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. To run the Anusaraka on a given text, send the text by e-mail to nandi@anu.uohyd.ernet.in with the language name as the subject, such as ‘Telugu’ for translation from Telugu to Hindi. This will automatically run the Telugu to Hindi Anusaraka, and the output produced will be sent back to the sender. A copy is kept by the machine for later study. The text should be in 7-bit ISCII coding. Similarly, help can be obtained by sending a mail with the subject ‘Help’. (tdil@mit.gov.in)

SHAKTI MACHINE TRANSLATION SYSTEM

The Shakti Machine Translation System has been designed to develop MT from English to different languages of the world. It is presently working for English-Hindi, English-Telugu and English-Marathi. The Shakti MT kit will allow an MT system from English to other languages to be developed rapidly by providing only their language data.

Shakti uses statistical and rule-based approaches for processing language. It uses constituent structure at the chunk level and dependency relations at the sentence level of analysis. It has specialized components for word sense disambiguation, parsing, preposition attachment, phrasal verb identification, transfer grammar, and sentence and word generation.

The architecture of Shakti is highly modular. The complex problem of MT has been broken into smaller sub problems. Every sub-problem is a task, handled by an independent module. The modules are put together, using a common extensive representation using trees and feature structures, called SSF (Shakti Standard Format). The input sentence is first chunked and morph analyzed separately. After this, the output of the morph analyzer and chunker are combined and represented in the SSF.

The system has been designed to be user-friendly. Modules can be plugged in and unplugged. SSF itself is a highly readable, transparent format for linguists and computer scientists alike. The inputs and outputs of all the modules are available for inspection, so that developers can look at them and analyze where improvement is needed. Special care for robustness has been taken in the design of Shakti. If a module fails to perform its operation, the common format ensures that there is no immediate breakdown. For example, if the system does not find an equivalent of an expression, it will give the transliteration of the English expression. (http://shakti.iiit.net)
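The robustness idea, in which independent modules share one common structure and a failing module does not halt the run, can be sketched as below. This is not SSF or Shakti code: the dictionary used as the shared structure, the module names and the two-word lexicon are assumptions, and passing an unknown word through unchanged stands in for transliteration.

```python
def chunker(state):
    """Group words into chunks; here every word is its own chunk (placeholder)."""
    state["chunks"] = [[w] for w in state["words"]]

def transfer(state):
    """Look up each word; fall back to the source word itself when missing."""
    lexicon = {"water": "paani", "cold": "thanda"}   # hypothetical entries
    state["target"] = [lexicon.get(w, w) for w in state["words"]]

def broken_module(state):
    """Stand-in for a module that fails on this input."""
    raise RuntimeError("simulated failure")

def run_pipeline(sentence, modules):
    """Run modules in order over one shared structure; skip any that fail."""
    state = {"words": sentence.lower().split(), "failed": []}
    for module in modules:
        try:
            module(state)
        except Exception:
            state["failed"].append(module.__name__)   # record and carry on
    return state

result = run_pipeline("Cold water please", [chunker, broken_module, transfer])
```

The failed module is recorded for later inspection while the remaining modules still produce output, and because every intermediate value lives in the shared structure, a developer can examine each stage to see where improvement is needed.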

G. Optical Character Recognition (OCR)

OCR facilitates automatic rendering of printed text into machine-readable form and aims at better human-machine interaction, which envisages the use of other modes of communication with the computer, namely speech and vision. These systems are useful for direct transfer of large amounts of data into the computer with a scanner. The hard copy is fed into a scanner, the image file is then analyzed, and image processing techniques are applied to transform it into an ASCII file. The OCR software can be integrated with the Hindi speech synthesis system to make a text-to-speech system in Hindi. It can also be used as a front end for a Machine Aided Translation system.

Work on Devanagari OCR was carried out at IITK under the project Devadrishti, on recognition of hand-printed Devanagari script. ISI Kolkata has also developed OCR technology for Hindi. The technology has been transferred to CDAC Pune and Noida for developing an OCR for Hindi. Chitraksharika (OCR for Devanagari script) has been developed by CDAC Noida. The technical attributes of this OCR are:

It recognizes Hindi, Marathi and Nepali, scans images via a TWAIN interface, has an embedded Hindi spellchecker, and scans text and images in different formats. Its accuracy has been reported as 96.8% in Vishwabharat (Oct. 2003 issue). (www.cdacnoida.com)

Various image processing algorithms have been developed for obtaining the image matrices of the characters and identifying the Devanagari characters and words for laser-printed text. This development is being carried out at C-DAC, Pune. OCR software has been developed with the following input design parameters:

Laser printed raster scan image file using 300 and 600 dpi resolution
DV – Natraj and DV – Surekh Fonts
Font Size 12, 14, 16, 18, 20, 24, 28, 32, 36, 40 & 48
Scanning at 300 & 600 dpi using HP Desk Scan – 4C Scanner, black & white, 8bit, bmp file format.

A Shiro-rekha based approach has been adopted, on the basis of which algorithms have been devised for segmentation, tilt correction and feature extraction to generate three sets of databases, namely Top, Middle and Bottom modifiers. Approximately 95% accuracy has been obtained.
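A common way to locate the shiro-rekha (the headline that joins Devanagari letters) is a horizontal projection profile: the row with the most ink is taken as the headline. The sketch below illustrates that general technique only; it is not the C-DAC algorithm, the bitmap is hypothetical, and finding the lower boundary needed for the Bottom modifier zone is not shown.

```python
def find_shirorekha(image):
    """Return the index of the row with the most ink pixels (the headline)."""
    row_sums = [sum(row) for row in image]
    return row_sums.index(max(row_sums))

def split_zones(image):
    """Split a word image into the zone above the headline and the zone below."""
    r = find_shirorekha(image)
    return image[:r], image[r + 1:]

# Hypothetical 6x7 binary bitmap (1 = ink) with a full-width headline in row 1.
word = [
    [0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
]
headline = find_shirorekha(word)
top, body = split_zones(word)
```

Rows above the headline hold the top modifiers, while the rows below hold the character bodies; removing the headline also disconnects adjacent letters, which makes the subsequent character segmentation easier.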

The technology has since been transferred to Pyramid Cyberway Pvt. Ltd. New Delhi.

Displaying Web Documents through Negotiation and Dynamic Rendering

Web authors create documents in a variety of languages using a variety of character sets and fonts. It is not possible for the viewer of the document to have all those fonts and character sets present on his system. Thus either the client is required to download the fonts or install these on his/her system or install some software on his system to help in the process. For a truly portable solution, the client need not specially install any fonts or software on his system.

A Java-centric solution for displaying Devanagari documents has been developed, using a Java applet and a public-domain font. The applet and the font-related information total around 100k. The Server Java Applet encodes the glyphs and sends them with the document, so that Hindi fonts are not required at the client side for browsing the document. A Hindi search engine has been developed on the Linux platform.

A prototype pocket translator for Hindi, with a bilingual dictionary of 30,000 entries and 300 phrases, has been developed to establish the feasibility of the technology.

SPEECH DATABASE

Speech signals for 1000 selected words of Hindi have been recorded, with two utterances each by about 50 speakers. PC-based algorithms have been designed for speech signal processing, acoustic analysis of speech sounds and phoneme labeling. An acoustic-phonetic database for these words has been created. This is useful for speech synthesis and recognition. This has been developed at CEERI, Delhi.

HINDI ENCYCLOPEDIA (Vishwakosh)

The Hindi Encyclopedia consists of 12 volumes published by Nagari Pracharani Sabha during the 1960s, covering almost all the details of 1500 topics from various fields of life. The task of digitizing the information and putting it on the Internet was assigned to ERDCI, Noida and Kendriya Hindi Sansthan, Agra by MIT and MHRD as a joint project. The information given in the Encyclopedia is now accessible in alphabetical order and by category. The storage code is ISCII. The system can be expanded to other domains/lexicons and can be integrated with various Windows-based word processors. The technology has since been transferred to Pyramid Cyberway Pvt. Ltd., New Delhi.

Automatic Font Installer:

Users are generally required to download and manually install fonts for viewing non-Roman web sites, which is a cumbersome task. A single executable has been developed to carry out the process whenever the user chooses the font installation option. The font installer program runs on the server and installs the fonts on the client machine.

Indian Language Search Engine:

The search engine should allow indexing and searching of Devanagari HTML documents. The basic components are the Gatherer, the Indexer and the Search Processor. The Indexer and the Search Processor are being designed, as these two modules deal with the syntax and semantics of the language of the text. The Indexer will perform processing such as keyword extraction, stop word removal, stemming (handling different forms of a word), handling of word synonyms, and term weight calculation. The Search Processor looks up the index to find the documents containing the query keywords, calculates a relevance score, and ranks them according to the score. This search engine will also find keywords that occur within a composite word (combined according to ‘SANDHI’ rules; for example, the software will give a match for the keyword ‘ram’ if it finds ‘rameshwar’ in the document). It is assumed that the documents are in ISCII.
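The Indexer and Search Processor roles described above can be sketched with a small inverted index. This is a generic illustration, not the engine under development: the stop list and documents are hypothetical romanized text, term frequency is used as the term weight, and plain substring matching is only a crude stand-in for sandhi-aware matching of composite words.

```python
from collections import defaultdict

STOP_WORDS = {"ka", "ki", "ke", "hai"}     # hypothetical romanized stop list

def build_index(documents):
    """Map each keyword to {doc_id: term frequency}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Rank documents by summed term frequency; a keyword also matches
    any indexed word that contains it (crude stand-in for sandhi splitting)."""
    scores = defaultdict(int)
    for keyword in query.lower().split():
        for word, postings in index.items():
            if keyword in word:
                for doc_id, tf in postings.items():
                    scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))

DOCS = {
    "d1": "rameshwar ka mandir",
    "d2": "ram ki katha hai ram",
    "d3": "nadi ka kinara",
}
index = build_index(DOCS)
hits = search(index, "ram")
```

The query for ram ranks the document that contains the word twice above the one where it occurs only inside the composite word rameshwar, mirroring the relevance scoring and composite-word matching described in the text.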

Web based E-mail:

Hindi e-mail service has been developed which uses advanced ActiveX technologies available with Internet Explorer 4.0 and later versions of browsers for enabling the keyboarding and fonts for Indian languages on the client PC. This service provides a facility to type the text in Hindi language for sending an e-mail in Hindi which gets converted into HTML format. This converted Hindi text in HTML and font codes is delivered at the Email address of the user, who can just place it on any Web Page using any standard HTML editor like Netscape Composer.

The software components, namely the ActiveX controls and Hindi fonts, are downloaded and installed on the client's computer when the user accesses the system for the first time. Every time the user accesses the e-mail server, a check is made for the installed components.

To be able to send/view mail, the user must first create an account on the system by defining a login name and password. An account holder on this system can send a message to any other account holder. The message is stored in a database. The data on the server is stored in ISCII. When a request for reading an e-mail message is received, the server retrieves the message from the database, creates an HTML file containing the message with the Hindi font information on the fly, and delivers it. Microsoft Visual InterDev is the IDE, which uses the power of ASP (Active Server Pages) to make web pages and connect to the back-end database using ODBC drivers for MS SQL Server. Using ASP, queries are made to the database from the web page. ActiveX technology based Hindi e-mail, a search engine and a Bulletin Board System have been developed. Hindi e-mail also stores documents in ISCII format.

Hindi Bulletin Board System:

This web-based application, which is under development, allows users to create topics for discussion and maintains threads within a topic.

Hindi Search Engine:

Development is underway and involves the following: manually surfing the net and building indexes for documents in Hindi; inviting creators of Hindi language documents to submit web page URLs with page descriptions and keywords in Hindi to the search engine, i.e. building a web-based application to collect data in Hindi; building special search techniques for Hindi based on word morphology/thesaurus/sandhi etc.; delivering HTML document index descriptions in Hindi for search results; and defining standards for meta-tags etc. for Indian languages so that future spiders can retrieve documents for a particular language.

COURSEWARE IN HINDI

DOEACC ‘O’ level courseware (BANASTHALI VIDYAPEETH)

DOEACC ‘O’ level courseware in machine readable form has been developed in Hindi. DOEACC is also financially participating in this project. Once completed, this material will be published in book form and with incremental efforts, it can be published as CD-ROM and can also be made available on the web. The four modules covered in the syllabus are Information Technology, Cobol, PC Software, Programming in ‘C’, Business System. Manuscripts have been reviewed by the Experts and modification based on their advice are being incorporated.

LILA (Learn Indian Language through Artificial Intelligence)

It is a self-tutoring system for Hindi. The LILA series, developed with the support of the Department of Official Languages, Ministry of Home Affairs, includes:

	LILA Hindi Prabodh (on DOS, Linux and Windows platforms) 
	LILA Hindi Praveen (on DOS, Linux and Windows platforms) 
	LILA Hindi Pragya (on Windows platforms)

These are self-tutoring systems for Hindi, designed especially for employees of Government departments, Banks, Public Sector Undertakings and corporate bodies. These packages are also useful for other non-Hindi speakers who wish to learn Hindi from the basic to the advanced stage. They aim at imparting basic functional knowledge of Hindi and its core functional grammar.

J. Speech Technology

1. Text-to-Speech Synthesis System for Hindi, Speech Recognition and Voice Synthesizer (Hindi Vani):

It is a PC-based, unlimited-vocabulary text-to-speech conversion software for Hindi. The text document is generated using a Hindi editor that supports the ISCII standard. The input words are split into syllables using a parser. An acoustic-phonetic database of all these syllables is available and is subsequently used to create words. The concatenation of syllables into words and the superimposition of quality features are done by rules. The quality features of speech, such as intonation, stress and timing patterns, are then improved. A cascade-parallel formant synthesizer is used to synthesize the speech.

[Block schematic diagram: major components of the system]

The system can be used on any Pentium machine with the Windows operating system and multimedia facility. It is a useful product for visually handicapped people, for information retrieval in spoken form and as a text reading machine. The technology has been developed at CEERI, New Delhi and transferred to Utkal University, Bhuvaneshwar, Aarkay Computer Research Foundation, New Delhi and Expert Software Consultants Pvt. Ltd., New Delhi.
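The first two stages, splitting a word into syllables and concatenating stored units, can be sketched as below. This is a simplified illustration in Python and not the CEERI parser. It splits on orthographic syllables (aksharas) using Unicode Devanagari rather than ISCII, ignores schwa deletion and prosody, and uses a hypothetical unit database that maps each syllable to a list of placeholder parameter frames.

```python
import re

_CONS = "[\u0915-\u0939\u0958-\u095F]\u093C?"   # consonant, optional nukta
_CLUSTER = f"{_CONS}(?:\u094D{_CONS})*"          # conjunct joined by virama
_VOWEL_SIGN = "[\u093E-\u094C]"                  # dependent vowel signs
_INDEPENDENT_VOWEL = "[\u0904-\u0914]"
_MODIFIERS = "[\u0901-\u0903]*"                  # candrabindu, anusvara, visarga

# One orthographic syllable: a consonant cluster with an optional vowel sign
# (or a final virama), or an independent vowel, followed by any modifiers.
SYLLABLE = re.compile(
    f"(?:{_CLUSTER}(?:{_VOWEL_SIGN}|\u094D)?|{_INDEPENDENT_VOWEL}){_MODIFIERS}"
)


def split_syllables(word: str) -> list[str]:
    """Split a Devanagari word into orthographic syllables."""
    return SYLLABLE.findall(word)


def concatenate_units(word: str, unit_db: dict[str, list[float]]) -> list[float]:
    """Look up each syllable in the (hypothetical) unit database and join them."""
    frames: list[float] = []
    for syllable in split_syllables(word):
        frames.extend(unit_db[syllable])  # KeyError signals a missing unit
    return frames
```

In a full system the concatenated parameters would then be adjusted by the intonation, stress and timing rules before being passed to the formant synthesizer.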

2. Text-to-Speech Synthesizer: Vaachak

Prologix Software Solutions Pvt. Ltd. has developed a Text-to-Speech Synthesizer for Hindi, Vaachak. It has the ability to transform written text into clear, audible speech. Based on the latest speech technology, it provides the following features for implementation:

3. Hindi Speech Recognition System

This is a Hindi Speech Recognition System for a large-vocabulary, speaker-independent dictation task. The computer first learns the sounds of spelling in various contexts; this is called the training phase. After this, the system is ready to recognize speech. The system can be customized, and statistical techniques are used to make the recognition robust to variations in speakers' speech and to make it work on continuous speech with a large vocabulary.

This technology has been developed by IBM India Research Lab, New Delhi and transferred to Applesoft (software development for Indian languages), Bangalore.

4. Hindi Dictation and PC Control

Hindi Dictation and PC Control is a dynamic speech recognition system for two languages, Hindi and Indian English. It is capable of taking continuous dictation and controlling PC functions in both languages through simple phrases and commands. It can be extended to applications such as telephone-PC communication, voice query systems and many more using VB and VC++ applications. It uses the L&H Dragon NaturallySpeaking Professional platform.

Many speech applications with business potential can be made using the Hindi/English speech recognizer. This technology has been developed by Megasoft, Haryana and transferred to HCL Infosystem Ltd., Noida, Applesoft, Bangalore and ER&DC (I), Kolkata.

5. English (Text) to Hindi (Text) SMS. (Under development)

This would allow a user to send an SMS in English and have it delivered in Hindi. On receiving the message, the SMSC will pass it on to the AnglaHindi translation server. After translating the message, the server will send it back to the SMSC, and it will be displayed on the cell phone if the architecture of the cell phone supports this. Most cell phones are now able to support picture messages. The key issue is Hindi font support.
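The message flow can be sketched as follows. This is a hypothetical outline in Python, not the design of the system under development. The class names, the capability flags and the fallback order (Hindi text first, then a picture message, then the untranslated English) are assumptions made for the example, and the translator is passed in as a plain function standing in for the translation server.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Handset:
    number: str
    supports_hindi_font: bool
    supports_picture_messages: bool


@dataclass
class Delivery:
    number: str
    kind: str      # "hindi_text", "hindi_picture" or "english_text"
    payload: str


def deliver_translated_sms(text: str, handset: Handset,
                           translate: Callable[[str], str]) -> Delivery:
    """Translate at the message centre, then pick a form the handset can show."""
    hindi = translate(text)
    if handset.supports_hindi_font:
        return Delivery(handset.number, "hindi_text", hindi)
    if handset.supports_picture_messages:
        # The Hindi text would be rendered to an image before sending.
        return Delivery(handset.number, "hindi_picture", hindi)
    # Assumed fallback: deliver the original English unchanged.
    return Delivery(handset.number, "english_text", text)
```

The choice between text and picture delivery reflects the point made above that font support on the handset is the deciding factor.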

6. English (Text) to Hindi (Speech) SMS. (Under development)

This solution would enable an SMS sent in English from a subscriber’s phone to be translated into Hindi at the SMSC and subsequently delivered as a voice message.

K. Standardization Issues

Character level standards: ISCII/UNICODE

ISCII (Indian Script Code for Information Interchange): A standardization committee under the Department of Electronics evolved the ISCII standard during 1986-88. In 1991, the Bureau of Indian Standards adopted it as the Indian Script Code for Information Interchange. For an introduction to ISCII and the ISCII code table, visit the site: http://tdil.mit.gov.in/standards.htm

Unicode Standard

The Unicode Consortium develops, extends and promotes the use of the Unicode Standard, which specifies the representation of text in modern software products and standards. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering.

Indian Languages on Unicode: The Unicode Standard has incorporated Indian scripts under the group named Asian Scripts. This includes Bangla, Devanagari, Gurmukhi, Oriya, Gujarati, Tamil, Telugu, Malayalam and Kannada. The Indian language block of the Unicode Standard is based on ISCII-88. Address: http://www.unicode.org

To view the charts of Indian language character sets in Unicode, visit: http://www.unicode.org
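One consequence of the ISCII-based design is that the Indian script blocks in Unicode share a parallel layout, so the same letter sits at the same offset in each block. The short Python sketch below illustrates this using the standard unicodedata module. The helper name same_slot_names is made up for the example, and the block start values are those of the Unicode code charts.

```python
import unicodedata

# Starting code point of each Indian script block in Unicode.
BLOCK_BASE = {
    "DEVANAGARI": 0x0900,
    "BENGALI": 0x0980,
    "GURMUKHI": 0x0A00,
    "GUJARATI": 0x0A80,
    "ORIYA": 0x0B00,
    "TAMIL": 0x0B80,
    "TELUGU": 0x0C00,
    "KANNADA": 0x0C80,
    "MALAYALAM": 0x0D00,
}


def same_slot_names(offset: int) -> dict[str, str]:
    """Return the character name found at the same offset in every block."""
    return {
        script: unicodedata.name(chr(base + offset), "<unassigned>")
        for script, base in BLOCK_BASE.items()
    }
```

For example, same_slot_names(0x15) reports the letter KA for every script listed, while an offset that a script does not use (such as 0x16, KHA, in Tamil) is reported as unassigned.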

Glyph standardization
Keyboard layout

ILKEYB:

This is an Indian language keyboard program developed by Avinash Chopde. The software facilitates typing text in any Indian language script by memorizing only 50-60 keys. The user needs to remember only the basic vowels and consonants of a language, and the program automatically generates the 200+ characters (glyphs) required to correctly typeset text in any Indian language. It contains a high quality TrueType font (developed by Shrikrishna Patel) and a software module that maps the ASCII English keyboard to a particular Indian language script. For more information visit www.aczone.com/ilkeyb

INSCRIPT

This keyboard layout is used for data entry in Indian languages. The layout uses the default 101-key keyboard. The mapping of the characters is such that it remains common for all Indian languages. All the vowels are placed on the left side of the keyboard layout and the consonants on the right side. For more information visit www.cdacindia.com/htmlgis/standard/inscript.asp
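A common layout across scripts can be sketched by mapping each key to a script-independent slot and adding the base of the chosen script. The Python example below is an illustration only. The seven key assignments are an assumed subset of the layout and should be checked against the INSCRIPT specification, the output is Unicode rather than ISCII, and the names KEY_TO_SLOT and type_keys are hypothetical.

```python
# Key -> offset within a script block (assumed subset of the layout).
KEY_TO_SLOT = {
    "k": 0x15,  # KA
    "l": 0x24,  # TA
    "h": 0x2A,  # PA
    "j": 0x30,  # RA
    "e": 0x3E,  # vowel sign AA
    "f": 0x3F,  # vowel sign I
    "d": 0x4D,  # virama (halant)
}

# Starting code point of a few Unicode script blocks.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
}


def type_keys(keys: str, script: str) -> str:
    """Convert a sequence of key presses into text in the chosen script."""
    base = SCRIPT_BASE[script]
    return "".join(chr(base + KEY_TO_SLOT[key]) for key in keys)
```

Under these assumptions, the same three key presses k, e, j produce the syllables "ka" + "r" in Devanagari, Bengali or Gujarati depending only on the selected script, which is what allows a typist to move between languages without relearning the keyboard.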

Rendering engines
Operating System level support

To view the charts of Indian language character sets in Unicode, visit: http://charts.Unicode.org

Automatic Generation of Concept Dictionary and Word Sense Disambiguation:

A Concept Dictionary is a repository of language-independent representations of concepts using special disambiguation constructs. A system has been developed at IITB for automatically generating document-specific concept dictionaries in the context of language analysis and generation using the Universal Networking Language (UNL).

Hindi analysis and generation

Hindi Analysis:

It uses the UW Hindi Dictionary and the analysis rule base. The dictionary contains the headword, the universal word and its grammatical and semantic attributes. The analysis system can deal with almost all of the morphological phenomena in Hindi.
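The shape of such a dictionary entry, and its use in picking a sense, can be sketched as below. This is an illustrative Python outline and not the IITB system. The three entries, their universal words and the attribute labels (N, V, INANI) are hypothetical examples written in the general style of UNL dictionaries.

```python
from dataclasses import dataclass

KITAB = "\u0915\u093F\u0924\u093E\u092C"  # Hindi headword for "book"
SONA = "\u0938\u094B\u0928\u093E"         # ambiguous: "gold" or "to sleep"


@dataclass(frozen=True)
class Entry:
    headword: str
    universal_word: str
    attributes: frozenset


# Hypothetical entries: headword, universal word, grammatical/semantic attributes.
DICTIONARY = [
    Entry(KITAB, "book(icl>publication)", frozenset({"N", "INANI"})),
    Entry(SONA, "gold(icl>metal)", frozenset({"N", "INANI"})),
    Entry(SONA, "sleep(icl>do)", frozenset({"V"})),
]


def lookup(headword: str, required: frozenset = frozenset()) -> list[str]:
    """Return universal words for a headword whose attributes include `required`."""
    return [
        entry.universal_word
        for entry in DICTIONARY
        if entry.headword == headword and required <= entry.attributes
    ]
```

Here the attributes act as a very simple disambiguation filter: when the analysis rules have established that a verb is expected, lookup(SONA, frozenset({"V"})) keeps only the "sleep" sense.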

Hindi Generation:

It also uses the UW Hindi dictionary and the generation rule base. The generation rules are formed from the grammatical and semantic attributes as well as the syntactic relations. A matrix-based priority of relations has been designed to get the correct syntax. Extensive work has been done on morphology.
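The role of relation priorities in word order can be sketched as follows. This Python example is a simplification and not the generation system itself. A single ranked list stands in for the priority matrix, the ranking of the UNL relations agt, tim, plc and obj is an assumption, and the words are romanized Hindi supplied already inflected.

```python
# Assumed ranking: lower numbers are placed earlier in the sentence.
RELATION_PRIORITY = {"agt": 0, "tim": 1, "plc": 2, "obj": 3}


def linearize(verb: str, arguments: list[tuple[str, str]]) -> str:
    """Order (relation, word) pairs by priority and put the verb last (SOV)."""
    ordered = sorted(
        arguments,
        # Unknown relations sort after all known ones; sorted() is stable.
        key=lambda pair: RELATION_PRIORITY.get(pair[0], len(RELATION_PRIORITY)),
    )
    return " ".join([word for _, word in ordered] + [verb])
```

For instance, linearize("khata hai", [("obj", "aam"), ("agt", "Ram")]) places the agent before the object and the verb at the end, giving the usual Hindi subject-object-verb order regardless of the order in which the relations were supplied.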

Knowledge Resources

Under this project, the following online dictionaries have been developed:
English-Hindi-Malayalam Dictionary (www.malyalamresourcecentre.org)
Telugu-Hindi Dictionary (www.languagetechnologies.ac.in)

Localization of the Operating System

Enabling Localization of Linux Open Source Software for Hindi (Index): The goal of this project is to provide system-level support for the Hindi language in the GUI of the Linux operating system, such that most of the existing applications can work with Hindi without any modification or recompilation. (www.ncst.ernet.in/projects/index/).
