(Under each head, the various approaches and models used, applications, current state of technology, available software tools, future directions, etc. will be given, along with relevant references to organizations, institutions, companies, products, experts in the field, books, journals, articles, websites, CDs, cassettes, etc.)
1. General, Tagged Corpus, Parallel Corpus, Aligned Corpora 2. Corpus indexing tools (Concordance, KWIC index etc.) 3. Corpus compression and encryption tools 4. Text processing tools 5. Statistical Analysis Tools
1. Text editing tools 2. Word Processing tools 3. DTP tools
ITRANS is a package for printing text in Indian language scripts, developed by Avinash Chopde. Its output scripts include Devanagari (Sanskrit/Hindi/Marathi), Tamil, Telugu, Kannada, Bengali, Gujarati, Gurmukhi, and Romanized Sanskrit. The input to ITRANS is transliterated text: each letter of an Indian script is assigned an English (Roman) equivalent, and these English letters are used to compose what will eventually be printed in the Indian language script.
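The transliteration idea can be illustrated with a minimal Python sketch. The mapping tables below are a small hand-picked subset written in the style of the ITRANS scheme, and the function name to_devanagari is hypothetical. This is not the ITRANS package's own code or its complete encoding table.

```python
# Toy Roman-to-Devanagari transliterator (illustrative subset only).
VIRAMA = "\u094d"  # sign that suppresses a consonant's inherent "a"

# Roman token -> Devanagari consonant (assumed subset of an ITRANS-style table)
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "t": "त", "d": "द", "n": "न",
    "p": "प", "m": "म", "r": "र", "l": "ल", "s": "स", "h": "ह",
}

# Roman token -> (independent vowel letter, dependent vowel sign)
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "ा"), "i": ("इ", "ि"), "ii": ("ई", "ी"),
    "u": ("उ", "ु"), "uu": ("ऊ", "ू"), "e": ("ए", "े"), "o": ("ओ", "ो"),
}


def to_devanagari(roman: str) -> str:
    """Convert romanized text to Devanagari using the toy tables above."""
    out = []
    after_consonant = False  # True while the last consonant still lacks a vowel
    i = 0
    while i < len(roman):
        # Greedy match: prefer a two-letter token ("kh", "aa") over a one-letter one.
        token = next(
            (roman[i:i + n] for n in (2, 1)
             if roman[i:i + n] in CONSONANTS or roman[i:i + n] in VOWELS),
            None,
        )
        if token is None:
            # Unmapped character (space, punctuation): close a bare consonant, copy it through.
            if after_consonant:
                out.append(VIRAMA)
            out.append(roman[i])
            after_consonant = False
            i += 1
            continue
        if token in CONSONANTS:
            # Two consonants in a row form a conjunct, joined by the virama.
            if after_consonant:
                out.append(VIRAMA)
            out.append(CONSONANTS[token])
            after_consonant = True
        else:
            independent, sign = VOWELS[token]
            # After a consonant use the vowel sign; "a" is inherent, so its sign is empty.
            out.append(sign if after_consonant else independent)
            after_consonant = False
        i += len(token)
    if after_consonant:
        out.append(VIRAMA)  # word-final consonant without a vowel
    return "".join(out)
```

With these tables, to_devanagari("namaste") produces नमस्ते: the inherent "a" adds no sign, and the cluster "st" is joined with a virama.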
Word processing software lets users type and format text, often in multiple languages from within a single application. Some packages support right-to-left languages such as Arabic, Hebrew or Urdu. Some provide the input methods needed for double-byte languages such as Japanese, Chinese or Korean. A word processor may or may not include spell checking for the language it is used with. Where a language is listed as supported by the word processor, a spell checker may be included or sold as a separate module. Some of the word processing tools are as follows:
1) Marathi Saral Klik-2-Type is a Marathi word processor.
2) Shree-Lipi 5.0: It is a script processor and collection of fonts with a number of utilities for use on Windows 95, Windows 98, Windows NT, Windows 2000, Windows Me and Novell Netware. The languages and scripts it supports are Devanagari (Hindi, Marathi, Nepali, Konkani etc.), Gujarati, Punjabi, Bengali, Assamese, Oriya, Tamil, Kannada, Telugu, Malayalam, Sindhi, Sanskrit, Sinhalese, Russian, Arabic, etc.
3) iLEAP is an Internet-ready Indian language word processor for Windows. The main features of iLEAP are:
a) Self explanatory User interface
b) Multilingual Spellchecker
c) Choice of Keyboard layouts
d) Email facility for Indian languages. Send Indian language messages in LP2, ACI RTF, HTML, BMP, and JPEG formats so that they can be viewed in a Web browser or any standard text editor
e) Facility to make web pages in Indian languages.
f) Language Sensitive Multilingual Editor
g) User Definable Shortcuts to type frequently used words and phrases.
h) Search and Replace in Indian Languages
i) Carry Indian language text in RTF format to other programs for Graphic Enhancements and Pre-press Processing
j) Define Styles and Design Templates in any language
Website: www.cdacindia.com/html/gist/products/ileap.asp
adrishya = invisible; agyāna = ignorance; aikya = unity; akasmāta = unexpectedly; akhanḍīta = unbroken; āni = and; aṅgikārū = give shelter; anna = food; anta = end; artha = money; asā = like; ati = excess; āpalī = our; ātā = now; īsha = Lord; uḍī = jump; upēkshā = neglect; eka = one; kā = why; karā = do; kāy = what; kele = made; kharā = ass; koṇa = who; kripā = grace; ghyāvī = take; gūja = secret; chāla = walk (act); chāṅgale = good; jana = people; jāḷū = burn; jāṇatā = wise; jāvā = go; jiṇe = life; te = that; to = he; tujhā = you; tujhī = your; tyāce = his; daksha = attentive; dāna = charity; dilā = gave; deha = body; duṣṭa = evil; dūri = far; na = no; nahī = not; nishā = night; nīḷa = blue; pālaṭe = change; parī = but; pāve = bless; bolije = speak; mahī = earth; makshikā = fly; mājhe = my; mānīta = accept; mhaṇenā = utter; mī = I; mithyā = false; yā = this; yeta = come; rahāve = live; ritā = empty; laksha = attention; laṅḍī = coward; lapāve = hide; lokū = people; vadā = speak; vase = lives; vāhaṇe = carry; vāṇī = speech; vikalpe = doubt; viṣā = poison; vīsarā = forget; vegaḷe = different; sattā = rule; sāṅḍū = drop; sāpaḍe = find; shīkavū = teach; shiḷā = stone; shoka = grief; shreṣṭha = great; hā = this; hāni = harm; hīta = well-being
a) MaTra Lite: a fully automatic online translator with a simple web-based interface.
It is developing Marathi text corpus in electronic form (including a Marathi dictionary and 10 Marathi classics).
c) IIT, Mumbai is a participant in the Universal Networking Language (UNL) project, an international project of the United Nations University. UNL is an interlingua for semantic representation. In this project, input in the source language is enconverted into UNL and then deconverted from UNL into the target language. At present, work is under way on Marathi, Hindi and English.
http://www.cfilt.iitb.ac.in
http://www.cfilt.iitb.ac.in/wordnet/webmwn/ - Marathi WordNet
http://www.cfilt.iitb.ac.in/wordnet/webhwn/ - Hindi WordNet
These WordNets are compatible with the English WordNet and EuroWordNet. There are 5,521 synsets in the Marathi WordNet and 11,312 in the Hindi WordNet.
http://sanskrit.gde.to/all_txt/marathi-dict.txt
http://www.worldlanguage.com/Languages/Marathi.htm
http://www.cfilt.iitb.ac.in/ http://laiir.cse.iitb.ac.in/eng_unl_anal.html http://www.cfilt.iitb.ac.in/eng-hin-mt/
Spell checking utilities come in various forms. Some work only with specific applications, such as MS Word or Office. Others work through highlight or clipboard functions in most applications under Windows or Mac.
Grammar checking is a utility, typically bundled with a spell checker, that checks grammar and sometimes makes suggestions. A spell checker may not support grammar checking in every language for which it offers spell checking.
Parsing is equivalent to extracting the underlying semantics of the expression by identifying parts of speech and their inter-relations. Indian languages are free word order languages. The role of a word in a sentence is defined by its morphology, world knowledge and in a limited way by its position.
Parsing free-word-order languages relies on semantic parsing techniques. There are two modern techniques for semantic “parsing” of sentences: 1) the Universal Networking Language, and 2) the Paninian Grammar formalism. The Universal Networking Language is aimed at machine translation, whereas the Paninian Grammar formalism is a lightweight method suited to a broad range of needs. Work on Indian languages uses the Paninian Grammar formalism.
The problems which are presently being tackled are the development of a stable word grouper (which tackles noun phrases and compound verbs), and a clausal level parser for Indian languages.
The Morphological Analyzer is a part of a Natural Language Processing system for Indian languages. In Indian languages with free word order (like Hindi), the semantics depend on the surface structure of the word. Morphological analyzers identify the structural components of a word and collect information about it.
For Marathi, the morphological analyzer identifies the tense, aspect, modality and person of an inflected verb form. For Hindi, gender and number may be identified as well. The analyzer also identifies the root of the verb. For nouns, it determines the inflections, suffixes and prefixes. It also analyses the lexical word groups that correspond to the noun and determines their semantic role.
Website: http://www.cel.iitkgp.ernet.in
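The kind of analysis described above can be sketched as simple suffix stripping. The example below is a toy in Python, not the analyzer referred to in this section. The romanized spellings, the three-entry suffix table and the root list are illustrative assumptions that cover only a few Hindi imperfective verb forms (for instance jaataa, jaatii and jaate from the root jaa, "go").

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    root: str
    aspect: str
    gender: str
    number: str


# Hypothetical suffix table: ending -> (aspect, gender, number).
SUFFIXES = {
    "taa": ("imperfective", "masculine", "singular"),
    "te": ("imperfective", "masculine", "plural"),
    "tii": ("imperfective", "feminine", "singular or plural"),
}

# Hypothetical root lexicon (romanized): go, eat, write.
ROOTS = {"jaa", "khaa", "likh"}


def analyze(word: str) -> Optional[Analysis]:
    """Split a verb form into a known root plus a known suffix, if possible."""
    # Try longer suffixes first so that "taa" is tested before "te".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            # Accept the split only when the remainder is a known root.
            if root in ROOTS:
                return Analysis(root, *SUFFIXES[suffix])
    return None  # not analyzable with these toy tables
```

Here analyze("jaataa") returns the root "jaa" with masculine singular features, while a word with no matching root and suffix, such as "kitaab", returns None. A real analyzer would need far larger tables and rules for stem changes.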
The main aim of I.I.T., Mumbai is to empower the people of India through the use of Information Technology solutions in Indian languages. It is developing new products and services for processing information in Indian languages. It also conducts research in the computer processing of Indian languages. Marathi is the Indian language on which I.I.T. focuses.
It has created a web site on the Marathi language and Marathi language technologies. It is developing a Marathi text corpus in electronic form (including a Marathi dictionary and 10 Marathi classics). It has produced a Marathi portal complete with a search engine, machine translation software along with a dictionary, a WordNet and online textbooks for schools in Marathi. It has conducted research on machine translation between Marathi on the one hand and Hindi and English on the other. It has also introduced speech technology for Marathi. The output of I.I.T. is as follows:
1. Web site on Marathi language and Marathi language technologies 2. Training programmes & workshops 3. Electronic corpus of Marathi text. 4. Hindi Wordnet. 5. Marathi portal complete with search engine. 6. Machine Translation software along with dictionary. 7. Online textbooks in Marathi for schools.
MaTra is a human-aided machine translation tool from English to Hindi. It is a Web-based subsystem for translating English to Hindi. There are two versions of MaTra, based on the amount of interaction expected from the user.
a) MaTra Lite: a fully automatic online translator with a simple web-based interface.
b) MaTra Pro: a professional translator's tool with Auto, Semi-Auto and Manual modes, a GUI and a customizable lexicon.
They are useful to media and news agencies, translation bureaus and educational institutions involved in long-distance and online education.
ManTra is a machine-assisted translation tool from English to Hindi. It translates English text into Hindi in the specific domain of Personnel Administration, namely Gazette Notifications, Office Orders, Office Memorandums and Circulars. The strategy adopted in ManTra is based on the lexical tree. The ManTra technology is being extended to translate English text into other Indian languages such as Gujarati, Marathi, Bengali, and Telugu. It is useful for translators, linguists, government offices, the Central Translation Bureau and other translation units.
Anusaaraka is a language accessor: computer software that renders text from one Indian language into another. It produces output that is understandable to the reader, although at times it may not be grammatical. For example, a Marathi-to-Hindi Anusaaraka can take a Marathi text and produce Hindi output that a Hindi reader can understand but that is not fully grammatical. The reader requires some training to read the output. Anusaarakas have been built from Telugu, Kannada, Bengali, Marathi and Punjabi to Hindi. Beta versions for these languages have been released for use over the internet as e-mail servers. The storage code for Anusaaraka is ISCII. It can be used in various scenarios. For example, a reader browsing web sites containing Indian language texts may come across a site of interest and want to read its material without knowing the language. The reader can run Anusaaraka and read the text. Normally, such a reader's motivation is high, and he or she is willing to put in some effort.
Marathi WordNet Online allows users to browse the Marathi WordNet database through an HTML form interface. This web site uses Devanagari fonts.
It is a software product that works with a scanner. A user places a paper document printed in Devanagari (Hindi) script under the scanner, runs the OCR software and gets all the text from that document into the computer, just as if it had been typed in. The data is stored in ISCII code. The system is developed in the C programming language. It runs on the Linux platform and can easily be ported to Windows. The OCR software can be integrated with a Hindi speech synthesis system to make a text-to-speech system for Hindi. It can also be used as a front end for a machine-aided translation system. It is used by newspaper houses (printing in Devanagari script), libraries, offices looking for office automation, the linguistic community (for creating corpora), blind people, etc.
It is a software subsystem. Indian language scripts require composition: the subsystem must substitute a sequence of key codes with the corresponding conjunct character. The different matras and exceptions are displayed properly according to the rules of each language. The fonts are designed to support a large set of conjuncts and are available in TrueType or Adobe Type 1 format. At present, there are no accepted standards for Indian language fonts. A variety of keyboard layouts is supported for each language. The subsystem can be installed on Windows 95 through Windows 2000. It enables the Indian community to use Indian languages on the PC.
It is a library of over 250 high-quality fonts covering all Indian languages. The fonts are available as PostScript Type 1, TrueType, OpenType or PFR for the Web. They are used by software development companies that build eGovernance applications for various government departments. They are also used by state governments that want an enterprise license covering all their text and data processing requirements.
Various image processing algorithms have been developed for obtaining the image matrices of characters and identifying Devanagari characters and words in laser-printed text. This OCR was developed by C-DAC, Pune.
In this system, synthesized speech is generated in several steps from a sentence given as text. In the first stage, the sentence is analyzed by a natural language parser. After morphological and phonological analysis, the grapheme string is converted to a phoneme string, which can be mapped directly to dictionary entries that are then concatenated.
The Unicode Indic ranges are based on the Indian standard ISCII (Indian Standard Code for Information Interchange, 1988). ISCII is a well-intentioned standard that covers all the major Indian scripts. In its Devanagari range, it tries to capture all the characters found in any Indic script. For example, although native Devanagari has no equivalent for the letter 'ZHA' of Tamil and Malayalam (in Unicode, TAMIL LETTER LLLA (U+0BB4) and MALAYALAM LETTER LLLA (U+0D34)), both ISCII and Unicode define a placeholder Devanagari character for this letter. The idea is to have at least one script that can represent all the Indian languages in a lossless way. ISCII is also meant to be a symmetric encoding, in the sense that text encoded in ISCII can be displayed in any supported script without much re-encoding, so long as the target script has symbols for all the characters in the text. However, to achieve this, ISCII had to make a lot of compromises. It also turned out to be a very complex encoding, not readily implementable with the technologies used by other popular languages. The Unicode range used for the Marathi language:
Script: Devanagari. Languages: Sanskrit, Hindi, Marathi, Konkani and Nepali. Unicode range: U+0900 to U+097F.
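As a small illustration of the range above, the following Python sketch checks whether a character falls in the Devanagari block. The helper shift_script is a hypothetical illustration of the parallel, ISCII-derived layout of the Indic blocks: it assumes that adding the difference between two block start points maps a letter to its counterpart in the other script, and it leaves a character unchanged when the target position is unassigned. The Bengali, Tamil and Malayalam block starts are taken from the Unicode code charts, not from this document.

```python
import unicodedata

# Start of each 128-code-point Indic block (values from the Unicode code charts).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari block U+0900 to U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F


def shift_script(text: str, source: str, target: str) -> str:
    """Move characters from one Indic block to the same offset in another."""
    src = BLOCK_START[source]
    offset = BLOCK_START[target] - src
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp + offset)
            # unicodedata.name returns the default "" for unassigned code points,
            # i.e. when the target script has no symbol for this character.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)  # outside the source block: copy unchanged
    return "".join(out)
```

For example, shift_script("क", "devanagari", "tamil") gives the Tamil letter KA (U+0B95), while Devanagari KHA (U+0916) is returned unchanged because Tamil has no letter at the corresponding position. unicodedata.name("\u0934") reports DEVANAGARI LETTER LLLA, the Devanagari placeholder corresponding to the Tamil and Malayalam letters mentioned above.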
This keyboard layout is used for data entry in Indian languages. It uses the default 101-key keyboard. The mapping of characters is common to all Indian languages. All the vowels are placed on the left side of the keyboard layout and the consonants on the right side.
Web Site: www.cdacindia.com/htmlgis/standard/inscript.asp
This is an Indian language keyboard program developed by Avinash Chopde. It lets the user type text in any Indian language script by memorizing only 50-60 keys. If the user knows the basic vowels and consonants of a language, the program automatically generates the 200+ characters (glyphs) required to typeset text correctly in that language. It contains a high-quality TrueType font (developed by Shrikrishna Patel) and a software module that runs in the background under Microsoft's Windows operating system. The software maps the ASCII English keyboard to a particular Indian language script.
Web Site: www.aczoom.com/ilkeyb
Copyright CIIL-India Mysore