Monday, July 12, 2010

Indic transliteration on computers - brief history and now

There was a time when computers only handled Roman characters, the ASCII set, which had 128 characters and was later extended to 256 to accommodate extra symbols.

Back in the early 1990s, there was a newsgroup called soc.culture.india, where every Indian student sitting in a US university was venting: the good, the bad, the ugly and the very ugly! In 15 different languages! And many misunderstandings arose from the limitations of the English alphabet.

The two passions - Hindi movie songs and Sanskrit
People (including me) on a newly formed music newsgroup (RMIM) were having a torturous time when people misread the words of legendary Hindi lyricists from evergreen super hits. Confusion between t (त) and T (ट), or d (द) and D (ड), was causing havoc. No one was at fault: there were people who had heard the words but didn't know their proper meaning or spelling and who, finding a venue to ask and share for the first time, started asking away.

Another area that needed this disambiguation was the Sanskrit mailing lists, where the erudite were blasting people for the tiniest mistake caused by misinterpretation of the limited Roman alphabet. Sanskrit enthusiasts were trying to figure out their place in the promotion of Sanskrit, and it settled at encoding Sanskrit documents so that people could access them online.

Birth of ITRANS
So some schemes, conventions and mappings started developing. One of the successful ones to come out of those growing pains was named ITRANS, and Avinash Chopde quickly developed software for it. Those were the days of no web, with LaTeX being the coolest software for techie geeks, so the software was only for techie use! Options to switch between English and devanAgarI were embedded, along with much more complex stuff.

The Sanskrit Documents project took off with encoding using ITRANS. Similarly the Hindi movie songbook took off for Hindi movie songs (and other Indian languages as well). These two projects fueled the majority of online transliteration efforts like nothing else - both matters of extreme passion for the followers.

Sample encoding
A typical encoded text would look like this:
hariH AUM .. sha.n no mitraH sha.n varuNaH . sha.n no bhavatyaryamaa .
sha.n na indro bR^ihaspatiH . sha.n no vishhNururukramaH ..
namo brahmaNe . namaste vaayo . tvameva pratyakshhaM brahmaasi .
tvaameva pratyakshhaM brahma vadishhyaami . R^ita.n vadishhyaami .
satya.n vadishhyaami . tanmaamavatu . tadvaktaaramavatu .
avatu maam.h . avatu vaktaaram.h .. AUM shaantiH shaantiH shaantiH ..
AUM saha naavavatu . saha nau bhunaktu . saha viirya.n karavaavahai
. tejasvi naavadhiitamastu . maa vidvishhaavahai .
AUM shaantiH shaantiH shaantiH ..

devanAgarI -
हरिः ॐ ॥ शं नो मित्रः शं वरुणः । शं नो भवत्यर्यमा ।
शं न इन्द्रो बृहस्पतिः । शं नो विष्णुरुरुक्रमः ॥
नमो ब्रह्मणे । नमस्ते वायो । त्वमेव प्रत्यक्षं ब्रह्मासि ।
त्वामेव प्रत्यक्षं ब्रह्म वदिष्यामि । ऋतं वदिष्यामि ।
सत्यं वदिष्यामि । तन्मामवतु । तद्वक्तारमवतु ।
अवतु माम् । अवतु वक्तारम् ॥ ॐ शान्तिः शान्तिः शान्तिः ॥
ॐ सह नाववतु । सह नौ भुनक्तु । सह वीर्यं करवावहै
। तेजस्वि नावधीतमस्तु । मा विद्विषावहै ।
ॐ शान्तिः शान्तिः शान्तिः ॥

hariḥ oṃ ॥ śaṃ no mitraḥ śaṃ varuṇaḥ । śaṃ no bhavatyaryamā ।
śaṃ na indro bṛhaspatiḥ । śaṃ no viṣṇururukramaḥ ॥
namo brahmaṇe । namaste vāyo । tvameva pratyakṣaṃ brahmāsi ।
tvāmeva pratyakṣaṃ brahma vadiṣyāmi । ṛtaṃ vadiṣyāmi ।
satyaṃ vadiṣyāmi । tanmāmavatu । tadvaktāramavatu ।
avatu mām । avatu vaktāram ॥ oṃ śāntiḥ śāntiḥ śāntiḥ ॥
oṃ saha nāvavatu । saha nau bhunaktu । saha vīryaṃ karavāvahai
। tejasvi nāvadhītamastu । mā vidviṣāvahai ।
oṃ śāntiḥ śāntiḥ śāntiḥ ॥

Later on, some added conventions were introduced to make typing a bit easier, like M for .n (anuswAra).
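To make the idea concrete, here is a minimal sketch in Python of how such an encoding can be turned into devanAgarI. It is not Avinash Chopde's software, and the tables are a heavily reduced assumption of mine: only a few consonants, vowels and marks are covered (no R^i, no AUM, no special conjuncts), and all function and table names are invented for illustration.

```python
# Reduced ITRANS-style tables for illustration only (assumption: a real
# scheme covers many more letters, e.g. R^i, AUM and special conjuncts).
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ",
    "ch": "च", "chh": "छ", "j": "ज",
    "T": "ट", "Th": "ठ", "D": "ड", "Dh": "ढ", "N": "ण",
    "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न",
    "p": "प", "ph": "फ", "b": "ब", "bh": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "sh": "श", "shh": "ष", "s": "स", "h": "ह",
}
# Each vowel key maps to (independent letter, dependent sign / mAtrA).
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "ा"), "A": ("आ", "ा"),
    "i": ("इ", "ि"), "ii": ("ई", "ी"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "uu": ("ऊ", "ू"), "U": ("ऊ", "ू"),
    "e": ("ए", "े"), "ai": ("ऐ", "ै"),
    "o": ("ओ", "ो"), "au": ("औ", "ौ"),
}
# .n / M = anuswAra, H = visarga, .h = explicit halant, . and .. = daNDa.
MARKS = {".n": "ं", "M": "ं", "H": "ः", ".h": "", "..": "॥", ".": "।"}
VIRAMA = "्"  # the halant sign that removes the inherent 'a'


def longest_match(text, pos, table):
    """Return the longest key of `table` starting at text[pos], or None."""
    for length in (3, 2, 1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def itrans_to_devanagari(text):
    out = []
    pos = 0
    pending = False  # True while the last consonant still awaits its vowel
    while pos < len(text):
        key = longest_match(text, pos, CONSONANTS)
        if key:
            if pending:
                out.append(VIRAMA)  # two consonants in a row form a cluster
            out.append(CONSONANTS[key])
            pending = True
            pos += len(key)
            continue
        key = longest_match(text, pos, VOWELS)
        if key:
            independent, matra = VOWELS[key]
            out.append(matra if pending else independent)
            pending = False
            pos += len(key)
            continue
        if pending:
            out.append(VIRAMA)  # consonant followed by no vowel at all
            pending = False
        key = longest_match(text, pos, MARKS)
        if key:
            out.append(MARKS[key])
            pos += len(key)
        else:
            out.append(text[pos])  # spaces and anything unknown pass through
            pos += 1
    if pending:
        out.append(VIRAMA)
    return "".join(out)


print(itrans_to_devanagari("namaste"))          # नमस्ते
print(itrans_to_devanagari("sha.n no mitraH"))  # शं नो मित्रः
print(itrans_to_devanagari("ta"), itrans_to_devanagari("Ta"))
```

The point of the last line is the t/T distinction that caused so much trouble on the newsgroups: the capital letter is a different consonant, not a typing style.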

Then came Omkarananda Ashram's (omkArAnanda Ashrama) standalone ITransliterator, which is still available. It was the in thing: people frantically typed, gave feedback and helped improve it, and it supported multiple schemes such as ITRANS and IAST.

Then, with computers with GUIs, windows, fonts etc. reaching India, the printing world started to change. Offset printing was going away, and early devanAgarI fonts started to develop. They used the same 256-character space and replaced English characters with devanAgarI ones. Various styles developed. Mappings were more like a Hindi typewriter, making someone like me, who knew only the English keyboard, go nuts wondering why I got क (k) when I pressed t, or how to get ddha (द्ध).

If you typed in one font and put the content up on the web, people had to download your font. And if YOU went to five Hindi websites, you ended up with five fonts installed on your computer. And there was no way to take a text typed by X in font X1 and change it to a nicer/different font Y2. Nope. Someone would have to retype the whole thing. And if English characters were also used in the same piece, and you did "Select All" and changed the font, then you lost the English content.
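Why was it so painful? A rough sketch in Python may help. The two glyph tables below are hypothetical (only the "t gives क" slot comes from my own experience above; the font names X1 and Y2 are just the placeholders used in this post), but they show the mechanism: the file only ever stored English letters, and the font decided what you saw.

```python
# Hypothetical glyph tables for two pre-Unicode fonts. Only the "t" -> क
# slot is taken from the post; every other slot is invented for illustration.
FONT_X1 = {"t": "क", "d": "म"}
FONT_Y2 = {"t": "त", "k": "क"}


def render(stored, font):
    """What the reader sees: each stored letter drawn with the font's glyph."""
    return "".join(font.get(ch, ch) for ch in stored)


def convert(stored, src, dst):
    """Re-key text from font `src` to font `dst`, one glyph at a time.

    This needs a table for every pair of fonts, and it fails (KeyError)
    as soon as `dst` lacks a glyph that `src` has.
    """
    key_for_glyph = {glyph: key for key, glyph in dst.items()}
    return "".join(key_for_glyph[src[ch]] for ch in stored)


stored = "t"                          # what is actually saved in the file
print(render(stored, FONT_X1))        # क
print(render(stored, FONT_Y2))        # त  (same file, different letter!)
print(render(stored, {}))             # t  (no Hindi font installed)
print(convert(stored, FONT_X1, FONT_Y2))  # k
```

The same trick explains the "Select All" accident: once a Hindi font is applied to genuine English letters, they are drawn as devanAgarI glyphs too.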

Oh, the pains!

WWW as we know it
With the advent of the web, PHP and all the other things, better versions came out for web-based transliteration. Back in 2003 I started a Hindi/English magazine in Cleveland, and I found that a lot of fonts had many conjunct characters missing, and their mapping was messed up for someone who was typing English 9 hours a day. So I created two fonts, 'shashi' and later an improved version 'bhaarat', which had an English-intuitive mapping of keys and new conjuncts for the important ones. Great, but the same old problems of changing fonts etc. remained.

So I tried my hand at a web-based Lex parser that would take ITRANS plus my added convenience factors (eased a bit for Hindi), and I did it for bhaarat and for the Kruti font. Then I found that even within the Kruti family, fonts had minor mix-ups of key mappings.

Then came Unicode for Indic languages. This changed a lot of things. Firstly, you now needed only ONE font for all your needs. You want to write English, Greek, mathematical symbols, Hindi, Arabic, Tamil - all using one single font. Yep, the font file size bumped from a few kilobytes to 25MB!! My Big Fat Unicode Wedding, huh!

This brought in new problems though. Not all fonts were Unicode, and you had to find a Unicode font for your platform. A bigger problem was how to write in it. Since the font now had thousands of characters, you couldn't type them with the normal keys. You would go to "Insert" -> "Symbol" all the time in, say, MS Word. Anyways, I didn't bother to figure out this piece at all.

The advantages of Unicode were tremendous. Any site using Unicode could be viewed by any visitor. You could use different languages with the same font. The greatest one was that web sites could be searched uniformly with this code! AND you could actually sort words in their native script simply by using the 'Sort' of any software. This last point is a bit difficult to understand for non-techies.

The English characters are listed internally in alphabetical order. That is, after A come B, C, D etc., internally for the computer. For devanAgarI though, the consonant order is k kh g gh ~N ch chh ja etc. When sorted by software in their Roman spelling, these will not remain in this order; they will become ~N ch chh g gh ja k kh.

With Unicode this problem is solved. Want to see it? Go to Word or something, choose Insert -> Symbol, pick Arial Unicode MS, and see the characters laid out in order!
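For the techies, the same point can be sketched in a few lines of Python. The variable names are mine, and where exactly ~N lands depends on how a given software treats symbols (Python's plain sort puts it last), but either way the Roman spellings lose the devanAgarI order while the Unicode letters keep it.

```python
# The first eight consonants, in devanAgarI alphabetical order.
itrans = ["k", "kh", "g", "gh", "~N", "ch", "chh", "ja"]
devanagari = ["क", "ख", "ग", "घ", "ङ", "च", "छ", "ज"]

# Roman spellings sort by their English letters, so the order is lost.
sorted_itrans = sorted(itrans)
print(sorted_itrans)  # ['ch', 'chh', 'g', 'gh', 'ja', 'k', 'kh', '~N']

# Unicode letters sort by code point, and the code points were assigned
# in alphabetical order, so even a scrambled list comes back correct.
sorted_devanagari = sorted(reversed(devanagari))
print(sorted_devanagari == devanagari)  # True

# The code points themselves, U+0915 for क onwards.
print([f"U+{ord(ch):04X}" for ch in devanagari])
```

This covers single letters. Sorting full words with mAtrAs and conjuncts in strict dictionary order needs more than code point order, but the basic alphabet comes out right for free.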

Microsoft started to ship Windows with Indic fonts pre-installed. That was a boost to see 10+ Indic languages listed in your computer.

An exceptional tool
A wonderful tool has also been developed which is much more powerful and precise than the others. It allows conversion between multiple encodings: ITRANS, IAST, Hindi, Bengali, Gujarati etc. A highly recommended tool indeed. But this tool is unforgiving, in the sense that it does not attempt any guesswork, related spellings etc., so you need to be careful to type the exact encoding. Even after Google's tools, it remains useful and powerful for batch processing, multiple encodings and precision of encoding. If you know what you are typing, there is no guesswork even for unusual or new words that are not in Google's dictionary.

The G factor
The next step of real help for transliteration came from Google with its India operation and focus on Indian languages. The first time I noticed it (there could have been something earlier or different) was in the Blogger editor, which offered a choice of transliteration on the fly. You type the approximate spelling as if typing in English, and it changes it to proper devanAgarI script! It would give suggestions in case of ambiguities. It even had a virtual keyboard, where you could handpick the characters for the difficult words and combinations.

They also had a widget, and added transliteration in Gmail as well. It was a wonderful thing. Then they came with the new editor, in which they removed manual editing or the keyboard. That was a step back, since it didn't give you all choices, and no choice to manually edit!

Then they also launched a web-based transliteration service that overcame all these problems, had a virtual keyboard again, and offered some 15+ language choices. Once again, great job.

Google IME - the equalizer
The latest in the series is the IME, a downloadable piece that runs on your computer after a single install. Many readers reject it outright, asking why install software when you can do it on the web!

The IME from Google has some major advantages. While it has the full power of the existing service, it adds the following -
  • With the net-based tools, you needed a good network connection all the time. Some people can't afford that, since they pay for connect time, or are on a very slow connection.
  • At times, due to browser misbehaviour or something, typing in Blogger or Gmail would make the cursor jump. That is, you type on line 5 and the characters show up on line 2! No kidding, it happened to me many times. This doesn't seem to happen with the IME so far!
  • The IME sits between your keyboard and the OS, so now you can type directly in Indic languages in almost all software on your computer, be it the Firefox or Safari browsers, or Word, Excel, Powerpoint. Earlier you had to use a tool to type in Sanskrit, then copy and paste from there into the destination software.

In the above image, the small floating toolbar on the upper right (above column B in the Excel spreadsheet) is that of the IME. As you type, the possible choices are displayed underneath (under namastubhyam in the example above).

To install the IME, go to the IME link and download the proper version for your computer. Watch out for the 32 bit and 64 bit versions. For each language, you will need to install one IME. This is only one time, so go ahead.

After the download, install the program.

In the settings, do set up shortcut keys, so that you can switch between English, Sanskrit, Hindi, Bengali easily. I have set up Control-Shift-1 for English, Control-Shift-2 for Sanskrit and Control-Shift-3 for Hindi.

Now, for any application that takes Unicode, you can press the shortcut key sequence, and a small 3-icon menu pops up that shows you that you are typing using the IME. Try it out. You can type directly in the comment field of Facebook, or in Blogger or Wordpress like in this post, MS Word, Notepad, Photoshop, Excel, Powerpoint, Skype, Gtalk ... anywhere. No more cutting and pasting from a website, when you are out of reach!

And here is an even better part. If you have multiple programs open, like a chat window and a browser, you can choose to type in English in one, Sanskrit or Hindi in another, and Greek in yet another! Now that is heaven!

Give it a try. It is an almost perfect solution for transliteration on computers.

Next stop? Voice to words for other languages. Let us see how long that will take.

If you have any comments, suggestions, links to new tools, or questions on transliteration on computer, please leave a comment below (Click on the Comments link or click on the post title to see all the comments below) or send an email. Do not let your desire to type in your Indian language even on computer be hampered by technical difficulties.

Disclaimer: This is not an official, complete, historical account, but my personal account of the events as I saw and followed. Any omissions are either due to ignorance or to keep the story short.


(c) shashikant joshi । शशिकांत जोशी । ॐ सर्वे भवन्तु सुखिनः ।
Practical Sanskrit. All rights reserved. Check us on Facebook


shanthi said...

Wonderful post and keep rocking

vinay vaidya said...

I had downloaded and installed this Google transliteration tool, but it caused several complications. Either I couldn't understand well, or may be some other reason.
shall try again. I think it is there in my system, but I had un-installed it.

shashi said...

vinay, make sure you are getting the right version for your computer (32-bit 64-bit). after you download, you have to install it once. and once for each language you want. it should be pretty straightforward. what OS? PC or Mac?

L. L. Diamond said...

Has it been straightened out for MAC users -- i can't bear to read through the whole thing, try it and find out it only works for PC.
Thanks so much!

shashi said...

lalitA, the IME, standalone tool doesn't seem to be ready for Mac yet. but do try this online tool which seems to be Mac ready. i won't know exactly since i don't have a mac, but why don't you try this link and let us also know if it works.

also, the long post, is an interesting read into the history of this whole thing. do read it when you get time, you may enjoy it!

Anonymous said...

very useful
many thanks
also for such interesting of reviving sanskrit

shashi said...

for Mac users, a complete solution is as given by Apple. this enables Unicode at OS level and will help all software to use Unicode:
Mac OS X v10.2 through 10.3.9

Use the Character Palette feature:

1. Open System Preferences.
2. From the View menu, choose International.
3. Click the Input Menu tab.
4. Enable "Character Palette".
5. Quit System Preferences.

The Character Palette menu now appears in the menu bar, to the right of Help. Choose Show Character Palette from the menu when you wish to add special characters while using Unicode.

-- one user has confirmed that it works like charm on his mac mini.

Anonymous said...


This is a very good one. Thanks.

Is there a good solution for typing in indic language instead of having to transliterate it. I find it difficult to transliterate some of the words. I would rather type it मात्रा by मात्रा.

I used the web transliteration to write matra in devanagri. Thanks again.

shashi said...

there are two issues.
1. typing as one writes in devanAgarI. this requires the mAtrA to be separate character. and this was indeed the case with pre-unicode fonts, including kRuti, chANakya, bhaarat etc.

2. typing so the end product is computer friendly (aka unicode). in this since it is computer friendly (i.e. sortable in dictionary order, accommodating 40 other fonts etc.), it makes it difficult to type directly without the help of interpreting software (web based or standalone).

my technical and personal suggestion is that if you want to work on online content, do go for unicode. if you only plan to print books etc. (PDF is fine), then you can choose any font of your choice. to put a book online, you can put it up in PDF format, but it will remain unsearchable.

there are some very fancy fonts for effects, but they are not unicode, so don't leave non-unicode fonts outright.

Anonymous said...

tough competition to google

Sachin Garg said...

Thanks for the amazing history!!!

shashi said...

I saw the site. good initiative, the algorithm and data behind both may actually be mutually shared also. Though it is not difficult to create it from scratch. But the IME part of google is indeed worth appreciating, for it enables one to enter Indic text in ANY software that accepts Unicode.

I am able to type devanAgarI directly in Excel and Word, and it is extremely timesaving tool.

@sachin - You are welcome.

thiruthiru said...

The history behind and all the pains developers like you underwent is really amazing. I am using IME for sometime to write some Devanagari quotes. I experience some difficulty . For example when I want to write Ranga the "ngga" is always typed as रङ्ग only whereas I want it as a joint character. How to do it? In baraha I can type it as ra~gga and get it correctly done. I am unable to paste that exactly here as it is shown in Baraha. All other softwares fail in this regard. What is the difference between Baraha and IME?

shashi said...

thiruthiru, this seems to be a unicode issue. unicode doesn't have a conjunct for this, hence it uses the halant. it can only be solved by using a font that has the conjunct, like some of the ones you see in the image above for the 'shashi' font. the 'sanskrit new' wasted many character spaces by giving 'sha' (with vowel) and 'sh' (half) as separate characters.

you will have to use some other font for this. it is a unicode limitation. even if you try manually through the virtual keyboard or insert symbols, you cannot achieve it.

NIRMAL said...

i wanna get some tools for sanskrit typing

ambika said...

thanks... this piece of information was a big help!!

sanjiv said...

Great article with all the required information at one place. I struggled with devanagari fonts for many years. This blog gives the much needed clarity. By the way, which editor/too/setting I can use, where keyboard is phonetic but data is stored in unicode format ?

Bhavesh Patel said...

Greate Article .. Thank you..
