Sunday, August 17, 2014

Tamil Internet 2000 : Directions to the Digital World

Singapore 22-24 July 2000
Bringing Tamil Literature Online:
Status report on Project Madurai

[see also Introduction to Project Madurai]
Kumar Mallikarjunan, Biological Systems Engineering Department, Virginia Polytechnic Inst. & State University, Blacksburg, VA, U.S.A., and K. Kalyanasundaram, Institute of Physical Chemistry, Swiss Federal Institute of Technology, 1015 Lausanne, Switzerland

Abstract: Electronic versions of printed texts (abbreviated as Etexts) of ancient literary works are important pedagogic and scholarly resources. Stored in easily accessible archives, they permit preservation and wider distribution of ancient literary works around the globe through the Internet. Etexts of literary works also allow quick searches for phrases, words, and combinations of words in any literary work.

"Project Madurai", an open and voluntary initiative, was started in January 1998 to collect and publish free electronic editions of ancient Tamil literary classics. Most of the released works have been typed in by volunteers around the world. The Etexts are archived in the most readily accessible formats (plain text, web pages and PDF files) for use on all popular computer platforms. Anyone, anywhere, may download a copy for personal use or read what we publish on the Internet, free of charge. Major emphasis is given to producing Etexts of the Tamil literary classics in native Tamil script.

Volunteers around the world coordinate the project; most of them pursue Project Madurai as a hobby. Currently we have 200 volunteers. To date, we have released about 200 works totaling around 8000 pages. Various issues related to this grand project are presented in this paper.

Introduction
Electronic versions of printed texts (abbreviated as Etexts) of ancient literary works are important pedagogic and scholarly resources. Stored in easily accessible archives, they permit preservation and wider distribution of ancient literary works around the globe through the Internet. Etexts of literary works also allow quick searches for phrases, words, and combinations of words in any literary work.

Several etext archiving projects have been under way worldwide, particularly in the last five years. There are already a handful of etext archiving projects for Tamil works, some by major universities:

o University of Cologne, Germany
o University of California, Berkeley
o Institute of Asian Studies, Chennai
o National University of Singapore and International Institute of Tamil Studies

And by a handful of individuals:

o Dr. Thomas Malten and colleagues at the Institute of Indology and Tamil Studies (IITS) of the University of Cologne, Germany, have already made phenomenal progress in the electronic archiving of Tamil literary works and hold the most extensive collection of electronic texts in Tamil in transliterated format ever made. A few of these works are available for download in their entirety. The gopher server at IITS provides word search across a large body of early Tamil literature (Sangam and post-Sangam period).

o Dr. Parthasarathy Dileepan of Tennessee, USA, enlisted the assistance of volunteers through the newsgroup soc.culture.tamil and has produced a transliterated etext version of the entire Nalayira Divya Prabhandam.

o Mani Varadarajan has a Web page devoted to Vaishnavism. Etexts (transliterated/roman) of a number of Vaishnavaite literary works are available there.

o English translation of thirukuRaL: the Tamil Web page of Janahan provides a pointer to the English translation of thirukuRaL produced by the Himalayan Academy.

o Siddharth Ramachandramurthi has a Web page where the entire thirukuRaL is displayed in the form of GIF images. The Web page allows word lookup across the entire thirukuRaL.

o Ganesh Subramanian has put up Web pages devoted to Saiva Siddhantha, in which he is working to provide transliterated (ITRANS) versions of various thirumarais online.

o The story behind Abirami Andadhi and its transliterated version (keyed in by Mrs. Vijayalakshmi Mallikarjunan), along with songs of Papanasam Sivan, are available at the Web page put up by the Mallikarjunans.

Through postings in soc.culture.tamil (Usenet), tamil@tamil.net (email discussion list) and the webmasters list, many people expressed the desire to start similar projects targeting ancient Tamil literary classics. Based on the interest shown on the mailing lists tamil@tamil.net and webmasters@tamil.net, the idea for one grand project emerged.
The goal of the project was to coordinate scattered activities such as those mentioned above. With like-minded persons expressing interest, "Project Madurai", an open and voluntary initiative to collect and publish free electronic editions of ancient Tamil literary classics, officially took off on Pongal (Tamil New Year) day, January 14, 1998. Since its inception, Project Madurai has been working hard to bring many Tamil classics to the digital world.

In this paper, the authors give a brief overview of issues related to building digital collections of Tamil works and a status report on Project Madurai activities.

Building Digital Collections of Tamil Works World Wide

In building digital collections of Tamil works, various organisations and individuals (see the partial list in the previous section) have taken a variety of approaches to encoding and presentation format, and their efforts have often been duplicated. These are major issues that need to be addressed in any present or future effort by an international body interested in building a digital or virtual library.

Encoding Format:

Currently, the digital collections available through Internet resources (gopher, web and FTP servers) use a wide variety of encoding schemes. They include, but are not limited to, Mylai, Adhawin, Tamilnet, WebTamil, Amutham and Murasu. As information technologies improve at delivering various font faces, newer encoding systems are springing up every day.

For the digital collections to be put to good use in research, retrieval and searching, the encoding format options should be limited to one or two standards. Candidates include glyph-based encoding standards such as TAB/TAM or TSCII and character-based encoding standards such as Unicode. Unicode-based fonts are becoming increasingly popular because Unicode has been adopted on the Macintosh and Microsoft (Windows 3.x/95/98/NT) platforms. Text converters are available to go between these formats. These converters should also support the standard Romanized transliteration schemes.
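As a rough illustration of what such a converter has to do, the following minimal sketch maps bytes of a glyph-based encoding to Unicode. The byte values in the table are placeholders invented for this example and are not the real TSCII or TAB/TAM code points; only the Unicode code points are real. The sketch also assumes that the glyph-based text stores the vowel signs that are written to the left of a consonant before that consonant, whereas Unicode stores them after it, so the converter reorders them. Two-part vowel signs and error handling are left out.

```python
# Placeholder glyph table: the byte values on the left are hypothetical,
# chosen only for illustration. They are NOT real TSCII or TAB code points.
GLYPH_TO_UNICODE = {
    0x80: "\u0B95",  # TAMIL LETTER KA
    0x81: "\u0BAE",  # TAMIL LETTER MA
    0x82: "\u0BBE",  # TAMIL VOWEL SIGN AA (written after the consonant)
    0x83: "\u0BC6",  # TAMIL VOWEL SIGN E  (written before the consonant)
    0x84: "\u0BC7",  # TAMIL VOWEL SIGN EE (written before the consonant)
    0x85: "\u0BC8",  # TAMIL VOWEL SIGN AI (written before the consonant)
}

# Vowel signs that appear to the left of the consonant in visual order.
PREFIX_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}


def glyph_bytes_to_unicode(data: bytes) -> str:
    """Convert glyph-ordered bytes to Unicode text in logical order."""
    out = []
    pending_sign = None  # a prefix vowel sign waiting for its consonant
    for byte in data:
        # Bytes below 0x80 are treated as plain ASCII and passed through.
        char = chr(byte) if byte < 0x80 else GLYPH_TO_UNICODE[byte]
        if char in PREFIX_SIGNS:
            pending_sign = char
            continue
        out.append(char)
        if pending_sign is not None:
            # Unicode expects the vowel sign after the consonant.
            out.append(pending_sign)
            pending_sign = None
    if pending_sign is not None:
        out.append(pending_sign)  # stray sign at end of input: keep it
    return "".join(out)
```

A table-driven design like this keeps the encoding-specific part (the table) separate from the reordering logic, so supporting another glyph-based encoding would mostly mean supplying another table.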

Presentation, Distribution and Archiving Formats:

With recent advances in browser technology, presenting Tamil text as GIF images has been phased out, and online documents now use many different techniques to present Tamil. Web versions based on HTML using FONT FACE definitions have been widely used since HTML 3.0. This limited online presentation because the specific font face had to be installed on the client's computer.

Since HTML 4.0, newer ways of presenting the text (e.g. using META tags or Cascading Style Sheets (CSS) specifications) have been explored. META tags and style sheets make it possible to specify a set of fonts (font families) that share a specific encoding (e.g. TAB or TSCII). With this, the recently developed standards can be used widely. This method still requires font files on the client's computer, but users can view the files with the font of their choice.

Recently, dynamic font rendering techniques have been explored because they do not need a specific font file on the client's computer. With newer developments including extensible markup language (XML), wireless markup language (WML) and user interface markup language (UIML), many other alternatives are being sought for presenting Tamil documents online.

Distribution of etext files as plain text (either in Tamil script or in Romanized transliteration) has been practiced by many services, including Project Madurai. However, where digital works need to be distributed with their formatting, the available options are web pages or Adobe Acrobat portable document format (PDF) files. PDF files can be more useful because they do not depend on specific font files on the client computer and because they can incorporate hypertext and search capabilities.

Searching:

Currently, very few (bilingual) search engines exist for finding out whether a given work is available as etext, or for locating specific words or word sequences in archived texts. This is due to the lack of standards, or to the failure to adopt them. Recent developments in encoding standards (TSCII and TAB/TAM) should make such search engines possible for the extensive Tamil literature available online.
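Once texts share a single encoding, word and phrase search becomes straightforward. The sketch below is an illustration only and assumes the etexts are held as Unicode strings; the function names and the sample collection are hypothetical and do not describe any existing search service. Text is normalised to NFC first, because some Tamil vowel signs can be stored either as one code point or as two, and a search should treat both forms as equal.

```python
import unicodedata
from collections import defaultdict


def normalise(text: str) -> str:
    """Bring text to Unicode NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def build_index(works: dict) -> dict:
    """Map each word to a list of (work id, line number) occurrences."""
    index = defaultdict(list)
    for work_id, text in works.items():
        for line_no, line in enumerate(normalise(text).splitlines(), start=1):
            for word in line.split():
                index[word].append((work_id, line_no))
    return index


def find_phrase(works: dict, phrase: str) -> list:
    """Return (work id, line number) for every line containing the phrase."""
    phrase = normalise(phrase)
    hits = []
    for work_id, text in works.items():
        for line_no, line in enumerate(normalise(text).splitlines(), start=1):
            if phrase in line:
                hits.append((work_id, line_no))
    return hits


# Hypothetical one-entry collection used only to exercise the functions.
SAMPLE_WORKS = {
    "sample-1": "அகர முதல எழுத்தெல்லாம் ஆதி\nபகவன் முதற்றே உலகு",
}
SAMPLE_INDEX = build_index(SAMPLE_WORKS)
```

Matching here is exact and case-sensitive on purpose: Romanized transliterations such as "thirukuRaL" use capital letters to distinguish sounds, so folding case would merge different words.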

Duplication and Reproduction:

The major problem in building digital collections today is the lack of coordination or collaboration among the various organizations. As a result, many works have been duplicated in many places. To avoid duplicated effort, the different projects should establish a fixed collaboration mechanism for the regular exchange of information on their etext collections. The reproduction of etexts from one project's collection in other projects and on various websites should also be critically addressed.
Project Madurai coordinators have been approached by many for-profit organizations wishing to reproduce the content on their sites. With the goal of providing the service free of charge, we have kept our promise by not letting for-profit organisations reproduce the contents of Project Madurai. However, we have not ruled out a central database (or set of links) covering the different efforts scattered worldwide, run by agencies such as INFITT.

Yet another issue in building digital documents is copyright. For ancient works published in recent times, the question of who holds the copyright, and for what, remains to be resolved. In many instances, it is assumed that present-day publishers can claim copyright for the explanations and the presentation format, but not for the old Tamil literature itself. However, where developmental work is involved in bringing old Tamil literature from palm leaves to the digital world, the copyright should be given to the persons responsible for deciphering the palm leaves. The matter remains controversial, and an international body, along with the respective governments, should mediate in these situations and formulate some standards.

Status of Project Madurai

Since its inception, Project Madurai has released more than 200 works totaling more than 8000 pages. Most of the released works have been typed in by volunteers around the world. The Etexts are archived in the most readily accessible formats (plain text, web pages and PDF files) for use on all popular computer platforms. Anyone, anywhere, may download a copy for personal use or read what we publish on the Internet, free of charge. Major emphasis is given to producing Etexts of the Tamil literary classics in native Tamil script.

Scope:
The scope of coverage includes ancient and modern times and works of all religions/faiths (Hindu: Caivaite and Vaishnavaite; Christian, Islam, and Jain). In the choice of works, the main criterion is to honor the copyright protection given to authors. Even though copyright rules vary from country to country, most etext archiving projects consider the lapse of at least 50-70 years after the death of the author a safe criterion. So, as a rule of thumb, Project Madurai considers works of authors who died before 1929. Hence the tilt towards archiving ancient Tamil literary classics.

The second main reason for favoring ancient literature is that these works are out of print and hence at risk of being lost to the world. 20th century Tamil literature is largely dominated by novels, with an associated decrease in the sale of printed copies of ancient literary works covering other domains. Key publishing houses such as Saiva Siddhantha Trust have dropped several of their planned reprints of classics for lack of a market. A few university libraries have copies that were printed in the twenties or thirties. In the absence of adequate storage facilities, these works are being eaten away by insects. The Tamil language has a rich heritage dating back several thousand years.
We all have a moral obligation to ensure that future generations have access to this rich treasure, possibly via better means of archival and worldwide distribution. Having said this, any ancient literary work for which we can find a hard (printed) copy and, importantly, volunteers to key in the text can be considered for inclusion in the archives.
Modern literary works can be included provided the author concerned (or the author's legal heirs) is prepared to give explicit written consent for the work to be put up in electronic form and for unrestricted, free distribution on the Internet (e.g. yogi Sudhdhanatha Bharathi, Vairamuthu, Uthayanan). In addition, modern works of Sri Lankan Tamil authors have been included. Through special arrangements with the next of kin, the Tamilnadu Govt. has placed "in public domain" the works of select Tamil authors of the 20th century: Bharathiyar, Bharathidaasan, Annadurai, Namakkal kavingar Ramalingam pillai and a few others. Hence these works are included in the project.
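The inclusion criteria described in this section can be summarised as a small decision rule. The sketch below is only an illustration of the rule of thumb as stated here; the function and parameter names are hypothetical, and it is not a statement of copyright law in any country.

```python
from typing import Optional

# Rule of thumb stated in the text: authors who died before 1929.
CUTOFF_DEATH_YEAR = 1929


def eligible_for_archive(
    author_death_year: Optional[int] = None,
    written_consent: bool = False,
    placed_in_public_domain: bool = False,
) -> bool:
    """Return True if a work meets one of the inclusion criteria above."""
    # Works placed in the public domain by government arrangement qualify.
    if placed_in_public_domain:
        return True
    # Modern works qualify with explicit written consent from the author
    # or the author's legal heirs.
    if written_consent:
        return True
    # Otherwise fall back to the rule of thumb on the author's death year.
    return author_death_year is not None and author_death_year < CUTOFF_DEATH_YEAR
```

For example, `eligible_for_archive(author_death_year=1900)` returns `True`, while a work by an author who died in 1950 qualifies only with written consent or a public-domain declaration.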

Volunteers

Project Madurai is based on voluntary cooperation between many people in several countries. Currently, the project has about 200 volunteers from countries including, but not limited to, the United States of America, Canada, Switzerland, Germany, England, Japan, Singapore, Malaysia, Australia, New Zealand and India.

Modus Operandi for Preparation and Proof-reading of Etexts
Volunteers interested in a particular work are given hard copies of the work. They work with regional volunteers and project leaders. Proof-reading of the texts is done, preferably by a different group of volunteers. The header of each etext file explicitly names the person(s) who keyed in the text and the person(s) who proof-read it. Where possible, the header also gives the details of the hard copy (publisher, year, etc.) used as the reference for proof-reading and editing.
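The header information described above could be modelled as a small record. The sketch below is hypothetical: the field names and the rendered layout are assumptions made for illustration and do not reproduce the actual header format used in Project Madurai files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EtextHeader:
    """Credits and source details carried at the top of an etext file."""
    title: str
    keyed_in_by: List[str]
    proofread_by: List[str] = field(default_factory=list)
    source_publisher: Optional[str] = None  # hard copy details, if known
    source_year: Optional[int] = None

    def render(self) -> str:
        """Format the header as plain text lines (layout is an assumption)."""
        lines = [
            f"Title: {self.title}",
            f"Keyed in by: {', '.join(self.keyed_in_by)}",
        ]
        if self.proofread_by:
            lines.append(f"Proof-read by: {', '.join(self.proofread_by)}")
        # The hard copy line is written only when some detail is available.
        details = [str(d) for d in (self.source_publisher, self.source_year) if d]
        if details:
            lines.append(f"Reference copy: {', '.join(details)}")
        return "\n".join(lines)
```

Keeping the proof-reader and reference-copy fields optional mirrors the text: the keying-in credit is always present, while the other details appear where they are known.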

Font Encoding Used
The project is committed to using the Tamil Standard Code for Information Interchange (TSCII) as the encoding format for the archives. The reasons for limiting the choice to TSCII are:
(a) over 90% of the Tamils worldwide use one of these; (b) fonts are available free for use on all three major computer platforms: Windows, Macintosh and Unix. Anyone who uses these computers can work in all of these formats; and (c) converters are available that work reliably to go from many other font formats to TSCII. The goal is to let each volunteer work in the environment (font and computer system) he or she feels comfortable with. Minimal constraints, if any, will be imposed on the volunteers who do the major task of keying in the texts.
Associated Mailing List

A mailing list exists as a forum to discuss related issues, and all the volunteers involved with Project Madurai are members of it. Issues such as software for text input, the selection of works and finding volunteers interested in a particular work are addressed on the mailing list.

Conclusions
In conclusion, the authors have raised in this international forum various issues related to building digital collections of Tamil literary works, including font encoding, presentation and distribution formats, duplication and reproduction, and copyright. The authors propose that an international body such as INFITT coordinate the various efforts to address these issues. The authors have also reported the status of a grand project, "Project Madurai", and how it has addressed many of these issues.
