Tech Forum

Posted: **Thu Feb 22, 2007 8:48 am**

I am a Church employee and have enjoyed reading your ideas and thoughts in the forums. Just wanted to make you aware of a need we currently have, in case you would like to volunteer to help with one of our current challenging and interesting projects.

I'm a developer on a Church system that must take members' names recorded in any alphabet or writing system (such as Greek, Thai, Japanese, etc.) and produce a "Romanized" version of the name (i.e., the name written in the standard Latin A-Z alphabet used for English and other western languages) for use throughout the Church.

We are using an open source library called International Components for Unicode (ICU) to accomplish this transliteration, and have found that for certain writing systems, the ICU library needs improvement.

Specifically, ICU's Thai-to-Latin transliteration produces names that are not pronounceable. For example, ICU transliterates the name "ออ่นเสมอ, อังคณา" as "Xx̀nsemx, Xạngkhā". There is a well-documented, standard set of rules called the "Royal Thai General System of Transcription" that can produce pronounceable romanizations of Thai names (it transliterates the example name above into "A-onsamoe, Angkhana"), but these rules have not been implemented in the ICU software.

It would be very helpful to us if someone would add the "Royal Thai General System of Transcription" standard rules to the transliterators available in the Java version of the ICU library. You don't need to speak or read Thai to be able to do this. Yes, knowing Thai would make the task easier, but if you can visually match a Thai character between the Unicode documentation and the Royal Transliteration rules, you can write the software.
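
As a minimal sketch of that character-matching step, the plain-Java snippet below lists the code point and Unicode name of each character in a string, so each one can be found in the Unicode Thai chart and then in the Royal Thai tables. It uses only the standard library, not ICU; the class name `ThaiCodePoints` and the method `describe` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ThaiCodePoints {
    // Describe each code point in the text as "U+XXXX NAME" so it can be
    // looked up in the Unicode Thai chart (U+0E00 to U+0E7F).
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints().forEach(cp ->
            lines.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return lines;
    }

    public static void main(String[] args) {
        // Two Thai letters written as escapes so the source file stays ASCII.
        for (String line : describe("\u0E2D\u0E07")) {
            System.out.println(line);
        }
    }
}
```

Pasting a whole name into `describe` gives one line per character, which can then be checked against the chart one at a time.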

If you are interested, here are some links that might help get you started:
- Documentation of the Royal Thai standard is here:
http://en.wikipedia.org/wiki/Royal_Thai_General_System_of_Transcription
(can be slow to load, but will eventually come up)
and here:
http://www.arts.chula.ac.th/~ling/tts/principles_eng.pdf

- A website with a demo implementation of the standard is here:
http://www.arts.chula.ac.th/~ling/tts/
(we need the option labeled "Roman" on this site)

- General information about the ICU library is here:
http://www-306.ibm.com/software/globalization/icu/index.jsp

- The portion of the ICU user guide that discusses transliterations and how to implement your own is here:
http://icu.sourceforge.net/userguide/Transform.html
and here:
http://icu.sourceforge.net/userguide/TransformRule.html

- A demonstration of the ICU transliteration functionality is found here:
http://demo.icu-project.org/icu-bin/translit

- A chart of the Thai character assignments in Unicode is here:
http://www.unicode.org/charts/PDF/U0E00.pdf

- A Unicode editor that will allow you to enter the Thai characters by codepoint is here:
http://www.unipad.org/main/

- Sample Thai text for testing can be found here:
http://www.unicode.org/standard/translations/thai.html

Thanks in advance to anyone who can help.
Regards,
Cindy

Posted: **Sat Feb 24, 2007 5:22 pm**

Posted: **Sat Feb 24, 2007 5:54 pm**

The project you propose sounds interesting. As a current student I wouldn't want to take on the entire project due to current time constraints. If you had a program similar to FamilySearch Indexing where I could do batches, I'd go for it.

Even if such an Indexing-esque model is not developed, I'd like to see perhaps a website where little projects like this could be posted along with a history of what has been done and where they could still use help (for those projects that are more modular).

Posted: **Sat Feb 24, 2007 7:06 pm**

I could be speaking completely out of place here, but I thought that Tom either had it set up or has plans to set up a sort of LDS SourceForge for project development. I would like to see this happen for multiple reasons. I want to try to open source a Java project using our internal Stack just to see how many developers would really bite on the opportunity.

Posted: **Sat Feb 24, 2007 7:12 pm**

bh5k wrote: I thought that Tom either had it set up or has plans to set up a sort of LDS SourceForge for project development.

That would make sense for LDS projects, but this is a requested enhancement to an existing non-LDS open source project. Would that be the right way to do it?

Posted: **Sat Feb 24, 2007 7:38 pm**

cindyc wrote: I'm a developer on a Church system that must take members' names recorded in any alphabet or writing system (such as Greek, Thai, Japanese, etc.) and produce a "Romanized" version of the name (i.e., the name written in the standard Latin A-Z alphabet used for English and other western languages) for use throughout the Church. (emphasis added)

This is a Church project that Cindy is referring to, and therefore for Church projects where they could use help, I propose such a website to manage the assistance they receive.

Posted: **Sun Feb 25, 2007 6:11 am**

RussellHltn wrote: That would make sense for LDS projects, but this is a requested enhancement to an existing non-LDS open source project. Would that be the right way to do it?

This project is already hosted elsewhere, and any changes we make would go to the upstream project. Feel free to post your changes upstream directly, or send them to us and we will review them and post them upstream.

I'd like to have some source code hosting here but am working through various concerns we have before doing that.

Cindy, is it possible to break up the work into smaller chunks?

Tom

Posted: **Sun Feb 25, 2007 1:09 pm**

kgthunder wrote: This is a Church project that Cindy is referring to, and therefore for Church projects where they could use help, I propose such a website to manage the assistance they receive.

To coordinate the assistance? That's fine. I thought people were talking about trying to host the source code for something that is already hosted at SourceForge.

I took a peek at some of the links. The first step may be to wrap one's head around the task. I still don't understand the charts. The second would be to understand the existing program. It may be more a matter of writing the rules than "real" programming.

This may be a "road crew" style project. One to do the work, one to act as a second pair of eyes and a third to supervise.

Posted: **Mon Feb 26, 2007 7:27 am**

I agree with Tom: we need to break it down and then create a bunch of teams that work together on the smaller parts, similar to Russell's road crew idea. That way we can get quite a few people working on it without everyone duplicating each other's effort.

Posted: **Mon Feb 26, 2007 9:59 am**

tomw wrote: Cindy, is it possible to break up the work into smaller chunks?

Hmmm...I can't think of a straightforward way to break down the task any smaller than a single writing system (such as Thai), but I'll describe the approach I've taken with other writing systems, in case that triggers your ideas on how to subdivide the work.

(By the way, we need help on other writing systems too...like Armenian, which could easily be done in parallel with Thai, and will probably be more straightforward than Thai. If you're interested, let me know and I'll provide information about our Armenian needs).

RussellHltn is right...the first step is to wrap one's head around the task. To help you do that, I'll list the approach I've used for this type of work in the past.
1. I use the ICU Transliterator.toRules method to dump out the current rules ICU is following. The code to do this would look like:
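    import com.ibm.icu.text.Transliterator;
    import java.io.BufferedWriter;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;

    public class DumpThaiLatinRules {
        public static void main(String[] args) throws Exception {
            // Look up ICU's built-in Thai-to-Latin transliterator.
            Transliterator thaiLatin = Transliterator.getInstance("Thai-Latin");
            // Write its rules as UTF-8 (true = escape unprintable characters).
            try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream("ThaiLatinRules.txt"), "UTF-8"))) {
                out.write(thaiLatin.toRules(true));
            }
        }
    }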

2. Then, I open the rules file I generated above in a Unicode-capable editor. Sometimes I find that the rules are just a series of calls to other transliterators, so I repeat step 1 on the sub-transliterators until I understand what is happening and get down to very specific rules, stuff that looks like:
ห > h̄;
ฮ > ḥ;
where you can see how each character is being mapped to a Latin character.
(These rules are explained in detail in the ICU user guide, which I've put a link to above).

3. Then, I compare the ICU rules with the rules of the standard I'm trying to implement (in this case, the Thai Royal standard). I create my own custom transliterator with the ICU rules as my starting point, and then iterate through a process of tweaking the rules, testing the tweak, and repeating, until my custom transliterator matches the standard I'm trying to implement. How do I know if I've matched the standard? I compare my output to other software that implements it (such as the demo website for the Royal Thai standard). Once the two agree, I send sample transliterated data to our office in the local country for their review.
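
To make the shape of steps 2 and 3 concrete, here is a rough plain-Java stand-in for how simple "source > target;" rules behave. It is not ICU's rule engine (real ICU rules also support contexts, variables, and more, as the user guide describes), and the class `SimpleRules` with its `parse` and `apply` methods is invented for illustration. The sample rules map the two letters shown in step 2 to "h", which is how the Royal Thai system writes both of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SimpleRules {
    // Parse text of the form "source > target; source > target;" into an ordered map.
    static Map<String, String> parse(String rules) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String rule : rules.split(";")) {
            String[] sides = rule.split(">", 2);
            if (sides.length == 2 && !sides[0].trim().isEmpty()) {
                map.put(sides[0].trim(), sides[1].trim());
            }
        }
        return map;
    }

    // Scan left to right; the first listed rule whose source matches at the
    // current position is applied, and unmatched characters pass through.
    static String apply(Map<String, String> rules, String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, String> r : rules.entrySet()) {
                if (input.startsWith(r.getKey(), i)) {
                    out.append(r.getValue());
                    i += r.getKey().length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First escape is ho hip, second is ho nokhuk.
        Map<String, String> rtgs = parse("\u0E2B > h; \u0E2E > h;");
        System.out.println(apply(rtgs, "\u0E2B\u0E2E"));
    }
}
```

Tweaking a rule here means editing one line of the rule text and rerunning, which is the same loop described in step 3, just without ICU in the picture.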

If someone could help us get to the point where a custom ICU Thai transliterator produces results that match the Royal Thai demo website (http://www.arts.chula.ac.th/~ling/tts/), then I could send sample data to our Bangkok office for their review and further refinement.
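
Checking a candidate against the demo website's output could be automated along these lines. This is only a sketch under assumptions: the reference pairs below are placeholders (real ones would be copied by hand from the demo site), the candidate is a deliberately incomplete stand-in function rather than an ICU transliterator, and `ReferenceCheck` and `mismatches` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ReferenceCheck {
    // Return one message per input whose candidate output differs from the reference.
    static List<String> mismatches(Function<String, String> candidate,
                                   Map<String, String> reference) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, String> e : reference.entrySet()) {
            String actual = candidate.apply(e.getKey());
            if (!actual.equals(e.getValue())) {
                report.add(e.getKey() + ": expected \"" + e.getValue()
                        + "\" but got \"" + actual + "\"");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Placeholder reference pairs; real ones would come from the demo site.
        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("\u0E2B", "h"); // ho hip
        reference.put("\u0E2E", "h"); // ho nokhuk
        // Hypothetical candidate that only handles ho hip so far.
        Function<String, String> candidate = s -> s.replace("\u0E2B", "h");
        mismatches(candidate, reference).forEach(System.out::println);
    }
}
```

An empty report would mean the candidate agrees with every reference pair collected so far, which is the point at which sample data could go out for review.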

Hope this helps. Thanks again.
