Hakka News — Adding 11 Unicode Characters

By Peter Burkimsher

This article describes how reading the Bible led to exchanging emails with world experts in linguistics, and hopefully adding to Unicode, the international standard we use for typing text on computers.

I’m currently looking for a job. Please email me at peterburk@gmail.com if your company has a suitable position, or if you have a project for me that could lead to a reference letter.

For the last 4 years, I was making memories in Taiwan. Literally! USB memory, SD card memory — computer memory cards. I wrote control systems and logging software at InfoFab/OSE. My colleagues spoke Mandarin, and I knew that I needed to learn Chinese to overcome the language barrier.

All the Chinese classes were in the daytime. I asked my boss to allow me to reduce my working hours so I could attend classes. He agreed, but the HR department would not let me take only 10 hours of weekly classes at Wenzao or National Sun Yat-Sen University. Instead they insisted that I take 15 hours of classes at the National University of Kaohsiung. That would reduce my working hours below the minimum for a work visa.

So I asked my friend Wendy to be my private tutor. I paid her out of my own pocket, and met her once a week for language exchange. I’m not gifted with learning languages. She patiently tried and failed 10 different ways to teach me. Eventually that led to Pingtype, my own app that helps me write, type, read, listen to, and sing in Chinese. I’m still not good at speaking, because that requires a human. But my comprehension is alright, especially in church.

An example parallel sentence processed with Pingtype

The classes kept me motivated weekly. But to improve, I needed to practice daily. So I decided to read the Bible in Chinese. Every day in 2017, I read the Chinese New Living Translation Bible, and slowly improved. As I found errors in the word spacing, I made 5000 edits to the Pingtype dictionary. I also listened to worship music from KHOP, and cut 1081 separate MP3s, one per song, out of their sets using Fission.

After 2 years, I met my girlfriend Alice. She had just returned from a Working Holiday in Australia, and her English is excellent. Unfortunately the same cannot be said for her parents. In fact, she is “mixed-race”: her dad is Hakka, and her mum is from Tainan. Therefore she can speak 4 languages! She speaks Hakka to her dad, Taiwanese Minnan to her mum, Mandarin to her friends, and English to me.

In this article, the terms Taiwanese, 台語, Taiyu, Hokkien, Minnan, 閩南語, and Southern Min are interchangeable. There are other Min dialects. Not every person in Taiwan speaks Taiwanese. If you think it’s complicated, ask a British person about the difference between England, Great Britain, and the UK.

What is important to know is that Mandarin Chinese, Taiwanese, and Hakka are not mutually intelligible — even though they use some of the same written characters, people can’t understand the other dialects. They’re as different as Latin-based languages in Europe.

The official language is Mandarin, but a lot of people speak Minnan, Hakka, or aboriginal languages

Having successfully written Pingtype for Chinese, I decided to try the same method for Taiwanese and Hakka. Time to collect data!

There’s a prayer room called ANHOP in Tainan where they sometimes sing worship songs in Taiwanese. I became a fan of the punk rock band 滅火器 Fire EX, made a lyrics video of one of their songs with Pingtype, and put it on YouTube. Through the Team Fire fan club, I also discovered 一步 One Step.

Then I wanted to read the Bible in Taiwanese. There’s an online version from Lingshyang, which I downloaded and parsed as TXT. But some characters were missing. To my surprise, they are displayed as inline JPGs!

The Lingshyang Bible has images for some characters in 2 Chronicles 14:14–15

So I checked the original HTML, and found 52 characters that were shown using images instead of text. Pingtype needs plain text, so I decided to find the correct characters. Some, like 𪜶 (in.jpg), were easy to find. But there were a few that I really couldn’t identify.
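Listing the inline images in a page can be sketched in a few lines of Python. This is only an illustration using the standard library's html.parser, not the method I actually used. The sample markup is made up (apart from the in.jpg file name), and the real pages may be structured differently.

```python
from html.parser import HTMLParser


class InlineImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def unique_image_sources(html_text):
    """Return the distinct <img> sources in first-seen order."""
    collector = InlineImageCollector()
    collector.feed(html_text)
    # dict.fromkeys drops repeats while keeping the original order.
    return list(dict.fromkeys(collector.sources))


# Hypothetical sample markup, for illustration only.
sample = '<p>text<img src="in.jpg">more<img src="x.jpg">and<img src="in.jpg"></p>'
print(unique_image_sources(sample))
```

Running something like this over every chapter gives one list of image file names, and each of those is a character that still needs a plain-text replacement.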

On the 19th of January 2018, I emailed Richard Cook from the Unicode Consortium with a list of 17 characters. He helped me find some more by teaching me how to use IDS codes to search USourceData.txt and ids.txt. Those codes break down Chinese characters into the parts that make them up. For example, 愛 = ⿱⿱爫冖𢖻.
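As a rough sketch, that kind of IDS search could look like this in Python. I'm assuming a tab-separated layout of code point, character, then one or more IDS strings, with comment lines starting with # or ; (check the header of the file you actually have). The function name search_ids and the sample lines are mine, for illustration only.

```python
def search_ids(lines, components):
    """Yield (character, ids) pairs whose IDS contains every given component."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (assumed to start with # or ;).
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        character = fields[1]
        # A character may list several alternative IDS strings.
        for ids in fields[2:]:
            if all(part in ids for part in components):
                yield character, ids
                break


# Sample lines in the assumed layout; the IDS for 愛 is the one quoted above.
sample_lines = [
    "# code point, character, IDS",
    "U+611B\t愛\t⿱⿱爫冖𢖻",
    "U+597D\t好\t⿰女子",
]
print(list(search_ids(sample_lines, ["爫", "冖"])))
```

This only matches components that appear literally in the IDS string. If a part is itself broken down further in the data, you would have to search for its sub-parts instead.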

For the characters that don’t exist yet, he told me that I need to write a formal academic proposal to have them added. My initial thought was “Yeah, right. Isn’t that your job?” but I restrained myself from saying that.

I was (and still am) busy looking for a job. I decided to try to get a personal introduction by contacting people through their personal websites. Grant McLean had made a Unicode Character Finder, so I emailed him and asked if he would proofread my proposal to add these new characters. He agreed, so I wrote the first draft based on the Unicode Power Symbol submission.

Grant never replied to me, so eventually I emailed Richard Cook again on the 23rd of July 2018. I also CC-ed Ken Lunde and John H. Jenkins, the other editors of the Unihan dictionary. They asked me to rewrite the proposal in another format, scan printed evidence of the characters in books, and draw the characters in a font.

How am I supposed to do that? Well, I saw that Andrew West was thanked in the other proposal. A quick search led me to his BabelStone Han page, where he is drawing fonts for many new characters. I emailed him, and he replied very quickly and drew them the next day!

There are many other characters on the BabelStone page, including some that are attributed to the Hakka Chinese Bible. Unfortunately, there aren’t chapter/verse references for when those characters are used. How am I supposed to find them?

When characters can’t be displayed in Unicode, one workaround is to use PUA codes. The Private Use Area is a section of Unicode where people can make their own custom fonts to show special characters. You don’t need to use inline JPGs like Lingshyang! That is how the BabelStone Han font works.

I guessed that the Hakka Bible might use these PUA codes. So I downloaded it from Bible.com, and parsed it as TXT. I combined all the text files together with the UNIX cat command. Then I used TextWrangler’s regex replace function to find (.) and replace it with \1\n, which puts each character on its own line. Finally, I used TextWrangler to remove duplicate lines to get a list of all the unique characters in the file.
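If you don't have TextWrangler, the same list of unique characters can be produced with a short script. This is a minimal Python sketch, not what I ran at the time. The function name and the sample string are made up for illustration.

```python
def unique_characters(text):
    """Return each distinct non-whitespace character once, in first-seen order."""
    # Python strings iterate by code point, so characters outside the
    # Basic Multilingual Plane (such as 𪜶) stay in one piece.
    seen = dict.fromkeys(ch for ch in text if not ch.isspace())
    return list(seen)


sample = "天地 天人\n地"
print(unique_characters(sample))
```

For a real file, read it with open(path, encoding="utf-8").read() and pass the result to unique_characters.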

As I’d hoped, some of the characters were PUA codes! It was then easy to search for those in the Bible text and get the correct chapter/verse references.
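Here is a hedged sketch of that search in Python. It relies on the standard unicodedata module, which reports the general category "Co" for Private Use characters. The (reference, text) input layout and the sample verses are assumptions for illustration, not the format of the actual download.

```python
import unicodedata


def find_private_use(verses):
    """Map each Private Use Area character to the references where it occurs.

    `verses` is an iterable of (reference, text) pairs. That layout is an
    assumption made for this sketch.
    """
    hits = {}
    for reference, text in verses:
        for ch in text:
            if unicodedata.category(ch) == "Co":  # "Co" means private use
                label = f"U+{ord(ch):04X}"
                refs = hits.setdefault(label, [])
                if reference not in refs:
                    refs.append(reference)
    return hits


# Made-up sample data: U+E001 stands in for a character that isn't encoded yet.
sample = [
    ("Verse A", "普通文字"),
    ("Verse B", "文字\ue001文字"),
    ("Verse C", "\ue001 \ue001"),
]
print(find_private_use(sample))
```

Each key in the result is a code point to look up in the font, and each list of references tells you which pages to photograph as evidence.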

For the proposal, I needed scans of a printed Bible. So I got on my bike, went over to church, and took photos of the relevant pages. Then I tweaked the exposure to make it clearer.

An evidence image from my final proposal, taken with an iPhone 5 camera

Finding characters in the hymn book was harder. While browsing, I discovered a table of characters that the Presbyterian Church had been unable to type when they made the 2009 hymnal. But I didn’t know which hymns they were in. The PPT files used in church are images-on-slides, so they can’t be converted into TXT for search or copy-paste. But thankfully I was able to get the hymns in DOC format from Dexian church, so I could search for the romanisations and figure out which hymns used each character.

I don’t have a searchable version of the 2014 Hakka hymnal, although I do have JPG scans. If anybody is interested in helping me OCR that, then please get in touch! I’m sure we’ll find some more characters in there.

When I was about to leave Taiwan, I discovered that there’s a new Taiwanese translation of the New Testament (TTV). On the way to the airport, I asked my girlfriend to pass by the Christian bookstore so I could buy it, and I quickly took photos of the pages that have new characters.

On the 30th of August 2018, Lorna Evans from SIL emailed us to say that she’d found another table of characters. That encouraged me to add the new characters from the TTV Bible, and write a final proposal.

Although the document is a PDF, it has attachments added using Adobe Acrobat Pro. Ken Lunde told me to attach the TTF font, a TXT version of the IDS codes, and the DOC proposal into the final PDF. He caught a few typos, and I finally submitted it on the 7th of September 2018.

I have a number! Document L2/18-290, the Proposal to add 9 Taiwanese and 2 Hakka ideographs to UAX #45, will be discussed at the UTC #157 meeting.

When they are added to UAX #45, they will be submitted to the next IRG Working Set. The Ideographic Rapporteur Group are another committee that define the CJK Unified Ideographs in Unicode. This will probably take about 5 years, and I hope these characters will be part of Extension G or H.

The 11 characters I proposed

After submitting the proposal, I was contacted by Eiso Chen 陈永聪, who is also interested in Hakka and Taiwanese data, as well as other Min dialects such as Teochew and Leizhou, and also Cantonese.

I think that native speakers of these languages face a lot of barriers in making changes. Scraping text from websites requires knowledge of scripting, and many linguists are not programmers. The committee members are native English speakers, and expect the proposals to be written in advanced academic English. I can’t draw fonts, so even I had to ask for help.

Now that it’s been done once, though, I challenge you to do the same! If you have big data, search for PUA codes in it. Look at the BabelStone page to see if you can find printed evidence of those characters. Contact me if you think you’ve found something new, or if you’re interested in helping me transcribe some of my existing scans, such as the Hakka hymnal. We can expand the character set, have our names immortalised in the world standard for text, and help entire communities to communicate digitally for the first time.

Update 2018-09-19: Good news! Ken Lunde just emailed to say: “Per an action assigned to me by the UTC today, I am hereby notifying you that the 11 ideographs that you requested to be added to UAX #45 will be added by John for Unicode Version 12.0. It is not yet clear when they will be submitted to the IRG for encoding, but at least the first hurdle has been cleared.”

I still need a job, and I’d like to find a role where I can continue side projects like this. Some companies write in their contracts that all intellectual property I create belongs to them. I’m not a lawyer, but I think that would stop me from doing this kind of research. Please contact me if you know about a suitable position for me.