8. What are KS X 1001(KS C 5601) and other Hangul codes?




This article is from the Hangul & Internet in Korea FAQ, by Jungshik Shin jshin@minerva.cis.yale.edu with numerous contributions by others.

In 1997, the Korean standards body made a rather drastic change to the
naming scheme of its standards for information exchange and processing. What
used to be referred to as KS C 56xx - KS C 59xx were renamed KS X xxxx. The
following list summarizes the change. [Posted by Prof. Kim, Kyongsok at
kskim@asadal.cc.pusan.ac.kr to han.comp.hangul]

o KS C 5601 -> KS X 1001
o KS C 5657 -> KS X 1002 : additional characters for information exchange
o KS C 5636 -> KS X 1003 : Korean version of ISO 646/US-ASCII
o KS C 5620 -> KS X 1004 : ISO/IEC 2022
o KS C 5700 -> KS X 1005-1 : Unicode 2.0/ISO-10646
o KS C 5697 -> KS X 1023 : ISO 2375
o KS C 5861 -> KS X 2901 : Korean Unix environment

The most widely used coded character set (CCS; for the sake of clarity, I
adopt the terms defined in RFC 2130 and RFC 2278) for Korean (Hangul, Hanja
and symbols) is KS X 1001 (formerly KS C 5601), also known as
Wansunghyung. For an English translation of KS C 5601-1987, see
http://www.itscj.ipsj.or.jp/ISO-IR/149.pdf. KS X 1003 (formerly KS C 5636,
a Korean equivalent of US-ASCII/ISO 646) and KS X 1001 are the two coded
character sets of the EUC-KR encoding (Korean EUC; see KS X 2901, formerly
KS C 5861, and RFC 1557), a character set encoding scheme (CES) used on all
three major platforms: Mac OS, Unix, and MS-DOS/MS-Windows. In the mid
1980s, when IBM-compatible PCs were introduced in Korea, a few variants of
the Johab encoding (CES) were in use, and one of them is still used by some
programs under MS-DOS. (Please note that it is all but impossible, or at
least hard, to use on Unix and the Internet because it does not comply with
ISO 2022.) There is also one minor encoding, the N-byte code (the de facto
Unix standard code until the mid 1980s). [Contribution by Choi,Woohyung]
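
As a small illustration of the CCS/CES distinction, here is a minimal Python
sketch. It assumes that Python's built-in euc_kr codec is an adequate
stand-in for EUC-KR as described here; the helper name euckr_row_cell is
made up for this example. KS X 1001 identifies a character by a row and a
cell number, and EUC-KR stores each of the two numbers in one octet with
0xA0 added, while KS X 1003/US-ASCII characters stay single octets.

```python
def euckr_row_cell(ch: str) -> tuple:
    """Return the KS X 1001 (row, cell) of one character via its EUC-KR octets."""
    raw = ch.encode("euc_kr")
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a two-octet KS X 1001 character")
    lead, trail = raw
    # In EUC-KR each octet is the row or cell number with 0xA0 added.
    return lead - 0xA0, trail - 0xA0


if __name__ == "__main__":
    for ch in "가한글":
        print(ch, ch.encode("euc_kr").hex(), euckr_row_cell(ch))
    # A KS X 1003/US-ASCII character remains a single octet below 0x80.
    print("A".encode("euc_kr"))
```

The row and cell numbers belong to the coded character set; adding 0xA0 and
mixing with single-octet ASCII is what the encoding scheme contributes.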

The drawbacks of KS X 1001 include the following: only 2,350 of the 11,172
(19 x 21 x (27+1)) Hangul syllables of modern Korean are included, and its
way of enumerating the 2,350 syllables doesn't reflect the unique
characteristic of Hangul (composing syllables out of 2 to 5 jamos). For
these reasons, a number of people opposed adopting it as the national
standard and insisted that the Johab encoding (by which I mean the
'Sang-yong Johab encoding' as used in MS-DOS), which can encode all 11,172
syllables, be used instead. Taking into account the fact that it's
virtually impossible (or very hard) to use the 'Sang-yong Johab encoding'
on the Internet and Unix, adopting the ISO-2022 compliant KS C 5601 as the
CCS, together with EUC-KR, its most natural encoding alongside US-ASCII/KS
X 1003 (KS C 5636), was a near-best compromise. Moreover, KS X 1001 (KS C
5601-1992, an updated version of KS C 5601-1987) does have a provision for
representing the 8,822 syllables not included in the set of 2,350
precomposed syllables with an 8-byte sequence. In this light, it's NOT the
standard BUT those who didn't implement the standard to the fullest who are
to blame.
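
The arithmetic can be checked with a short Python sketch. It rests on two
assumptions that are not part of this FAQ: that the 11,172 modern syllables
are enumerated as in Unicode 2.0 (U+AC00 to U+D7A3), and that CPython's
euc_kr codec falls back to the 8-byte sequence of KS X 1001 for syllables
outside the precomposed 2,350. The latter is a behavior of that particular
codec, not of every EUC-KR implementation.

```python
from collections import Counter

LEADS, VOWELS, TAILS = 19, 21, 27 + 1  # the +1 stands for "no final consonant"
TOTAL = LEADS * VOWELS * TAILS


def encoded_length_histogram() -> Counter:
    """Count modern Hangul syllables by the length of their euc_kr encoding."""
    hist = Counter()
    for cp in range(0xAC00, 0xAC00 + TOTAL):
        hist[len(chr(cp).encode("euc_kr"))] += 1
    return hist


if __name__ == "__main__":
    print("modern syllables:", TOTAL)
    for length, count in sorted(encoded_length_histogram().items()):
        print(f"{count} syllables take {length} octets")
```

If the codec assumption holds, the two-octet bucket corresponds to the
precomposed syllables and the eight-octet bucket to the remainder.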

KS X 1001 (KS C 5601-1992) lists the Johab encoding in Annex 3, but my
understanding is that it's there only for the sake of reference.

EUC-KR is an 8-bit encoding (CES) of the KS X 1001 (KS C 5601-1987) coded
character set and the KS X 1003 (KS C 5636, the Korean version of
US-ASCII)/US-ASCII coded character set. It is based on the AT&T Extended
Unix Code scheme and is widely used on Unix, MS-DOS, MS-Windows, and Mac.
MS-DOS/Windows and Mac use slightly different encodings with
platform-specific extensions. MS added an ad hoc extension in Korean
MS-Windows 95/98 to represent the additional 8,822 Hangul syllables and
came up with Unified Hangul Code, or CP949 (Windows-949). For the Korean
MacOS extension, see
http://developer.apple.com/techpubs/mac/TextEncodingCMgr/TECRefBook-151.html#HEADING151-0.
Other encodings of KS X 1001 (KS C 5601-1987) and KS X 1003 (KS C 5636)
include ISO-2022-KR (7-bit; the Korean mail exchange standard; see Subject 9
and RFC 1557), 7-bit ISO-2022 (refer to CJK.inf), and ISO-2022-JP-2 (which
deals not only with KS C 5601 but also with Chinese and Japanese character
sets; see RFC 1554 and CJK.inf mentioned below).

For most people, EUC-KR (an encoding/CES) is interchangeable with KS C 5601
(a coded character set/CCS) and US-ASCII/KS C 5636, because in most cases
these character sets are encoded in 8-bit EUC-KR. (The only exceptions,
actually, are the 7-bit ISO-2022-KR encoding/CES used in mail exchange and
Emacs/Mule, which uses another encoding based on the code switching
technique specified in ISO-2022. The X11 Compound Text encoding is similar
to what Emacs/Mule uses.) They MUST nevertheless be distinguished from each
other when working on Internet and national standards. What makes it more
confusing to some people is the use of EUC-KR and ISO-2022-KR as values of
the charset parameter in the MIME Content-Type header. However, this usage
is justified because the definition of charset in MIME is almost identical
to that of CES as defined in RFC 2130, as far as Korean and
Chinese/Japanese encodings (CES) and coded character sets (CCS) are
concerned. Accordingly, using ks_c_5601-1987 (the name of a coded character
set) as the value of the MIME charset parameter, as some Internet
applications do (most notably MS FrontPage 3.0 or later), should be avoided
at all costs.

I'm not an expert on this subject (the distinction between character set
and encoding) by any means, and my explanation is bound to contain
misleading statements and even downright mistakes. I'd be very grateful for
any corrections and comments. A good reference for the terminology
involving codes and character sets is RFC 2130, available at Internic
(ftp.internic.net/rfc) and other national information centers (e.g.
ftp.krnic.net).
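
To make the difference between one coded character set and its several
encodings concrete, here is a minimal Python sketch. It assumes that
Python's euc_kr and iso2022_kr codecs correspond to the EUC-KR and
ISO-2022-KR encodings discussed here; the helper strip_high_bit is made up
for this example.

```python
TEXT = "한글"  # two Hangul syllables that are in KS X 1001

euc = TEXT.encode("euc_kr")      # 8-bit CES
iso = TEXT.encode("iso2022_kr")  # 7-bit CES (RFC 1557)


def strip_high_bit(data: bytes) -> bytes:
    """Clear the top bit of every octet, giving 7-bit KS X 1001 code values."""
    return bytes(b & 0x7F for b in data)


if __name__ == "__main__":
    print("EUC-KR     :", euc.hex(" "))
    print("ISO-2022-KR:", iso.hex(" "))
    # The same 7-bit KS X 1001 values appear inside the ISO-2022-KR stream,
    # wrapped in the escape sequence and shift characters of that encoding.
    print(strip_high_bit(euc) in iso)
    print(all(b < 0x80 for b in iso))
```

Both byte strings carry the same KS X 1001 characters; only the encoding
scheme differs, which is why the MIME charset value should name the
encoding rather than the character set.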

In December 1995, the Korean standards body officially published a new
Korean standard character set, KS C 5700 (renamed KS X 1005-1 in 1998),
which is based on ISO/IEC 10646-1 and Unicode 2.0. KS X 1005-1 and Unicode
2.0 or later differ from ISO 10646-1:1993 in that they contain all of the
precomposed Hangul syllables of modern Korean (11,172) instead of the
subset (6,656) found in ISO 10646-1:1993 and Unicode 1.1. Moreover, KS X
1005-1 (KS C 5700) contains all of the Hangul phonetic letters (240 HANGUL
JAMOs), antique as well as modern, for the 'Ch'ot-ga-kkut' (combinational
Hangul) code, and 94 phonetic letters for compatibility with KS X 1001 (KS
C 5601).
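
The relation between the 11,172 precomposed syllables and the combining
jamos can be sketched in Python as follows. The constants are the code
point values of Unicode 2.0 and later (syllables from U+AC00, leading
consonants from U+1100, vowels from U+1161, final consonants from U+11A8);
the function name decompose is made up for this example.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def decompose(syllable: str) -> str:
    """Split one precomposed syllable into its conjoining jamos."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a modern precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamos = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if tail:  # tail index 0 means "no final consonant"
        jamos += chr(T_BASE + tail)
    return jamos


if __name__ == "__main__":
    for ch in "한글":
        parts = decompose(ch)
        print(ch, [f"U+{ord(j):04X}" for j in parts],
              parts == unicodedata.normalize("NFD", ch))
```

The comparison with unicodedata.normalize is only a cross-check against
Python's own Unicode tables.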

To convert EUC-KR encoded text to and from one of the Unicode encodings
(the Unicode Transformation Formats UTF-8 and UTF-7, and the
"Unicode-native" encoding UCS-2/UTF-16), one can use tcs, a utility made by
the Plan 9 team at Bell Laboratories, or uniconv (included in yudit, a
Unicode editor; see Subject 3). I found uniconv superior to tcs in that it
supports UTF-7 (not supported by tcs) as well as UTF-8 and UCS-2/UTF-16
(big endian and little endian). On the other hand, tcs supports more
national encodings than uniconv. tcs is available at
ftp://plan9.bell-labs.com/plan9/unixsrc/. As of Nov. 1997, tcs doesn't
support Unicode 2.0/KS X 1005-1. To make it comply with Unicode 2.0 (as far
as Korean is concerned), you have to replace ksc.c in the original with
mine, available at http://pantheon.yale.edu/~jshin/faq/ksc.c, before
compiling it. You may also wish to replace ex08.ok (the UTF-8 encoded
version of ex08.src) in the original tcs source with mine at
http://pantheon.yale.edu/~jshin/faq/ex08.ok to prevent regress (the check
script included in tcs) from complaining.

The Unicode archive (at ftp://ftp.unicode.org) has a set of C routines and
mapping tables one can use to build converters between the various Unicode
transformation formats/encodings (UTF-8, UTF-7, UTF-16, etc.) and
ISO-2022-based encodings such as EUC-KR. Mapping tables for CJK are in
/Public/MAPPINGS/EASTASIA and C routines are in /Public/PROGRAMS. You need
to note, however, that KSC5601.TXT in the Unicode ftp archive and on the
Unicode 2.0 CD-ROM is actually a mapping table from UHC/MS Code Page
949/Windows 949 (see below) to Unicode 2.0, not the KS X 1001 (KS C
5601-1992) to Unicode mapping table it claims to be. The correct mapping
table between KS X 1001 (KS C 5601-1992) and Unicode 2.0 is available at
http://pantheon.yale.edu/~jshin/faq/KSX1001.TXT.gz. I also prepared a
mapping table between the JOHAB encoding and Unicode 2.0 at
http://pantheon.yale.edu/~jshin/faq/JOHAB.TXT.gz.
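
For readers without tcs or uniconv, the same kind of conversion can be
sketched with Python's built-in codecs. This is a different tool from the
ones named above, and it relies on Python's own mapping tables rather than
on the tables listed here; the function name convert is made up for this
example.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another via Unicode."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    euc = b"\xc7\xd1\xb1\xdb"  # the word "한글" in EUC-KR
    for target in ("utf-8", "utf-7", "utf-16-be", "utf-16-le"):
        print(target, convert(euc, "euc_kr", target).hex(" "))
    # Going back from UTF-8 to EUC-KR restores the original octets.
    print(convert(convert(euc, "euc_kr", "utf-8"), "utf-8", "euc_kr") == euc)
```

Decoding to Unicode first and then encoding mirrors the table-driven
approach: every conversion goes through the Unicode code point.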

Microsoft Korea came up with its own Hangul encoding, UHC (Unified Hangul
Code: MS Code Page 949, Windows-949), which strips Hangul of its unique
merit as a 'phonetically-combined-writing' system and treats it just like
Chinese characters. Microsoft used it in Hangul Windows 95 and Windows NT
(in the case of Korean Windows NT 4.0, all internal processing is done in
Unicode, but on the surface it uses UHC) despite repeated advice from the
Korean government to adopt ISO-10646. UHC is upward compatible with EUC-KR
(Korean EUC) and assigns the Hangul syllables not covered by KS X 1001 (KS
C 5601-1987) (11,172 - 2,350 = 8,822) to code points in the CR range (in
ISO-2022 terms) and to some empty slots not used by EUC-KR. Unlike in
EUC-KR, the second octet of a two-octet sequence representing a Hangul
syllable may be in the GL range (0x21-0x7e), which makes it harder to tell
characters drawn from KS X 1001 (and Annex 3) from characters belonging to
US-ASCII.
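
The layout described above can be inspected with a short Python sketch. It
assumes that Python's cp949 codec implements UHC and that the modern
syllables are the Unicode range U+AC00 to U+D7A3; the function name
classify_uhc and the dictionary keys are made up for this example.

```python
def classify_uhc() -> dict:
    """Count modern Hangul syllables by where their cp949 octets fall."""
    counts = {"ks_x_1001": 0, "extension": 0, "trail_below_0x80": 0}
    for cp in range(0xAC00, 0xD7A4):
        lead, trail = chr(cp).encode("cp949")
        if lead >= 0xA1 and trail >= 0xA1:
            counts["ks_x_1001"] += 1  # same two octets as in EUC-KR
        else:
            counts["extension"] += 1  # code point that exists only in UHC
            if trail < 0x80:
                counts["trail_below_0x80"] += 1  # octet looks like ASCII
    return counts


if __name__ == "__main__":
    for name, count in classify_uhc().items():
        print(name, count)
```

A parser that treats every octet below 0x80 as US-ASCII would misread the
syllables counted in the last bucket, which is the difficulty noted above.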

For more details on Hangul code, refer to the following documents:

o Unicode and Hangul (at
http://camis.kaist.ac.kr/~jwjung/seminar/hangul-i18n) by Jung, Joowon
o Han Soft home page (the vendor of Hantorie, a Hangul solution for Mac).
o CJK Information page by Ken Lunde (lunde@mv.us.adobe.com) of Adobe. Among
the many documents listed there are cjk.inf, at
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/, which has very
extensive information (although heavily tilted toward Chinese and Japanese
and not up to date about Korean software) on issues arising from the
implementation of Korean, Chinese, and Japanese support, including but not
limited to Hangul code and the coding systems of Chinese and Japanese, and
the CJK character set server at
http://www.ora.com/people/authors/lunde/cjk-char.html
o Another very extensive document concerning Korean as well as Chinese and
Japanese coding system is found at
http://www.ifcss.org/ftp-pub/software/info/cjk-codes/.
o Lee, Sanglo has collected a very extensive set of information about
Hangul code, including many of the pages mentioned on this page and KS X
1001 (KS C 5601) and KS X 1005-1 (KS C 5700) tables, at
http://trade.chonbuk.ac.kr/~leesl/code/. The identical information is
available at http://suny.multi.co.kr/~leesl/code/.
o Prof. Kim, Kyeong-seok of Pusan National Univ. has pages with extensive
information on Hangul code at http://asadal.cs.pusan.ac.kr/hangeul/.
o Roman Czyborra put up an excellent web page on Unicode and character
sets/encodings with a number of fonts,sample documents, tables and many
other useful links at http://czyborra.com/.
o The most technically oriented may want to refer to the following pages:
  o A summary of KS X 1001 (KS C 5601-1987) as submitted to the ISO (in
    English) is available at http://www.itscj.ipsj.or.jp/ISO-IR/2-4.htm
    along with the North Korean (KPS 9566-97), Japanese, and Chinese
    standards. (Erik van der Poel of Netscape posted this info to a
    newsgroup.) Other graphic and control coded character sets can also be
    obtained at http://www.itscj.ipsj.or.jp/ISO-IR/.
  o A number of the original ISO standard documents, including ISO-2022,
    are available (free of charge) in PDF and MS-Word format at
    http://www.ecma.ch. ISO-2022 is referred to as ECMA 35. This precious
    piece of information was passed along to me by Werner Lemberg at
    sx0005@sx2.hrz.uni-dortmund.de.
  o The international standardization subcommittee for coded character
    sets: http://www.dkuug.dk/JTC1/SC2/
  o The Guide to Open System Specification (European Union):
    http://www.ewos.be/tg-cs/gtop.htm.
  o The technical committee for the multilingual and multicultural Europe:
    http://www.stri.is/TC304/default.html
o Lee, Jaekil has made an excellent page regarding Hangul code(and true
type fonts) especially geared for Windows NT/95 at
http://www.seodu.co.kr/~juria/hangul/. It's a must for Windows 95/NT
programmers(and users as well).
o Inside Macintosh has a very brief but very clear explanation for EUC-KR
and other encodings. Online version is at
http://developer.apple.com/techpubs/mac/TextEncodingCMgr/TECRefBook-151.html#HEADING151-0.
o Kosta Kostis' collection of information on Unicode and of translators
among many different character sets and encodings is found at
ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/

Conversion tables among several of the Hangul codes mentioned above are
available at the following locations:

o ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt for
11,172 pre-combined Hangul syllables
o ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt for
5,874 non-hangul characters in KS X 1001(KS C 5601-1992) (4,888 hanja and
986 symbols)

HCODE is a Hangul code conversion program written by June-Yub Lee at
jylee@kitty.cims.nyu.edu. It can deal with ISO-2022-KR encoded text (the de
facto standard for Hangul mail exchange), KS X 1001-Extension, Sambo
(Trigem) Johab, and the Hangul Romanization code agreed upon by both
Koreas. The newest version is hcode 2.1-mailpatch2 (with patches by me to
fix some glitches in the handling of ISO-2022-KR and of B/Q encoded headers
of Hangul mail as specified in RFC 1557), available in
/pub/hangul/code/hcode at the CAIR archive and its mirrors. HCODE is fast,
small, and, most importantly, flexible, so that it's very easy to add a new
code such as one's own Romanization code or Unicode (as adopted in KS X
1005-1). An MS-DOS binary of the newest version of hcode (2.1-mailpatch2)
(hcode21m.exe, compiled with old Turbo C 3.0) was uploaded to
/incoming/hangul at the CAIR archive and /incoming at the HanaBBS archive.
It'll be moved to /hangul/code/hcode at the CAIR archive.

A set of Hangul code converters (Johab, Wansung, and the two coding systems
included in KS X 1005-1) is included in a word processor (for MS-DOS) for
ancient Korean developed at Pusan Nat'l Univ. It's available at
http://asadal.cs.pusan.ac.kr/ohwp. [Posted by Prof. Kim, Kyongsok to the
Hangul Usenet newsgroup, han.comp.hangul]

GNU recode is in the middle of being rewritten to use Unicode (more exactly,
one of its encodings) as the central encoding for converting among a
multitude of coded character sets (CCS)/character set encoding schemes
(CES).

In addition, I wrote a simple-minded code converter between ISO-2022-KR and
EUC-KR (the 8-bit encoding of KS X 1001 + KS X 1003/US-ASCII), hmconv,
which is available in /hangul/code/hmconv at the CAIR archive. It doesn't
have the glitches of hcode mentioned above and works well as a filter for
Hangul mail exchange. See Subject 9 for more on how to use it in Hangul
mail exchange. A binary for MS-DOS (compiled by me with Turbo C 3.0) and a
binary for MS-Windows (compiled by Yi, Yeong-deug; no GUI, but it requires
MS-Windows to run), along with a brief document, were uploaded to
/incoming/hangul at the CAIR archive and will be moved to
/hangul/code/hmconv.
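
What such a filter has to do can be sketched in a few lines of Python. This
is not the source of hmconv or hcode, only an illustration of the
ISO-2022-KR to EUC-KR direction under the rules of RFC 1557 (the designator
ESC $ ) C, SO to switch to KS X 1001, SI to switch back to ASCII); the
function name iso2022kr_to_euckr is made up for this example.

```python
DESIGNATOR = b"\x1b$)C"  # ESC $ ) C announces KS X 1001 (RFC 1557)
SO, SI = 0x0E, 0x0F      # shift out to KS X 1001, shift in to ASCII


def iso2022kr_to_euckr(data: bytes) -> bytes:
    """Convert 7-bit ISO-2022-KR octets to 8-bit EUC-KR octets."""
    out = bytearray()
    shifted = False
    i = 0
    while i < len(data):
        if data.startswith(DESIGNATOR, i):
            i += len(DESIGNATOR)  # EUC-KR needs no designator, so drop it
            continue
        octet = data[i]
        if octet == SO:
            shifted = True
        elif octet == SI:
            shifted = False
        elif shifted and 0x21 <= octet <= 0x7E:
            out.append(octet | 0x80)  # a KS X 1001 octet: set the top bit
        else:
            out.append(octet)  # ASCII, spaces, and line breaks pass through
        i += 1
    return bytes(out)


if __name__ == "__main__":
    mail = b"\x1b$)C\x0eGQ1[\x0f mail\n"
    print(iso2022kr_to_euckr(mail).decode("euc_kr"), end="")
```

The reverse direction needs the opposite bookkeeping: emit the designator
once, then SO before and SI after each run of two-octet characters.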

According to Lee Q-Young at ggangsi@hanmesoft.co.kr, MS-Windows NT users can
convert documents in EUC-KR (the 8-bit encoding of KS X 1001 + KS X
1003/US-ASCII) to KS X 1005-1 (Unicode; I'm not sure which encoding NT
uses: Unicode in network byte order, i.e. the ISO-10646 BMP, or UTF-8) by
loading them into Notepad and choosing "Save in Unicode" when saving them
back under different names.

CHAMEL is a code converter for the IBM PC that can convert files between
Johab and KS codes. Its author is not reachable via the Internet.
[Contribution by Choi,Woohyung]

Ken Lunde (lunde@adobe.com) informed me that North Korea recently published
KPS 9566-97. It's similar to KS C 5601-1987 (in that it conforms to
ISO-2022) and contains 2,679 Hangul syllables and 4,653 Hanja (Chinese
ideograms used in Korea). One funny thing about it is that it sets aside
separate code points for the 6 syllables that make up the names of their
iron-fist dictators.

 
