Last update July 18, 2009

Unicode Issues

Table of contents of this page

Glossary

D Libraries

Links for learning about Unicode

Links

What are Unicode and UTF

Newsgroup Threads Involving Unicode

NOTE:

You should not be here, go here instead.

Glossary

(adapted from NG:digitalmars.D/11247)

code unit: the technical name for a single primitive fragment of either UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.
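As a rough illustration (written in Python rather than D, purely because it is easy to run; the helper name code_unit_counts is made up for this sketch), the following counts how many code units the same text occupies in each of the three UTFs. In D terms, these counts would correspond to the lengths of the equivalent char[], wchar[] and dchar[] arrays.

```python
def code_unit_counts(text):
    """Return how many code units `text` occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units, no BOM
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units, no BOM
    }


# 'A' followed by U+20AC (the euro sign)
print(code_unit_counts("A\u20ac"))     # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 2}
# A single character beyond U+FFFF
print(code_unit_counts("\U00010000"))  # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```

The little-endian codec names are used only so that Python does not prepend a byte order mark, which would inflate the counts.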

code point: the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a codepoint can only be stored in a dchar.

character: officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint. The converse does not quite hold: not every codepoint in the range 0 to 0x10FFFF corresponds to a character, because some codepoints are unassigned and the range 0xD800 to 0xDFFF is reserved for UTF-16 surrogates. Unicode characters are often written in the form U+#### (for example, U+20AC, which is the character corresponding to codepoint 0x20AC).

As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way.) If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.
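To check whether a particular piece of text stays within U+0000 to U+FFFF, something like the following sketch could be used. It is in Python for convenience, and the function names outside_bmp and fits_in_16_bits are invented here, not taken from any library.

```python
def outside_bmp(text):
    """Return the codepoints in `text` that lie above U+FFFF."""
    # Python 3 strings iterate codepoint by codepoint, not code unit by code unit.
    return [ord(ch) for ch in text if ord(ch) > 0xFFFF]


def fits_in_16_bits(text):
    """True if every character of `text` fits in one sixteen-bit value."""
    return not outside_bmp(text)


print(fits_in_16_bits("A\u20ac"))                    # True
print([hex(cp) for cp in outside_bmp("a\U0001d11e")])  # ['0x1d11e']
```

Text for which fits_in_16_bits returns False is the "extra 1%" case that needs full support up to U+10FFFF.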

grapheme: a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).
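The 'a' plus combining-acute-accent example can be made concrete with Python's standard unicodedata module. This is only a sketch: count_simple_graphemes is a hypothetical helper that treats every combining mark as belonging to the preceding base character, which is a simplification of what a full text-segmentation algorithm does.

```python
import unicodedata


def count_simple_graphemes(text):
    """Count base characters, folding each combining mark into the base before it.

    Simplification: only characters with a non-zero canonical combining
    class are treated as combiners.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), count_simple_graphemes(decomposed))  # 2 1
print(composed == "\u00e1")                                  # True
```

Two codepoints make up one grapheme here, and NFC normalisation happens to collapse them into the single precomposed character U+00E1.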

glyph: one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.

Unicode: a standard that aims to provide a complete listing of all characters (glyphs, printable symbols) in use in all written languages around the world. Unicode is published as a book (version 4.0, 1500+ pages, $57, ISBN 0321185781) and on the internet [1]. A Unicode character definition connects a unique number (the code, U+####), a unique picture of the character, and a unique name for the character.

D Libraries

ICU Bindings for D (Mango.icu)
libiconv: http://www.algonet.se/~afb/d/libiconv.d (NG:digitalmars.D/17913)
StringClasses
DsourceProject:deimos has some Unicode modules (in the future it may be moved to Phobos).

Links for learning about Unicode

(from digitalmars-d/2006-August/007205.html)

What are Unicode and UTF

(adapted from NG:digitalmars.D/11409)

Well, they are different kinds of objects. Unicode is a character set; UTF-16 is an encoding. Bear with me - I'll try to make that clearer.

A character set is a set of characters in which each character has a number associated with it, called its "codepoint". For example, in the ASCII character set, the character 'A' has a codepoint of 65 (more usually written in hex, as 0x41). In the Unicode character set, 'A' also has a codepoint of 65, and the character '€' (not present in ASCII) has a codepoint of 8,364 (more normally written in hex as 0x20AC).

Unicode characters are often written as U+ followed by their codepoint in hexadecimal. That is, U+20AC means the same thing as €.
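Converting between a character and its U+ form is a matter of hexadecimal formatting. The sketch below is in Python, and the names u_notation and from_u_notation are made up for illustration.

```python
def u_notation(ch):
    """Format a single character as U+ followed by at least four hex digits."""
    return "U+{:04X}".format(ord(ch))


def from_u_notation(label):
    """Turn a string such as 'U+20AC' back into the character it names."""
    if not label.startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    return chr(int(label[2:], 16))


print(u_notation("A"))                         # U+0041
print(u_notation("\u20ac"))                    # U+20AC
print(from_u_notation("U+20AC") == "\u20ac")   # True
```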

Once upon a time, Unicode was going to be a sixteen-bit wide character set. That is, there were going to be (at most) 65,536 characters in it. Thus, every Unicode string would fit comfortably into an array of 16-bit-wide words.

Then things changed. Unicode grew too big. Suddenly, 65,536 characters wasn't going to be enough. But too many important real-life applications had come to rely on characters being 16-bits wide (for example: Java and Windows, to name a couple of biggies). Something had to be done. That something was UTF-16.

UTF-16 is a sneaky way of squeezing >65535 characters into an array originally designed for 16-bit words. Unicode characters with codepoints <0x10000 still occupy only one word; Unicode characters with codepoints >=0x10000 now occupy two words. (A special range of otherwise unused codepoints makes this possible).
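The arithmetic behind the two-word form is short enough to write out. The special range is 0xD800 to 0xDFFF: the first word of a pair comes from 0xD800 to 0xDBFF and the second from 0xDC00 to 0xDFFF. A minimal sketch in Python follows; utf16_encode is a name chosen for this example, not a library function.

```python
def utf16_encode(codepoint):
    """Return the list of 16-bit words that represents one codepoint in UTF-16."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint out of range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate values are not encodable characters")
    if codepoint < 0x10000:
        return [codepoint]                      # one word, value unchanged
    offset = codepoint - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)              # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)             # bottom 10 bits
    return [high, low]


print([hex(w) for w in utf16_encode(0x20AC)])    # ['0x20ac']
print([hex(w) for w in utf16_encode(0x10000)])   # ['0xd800', '0xdc00']
print([hex(w) for w in utf16_encode(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

Two 10-bit halves give 2^20 extra codepoints, which is why the Unicode range ends at 0x10FFFF.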

In general, an "encoding" is a bidirectional mapping which maps each codepoint to an array of fixed-width objects called "code units". How wide is a code unit? Well, it depends on the encoding. UTF-8 code units are 8 bits wide; UTF-16 code units are 16 bits wide; and UTF-32 code units are 32 bits wide. So UTF-16 is a mapping from Unicode codepoints to arrays of 16-bit wide units. For example, the codepoint 0x10000 maps (in UTF-16) to the array [ 0xD800, 0xDC00 ].
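"Bidirectional" means the code units can always be turned back into the original codepoints. As a sketch of the reverse direction for UTF-16 (in Python; utf16_decode is a hypothetical helper written for this page, and it takes a plain list of integers standing in for a wchar array):

```python
def utf16_decode(units):
    """Map a list of 16-bit code units back to a list of codepoints."""
    codepoints = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # First half of a pair: the next unit must be a second half.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired first-half unit")
            low = units[i + 1]
            codepoints.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired second-half unit")
        else:
            codepoints.append(unit)  # single unit, value unchanged
            i += 1
    return codepoints


print([hex(cp) for cp in utf16_decode([0xD800, 0xDC00])])  # ['0x10000']
print([hex(cp) for cp in utf16_decode([0x41, 0x20AC])])    # ['0x41', '0x20ac']
```

Rejecting unpaired halves is a design choice made for this sketch; other decoders may substitute a replacement character instead.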

You can learn all about this in much more detail here: http://www.unicode.org/faq/utf_bom.html

Newsgroup Threads Involving Unicode

Date	Subject	Author	Post
2007
14 Jun 2007	Working with utf	Simen Haugen	NG:digitalmars.D/54521
2006
29 Sep 2006	Re: First Impressions	Anders F Björklund	NG:digitalmars.D/42479
26 Sep 2006	The origin of UTF-8	Georg Wrede	NG:digitalmars.D/42316
31 Jul 2006	To Walter, about char[] initialization by FF	WalterBright	NG:digitalmars.D/41071
31 Jul 2006	To Walter, about char[] initialization by FF	WalterBright	NG:digitalmars.D/41058
07 Jul 2006	convert ANSI to UTF-8	Geert	NG:digitalmars.D/39663
2005
18 Nov 2005	the D crowd does Rocket Science	Georg Wrede	NG:digitalmars.D.bugs/5570
25 Mar 2005	Who wrote libiconv.d?	Nick	NG:digitalmars.D/20151
25 Mar 2005	Changing to UTF-8	jicman	NG:digitalmars.D.learn/218
09 Mar 2005	To wchar or not to wchar?	John C	NG:digitalmars.D/18931
02 Mar 2005	Re: Solution to the encoding problem (libiconv)	Anders F Björklund	NG:digitalmars.D/17913
02 Mar 2005	Solution to the encoding problem (locale_v0.1 - link may be dead)	Nick	NG:digitalmars.D/17911
21 Feb 2005	Error: 4invalid UTF-8 sequence	jicman	NG:digitalmars.D/17096
11 Feb 2005	Chars and Strs	Anders F Björklund	NG:digitalmars.D/16403
2004
01 Dec 2004	writef doesn't work on Windows XP console	Roberto Mariottini	NG:digitalmars.D.bugs/2393
28 Nov 2004	ANNOUNCE: Mango.icu on Linux	John Reimer	NG:digitalmars.D/13103
28 Nov 2004	Error: invalid UTF-8 sequence	Carotinho	NG:digitalmars.D/13104
28 Nov 2004	iconv	Ben Hinkle	NG:digitalmars.D/13095
25 Nov 2004	ICU/unicode bindings for D	Kris	NG:digitalmars.D/13067
23 Nov 2004	8-bit character encodings	Anders F Björklund	NG:digitalmars.D/12967
19 Nov 2004	Character encoding problem	Mathias Bierschenk	NG:digitalmars.D/12787
17 Nov 2004	Re: switch (dchar[])	Anders F Björklund	NG:digitalmars.D.bugs/2294
25 Oct 2004	String theory in D	Glen Perkins	NG:digitalmars.D/12103
22 Oct 2004	char[] dstring = char* cstring ?	Anders F Björklund	NG:digitalmars.D/12054
30 Sep 2004	national language support	novice	NG:digitalmars.D/11333
30 Sep 2004	std.file	novice2	NG:digitalmars.D/11320
25 Sep 2004	UTF-8 char[] consistency	Jaap Geurts	NG:digitalmars.D/11061
24 Sep 2004	char[] vs. ubyte[]	Arcane Jill	NG:digitalmars.D/11001
19 Sep 2004	Hexadecimal escapes don't encode into UTF-8	Burton Radons	NG:digitalmars.D.bugs/1874
23 Aug 2004	ICU (International Components for Unicode)	Arcane Jill	NG:digitalmars.D/9460
23 Aug 2004	The case for ditching char and wchar (and renaming "dchar" as "char")	Arcane Jill	NG:digitalmars.D/9451
15 Aug 2004	Transcoding - who's doing what?	Arcane Jill	NG:digitalmars.D/8844
13 Aug 2004	Only support for UTF-8?	Nick	NG:digitalmars.D.bugs/1365
05 Aug 2004	string\utf question	Lars Ivar Igesund	NG:digitalmars.D/8277
28 Jul 2004	UTF-8 to dchar conversion	Arcane Jill	NG:digitalmars.D/7481
28 Jul 2004	utf.d update	Sean Kelly	NG:digitalmars.D.bugs/1172
28 Jul 2004	UTF documentation	Andy Friesen	NG:digitalmars.D.bugs/1169
27 Jul 2004	Unicode Digits (was OT - scanf in Java)	Arcane Jill	NG:digitalmars.D/7318
27 Jul 2004	non-ascii names and recls	Carlos Santander B.	NG:digitalmars.D.bugs/1160
25 Jul 2004	Auto-UTF-detection - Feature Request	Arcane Jill	NG:digitalmars.D/7102
13 Jul 2004	UTF editors fears ??	Blandger	NG:digitalmars.D/5990
09 Jul 2004	isValidDchar error	Arcane Jill	NG:digitalmars.D.bugs/762
09 Jul 2004	UTF-32 bug	Arcane Jill	NG:digitalmars.D.bugs/761
27 Jun 2004	Unicode library now in Deimos	Arcane Jill	NG:digitalmars.D/4774
23 Jun 2004	.max (was Re: DMD 0.93 release)	Arcane Jill	NG:digitalmars.D/4462
09 Jun 2004	Re: DMD 0.92 release (but actually about Unicode)	Arcane Jill	NG:digitalmars.D/3559
08 Jun 2004	Unichar optimization: performance vs RAM usage.	Hauke Duden	NG:digitalmars.D/3371
05 Jun 2004	UTF-8 bug	Arcane Jill	NG:digitalmars.D/3113
04 Jun 2004	The Unicode Casing Algorithms	Arcane Jill	NG:digitalmars.D/2979
14 Apr 2004	Unicode	Scott Egan	NG:D/27473
11 Mar 2004	char	Jill.Ramonsky	NG:D/25378
2003
25 Dec 2003	UTF-8	ET_yoza	NG:D/20787
19 Dec 2003	Unicode Discussion	Rupert Millard	NG:D/20619
15 Dec 2003	Unicode Discussion	Elias Martenson	NG:D/20361
02 Dec 2003	String with Encoding (Suggestion)	Keisuke UEDA	NG:D/19662
02 Dec 2003	UNICODE operators (Suggestion)	Mark Brudnak	NG:D/19736
31 Mar 2003	Unicode Character and String Intrinsics	Mark Evans	NG:D/12382
06 Feb 2003	Unicode: Japanese Study Program	Andrew Edwards	NG:D/10786
16 Jan 2003	Unicode in D	globalization guy	NG:D/10001
2001
20 Sep 2001	Unicode ideas	Eric Gerlach	NG:D/1455