Last update July 18, 2009

Unicode Issues

Table of contents of this page
D Libraries   
Links for learning about Unicode   
What are Unicode and UTF   
Newsgroup Threads Involving Unicode   



(adapted from NG:digitalmars.D/11247)

code unit: the technical name for a single primitive fragment of either UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.

code point: the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, only a dchar is wide enough to hold every codepoint.
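The difference between code units and code points can be checked by encoding one string three ways. The following is a minimal sketch in Python (used only for illustration, not D); the helper name code_unit_counts and the sample string are assumptions made for this example. In D terms, the three counts correspond to the lengths of a char[], a wchar[] and a dchar[] holding the same text.

```python
def code_unit_counts(text):
    """Return how many code units text needs in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# Three code points: 'A', the euro sign U+20AC, and U+10000.
sample = "A\u20ac\U00010000"
counts = code_unit_counts(sample)

print(len(sample))  # 3 code points
print(counts)       # {'utf-8': 8, 'utf-16': 4, 'utf-32': 3}
```

Only in UTF-32 does the number of code units equal the number of code points.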

character: officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint. The converse does not quite hold: a codepoint in the range 0 to 0x10FFFF identifies at most one character, but many codepoints are unassigned, and the range 0xD800 to 0xDFFF is reserved for UTF-16 surrogates and never stands for a character on its own. Unicode characters are often written in the form U+#### (for example, U+20AC, which is the character corresponding to codepoint 0x20AC).

As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way.) If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.

grapheme: a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).
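A grapheme in this sense can span several code points. The sketch below, in Python for illustration, groups each base character with the combining characters that follow it. The function name simple_graphemes is an assumption for this example, and the rule it applies (a non-zero canonical combining class marks a combiner) is a deliberate simplification; full Unicode text segmentation has more cases.

```python
import unicodedata


def simple_graphemes(text):
    """Group each base character with the combining characters after it.

    Simplified: any character with a non-zero canonical combining class
    is treated as a combiner and attached to the preceding cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the combiner to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# 'a' followed by U+0301 COMBINING ACUTE ACCENT, then 'b'.
text = "a\u0301b"
clusters = simple_graphemes(text)

print(len(text))      # 3 code points
print(len(clusters))  # 2 graphemes
```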

glyph: one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.

Unicode: a standard that aims to provide a complete listing of all characters (glyphs, printable symbols) in use all over the world in all written languages. Unicode is published as a book (4.0, 1500+ pages, $57, ISBN 0321185781) and on the internet [1]. A Unicode character definition connects a unique number (code U+####), a unique picture of the character, and a unique name for the character.

D Libraries    

Links for learning about Unicode    

(from digitalmars-d/2006-August/007205.html)


What are Unicode and UTF    

(adapted from NG:digitalmars.D/11409)

Unicode and UTF-16 are different kinds of objects. Unicode is a character set; UTF-16 is an encoding. Bear with me - I'll try to make that clearer.

A character set is a set of characters in which each character has a number associated with it, called its "codepoint". For example, in the ASCII character set, the character 'A' has a codepoint of 65 (more usually written in hex, as 0x41). In the Unicode character set, 'A' also has a codepoint of 65, and the character '€' (not present in ASCII) has a codepoint of 8,364 (more normally written in hex as 0x20AC).

Unicode characters are often written as U+ followed by their codepoint in hexadecimal. That is, U+20AC means the same thing as €.
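The U+ notation is just the codepoint printed in hexadecimal, padded to at least four digits. A small Python sketch (the helper name u_plus is an assumption for this example):

```python
def u_plus(ch):
    """Format the codepoint of a single character in U+#### notation."""
    return "U+%04X" % ord(ch)


print(ord("A"), u_plus("A"))            # 65 U+0041
print(ord("\u20ac"), u_plus("\u20ac"))  # 8364 U+20AC
```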

Once upon a time, Unicode was going to be a sixteen-bit wide character set. That is, there were going to be (at most) 65,536 characters in it. Thus, every Unicode string would fit comfortably into an array of 16-bit-wide words.

Then things changed. Unicode grew too big. Suddenly, 65,536 characters weren't going to be enough. But too many important real-life applications had come to rely on characters being 16 bits wide (for example: Java and Windows, to name a couple of biggies). Something had to be done. That something was UTF-16.

UTF-16 is a sneaky way of squeezing >65535 characters into an array originally designed for 16-bit words. Unicode characters with codepoints <0x10000 still occupy only one word; Unicode characters with codepoints >=0x10000 now occupy two words. (A special range of otherwise unused codepoints, the surrogates 0xD800 to 0xDFFF, makes this possible.)
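The two-word trick can be written out as plain arithmetic. The following Python sketch is for illustration only; the function name to_utf16_units is an assumption, and the code follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits across a high and a low surrogate.

```python
def to_utf16_units(code_point):
    """Encode one codepoint as a list of one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("codepoint out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate codepoints cannot be encoded")
    if code_point < 0x10000:
        return [code_point]              # fits in a single word
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x20AC)])    # ['0x20ac']
print([hex(u) for u in to_utf16_units(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_utf16_units(0x10FFFF)])  # ['0xdbff', '0xdfff']
```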

In general, an "encoding" is a bidirectional mapping which maps each codepoint to an array of fixed-width objects called "code units". How wide is a code unit? Well, it depends on the encoding. UTF-8 code units are 8 bits wide; UTF-16 code units are 16 bits wide; and UTF-32 code units are 32 bits wide. So UTF-16 is a mapping from Unicode codepoints to arrays of 16-bit wide units. For example, the codepoint 0x10000 maps (in UTF-16) to the array [ 0xD800, 0xDC00 ].
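Because the mapping is bidirectional, a sequence of code units can always be turned back into codepoints. Here is a minimal Python sketch of the reverse direction for UTF-16; the function name from_utf16_units is an assumption for this example, and malformed input is simply rejected with an exception.

```python
def from_utf16_units(units):
    """Decode a list of 16-bit code units back into a list of codepoints."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            )
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(unit)  # one unit, one codepoint
            i += 1
    return code_points


decoded = from_utf16_units([0x41, 0x20AC, 0xD800, 0xDC00])
print([hex(cp) for cp in decoded])  # ['0x41', '0x20ac', '0x10000']
```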

You can learn all about this in much more detail here:

Newsgroup Threads Involving Unicode    

Date | Subject | Author | Newsgroup reference
14 Jun 2007 | Working with utf | Simen Haugen | NG:digitalmars.D/54521
29 Sep 2006 | Re: First Impressions | Anders F Björklund | NG:digitalmars.D/42479
26 Sep 2006 | The origin of UTF-8 | Georg Wrede | NG:digitalmars.D/42316
31 Jul 2006 | To Walter, about char[] initialization by FF | WalterBright | NG:digitalmars.D/41071
31 Jul 2006 | To Walter, about char[] initialization by FF | WalterBright | NG:digitalmars.D/41058
07 Jul 2006 | convert ANSI to UTF-8 | Geert | NG:digitalmars.D/39663
18 Nov 2005 | the D crowd does Rocket Science | Georg Wrede | NG:digitalmars.D.bugs/5570
25 Mar 2005 | Who wrote libiconv.d? | Nick | NG:digitalmars.D/20151
25 Mar 2005 | Changing to UTF-8 | jicman | NG:digitalmars.D.learn/218
09 Mar 2005 | To wchar or not to wchar? | John C | NG:digitalmars.D/18931
02 Mar 2005 | Re: Solution to the encoding problem (libiconv) | Anders F Björklund | NG:digitalmars.D/17913
02 Mar 2005 | Solution to the encoding problem (locale_v0.1 - link may be dead) | Nick | NG:digitalmars.D/17911
21 Feb 2005 | Error: 4invalid UTF-8 sequence | jicman | NG:digitalmars.D/17096
11 Feb 2005 | Chars and Strs | Anders F Björklund | NG:digitalmars.D/16403
01 Dec 2004 | writef doesn't work on Windows XP console | Roberto Mariottini | NG:digitalmars.D.bugs/2393
28 Nov 2004 | ANNOUNCE: on Linux | John Reimer | NG:digitalmars.D/13103
28 Nov 2004 | Error: invalid UTF-8 sequence | Carotinho | NG:digitalmars.D/13104
28 Nov 2004 | iconv | Ben Hinkle | NG:digitalmars.D/13095
25 Nov 2004 | ICU/unicode bindings for D | Kris | NG:digitalmars.D/13067
23 Nov 2004 | 8-bit character encodings | Anders F Björklund | NG:digitalmars.D/12967
19 Nov 2004 | Character encoding problem | Mathias Bierschenk | NG:digitalmars.D/12787
17 Nov 2004 | Re: switch (dchar[]) | Anders F Björklund | NG:digitalmars.D.bugs/2294
25 Oct 2004 | String theory in D | Glen Perkins | NG:digitalmars.D/12103
22 Oct 2004 | char[] dstring = char* cstring ? | Anders F Björklund | NG:digitalmars.D/12054
30 Sep 2004 | national language support | novice | NG:digitalmars.D/11333
30 Sep 2004 | std.file | novice2 | NG:digitalmars.D/11320
25 Sep 2004 | UTF-8 char[] consistency | Jaap Geurts | NG:digitalmars.D/11061
24 Sep 2004 | char[] vs. ubyte[] | Arcane Jill | NG:digitalmars.D/11001
19 Sep 2004 | Hexadecimal escapes don't encode into UTF-8 | Burton Radons | NG:digitalmars.D.bugs/1874
23 Aug 2004 | ICU (International Components for Unicode) | Arcane Jill | NG:digitalmars.D/9460
23 Aug 2004 | The case for ditching char and wchar (and renaming "dchar" as "char") | Arcane Jill | NG:digitalmars.D/9451
15 Aug 2004 | Transcoding - who's doing what? | Arcane Jill | NG:digitalmars.D/8844
13 Aug 2004 | Only support for UTF-8? | Nick | NG:digitalmars.D.bugs/1365
05 Aug 2004 | string\utf question | Lars Ivar Igesund | NG:digitalmars.D/8277
28 Jul 2004 | UTF-8 to dchar conversion | Arcane Jill | NG:digitalmars.D/7481
28 Jul 2004 | utf.d update | Sean Kelly | NG:digitalmars.D.bugs/1172
28 Jul 2004 | UTF documentation | Andy Friesen | NG:digitalmars.D.bugs/1169
27 Jul 2004 | Unicode Digits (was OT - scanf in Java) | Arcane Jill | NG:digitalmars.D/7318
27 Jul 2004 | non-ascii names and recls | Carlos Santander B. | NG:digitalmars.D.bugs/1160
25 Jul 2004 | Auto-UTF-detection - Feature Request | Arcane Jill | NG:digitalmars.D/7102
13 Jul 2004 | UTF editors fears ?? | Blandger | NG:digitalmars.D/5990
09 Jul 2004 | isValidDchar error | Arcane Jill | NG:digitalmars.D.bugs/762
09 Jul 2004 | UTF-32 bug | Arcane Jill | NG:digitalmars.D.bugs/761
27 Jun 2004 | Unicode library now in Deimos | Arcane Jill | NG:digitalmars.D/4774
23 Jun 2004 | .max (was Re: DMD 0.93 release) | Arcane Jill | NG:digitalmars.D/4462
09 Jun 2004 | Re: DMD 0.92 release (but actually about Unicode) | Arcane Jill | NG:digitalmars.D/3559
08 Jun 2004 | Unichar optimization: performance vs RAM usage. | Hauke Duden | NG:digitalmars.D/3371
05 Jun 2004 | UTF-8 bug | Arcane Jill | NG:digitalmars.D/3113
04 Jun 2004 | The Unicode Casing Algorithms | Arcane Jill | NG:digitalmars.D/2979
14 Apr 2004 | Unicode | Scott Egan | NG:D/27473
11 Mar 2004 | char | Jill.Ramonsky | NG:D/25378
25 Dec 2003 | UTF-8 | ET_yoza | NG:D/20787
19 Dec 2003 | Unicode Discussion | Rupert Millard | NG:D/20619
15 Dec 2003 | Unicode Discussion | Elias Martenson | NG:D/20361
02 Dec 2003 | String with Encoding (Suggestion) | Keisuke UEDA | NG:D/19662
02 Dec 2003 | UNICODE operators (Suggestion) | Mark Brudnak | NG:D/19736
31 Mar 2003 | Unicode Character and String Intrinsics | Mark Evans | NG:D/12382
06 Feb 2003 | Unicode: Japanese Study Program | Andrew Edwards | NG:D/10786
16 Jan 2003 | Unicode in D | globalization guy | NG:D/10001
20 Sep 2001 | Unicode ideas | Eric Gerlach | NG:D/1455

