Sunday, 20 October 2013

What's new in Unicode 7.0?



[Update: Unicode 7.0 was released on 16 June 2014]


The two previous releases of Unicode (6.2 and 6.3) have been rather disappointing with regard to the number of new characters introduced into the standard (one in 6.2 and five in 6.3), so Unicode 7.0 should be much more exciting to those of us who think that 110,000 characters in Unicode are not nearly enough. In summary, 2,834* new characters are going to be added in Unicode 7.0 when it is released in the summer of 2014 (official beta information page for Unicode 7.0.0). Of these, 1,849 characters belong to 23 newly added scripts, which is a greater number of new scripts than for any previous version since Unicode 1.0 (which started life with 24 scripts).

* When I wrote this blog post there were going to be 2,833 new characters, but since then the newly invented Ruble sign has been fast-tracked for encoding in Unicode 7.0 at U+20BD.


23 new scripts in Unicode 7.0


Although all of these new scripts are either historical or have limited modern usage, and most people will be unfamiliar with most of them, there are several important additions, notably Grantha and Siddham, as well as Linear A, which may be the first undeciphered writing system to be encoded in Unicode (depending upon whether the symbols on the Phaistos Disc, encoded in Unicode 5.1, represent writing or not).

Apart from the new scripts, the highlight of Unicode 7.0 for most people on the internet will be the addition of 643 wingdings, webdings and other pictographic symbols, which will supplement the emoticons, emoji and many other symbols added to Unicode 6.0. I predict that characters such as "Reversed Hand with Middle Finger Extended", "Reversed Victory Hand" (British equivalent of the finger), and "Raised Hand with Part Between Middle and Ring Fingers" (live long and prosper) will become even more popular on Twitter than the infamous "Pile of Poo" 💩 character*.

* Pile of Poo was encoded in the Unicode standard for compatibility with Japanese telecoms companies (KDDI & Softbank) which included it as part of the Emoji repertoire on their cell phones (see the original Emoji proposal where the character is provisionally named "Dung", later changed to "Pile of Poo" at the suggestion of Michael Everson).


FDAM2 code chart images of characters 1F594 through 1F596


However, the character that seems to be causing the most stir amongst the twitterati is U+1F574 "MAN IN BUSINESS SUIT LEVITATING". People are asking why Unicode has seen fit to encode this particular character. The answer is that in 2011 my good friend Michel Suignard (project editor of ISO/IEC 10646) proposed to encode the set of symbols used in the widely-used Wingdings and Webdings fonts that were not already in Unicode or unifiable with an existing character. The Webdings font that ships with Microsoft Windows includes a glyph for a man in a business suit apparently levitating at U+F06D (also accessible as the letter "m" set in the Webdings font, unless you are using Firefox), and it is being encoded in Unicode 7.0 simply because the glyph is in the Webdings font and it is not unifiable with any existing Unicode character. So if you still want to know why Unicode 7.0 will include a character for MAN IN BUSINESS SUIT LEVITATING you had better ask Vincent Connare et al. why they included the glyph in Webdings in 1997 in the first place.* **

* According to Microsoft's Webdings page: Our team of iconographers traveled the world asking site designers and users which symbols, icons and pictograms they thought would be most appropriate for a font of this kind. From thousands of suggestions we had to pick just two hundred and thirty for inclusion in Webdings.

** According to Jen Sorenson, in this blog post from 2009, the Man in Business Suit Levitating glyph in the Webdings font was intended to be an exclamation mark in the style of the rude boy logo found on records by The Specials published under the 2 Tone Records label. So perhaps the Unicode character would have been better named Rude Boy Exclamation Mark. Thanks to Ted Mielczarek for pointing this out to me.
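
For readers who want to check the names quoted above for themselves, here is a minimal sketch using Python's standard unicodedata module. It assumes a Python build whose bundled Unicode database is at version 7.0 or later (any recent Python 3 should qualify); the helper name character_names is my own and not part of any library.

```python
import unicodedata


def character_names(code_points):
    """Map each code point to its Unicode name, or None if it is unassigned
    in the Unicode database bundled with this Python build."""
    return {cp: unicodedata.name(chr(cp), None) for cp in code_points}


# Ruble sign, the levitating man, and the three hand symbols discussed above.
names = character_names([0x20BD, 0x1F574, 0x1F594, 0x1F595, 0x1F596])

print("Unicode database version:", unicodedata.unidata_version)
for cp, name in names.items():
    print(f"U+{cp:04X}  {name or '<not in this Unicode version>'}")
```

On an older build the affected lines simply show the placeholder text rather than raising an error, which makes it easy to tell whether the installed database predates 7.0.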


BabelMap showing Webdings character F06D



Unicode and ISO/IEC 10646

Many people seem to think that characters are randomly added to the Unicode standard on a whim, and I can understand why it sometimes seems like that to an outside observer, but in fact the process of adding characters is far from simple. The Unicode standard is synchronized with the international standard ISO/IEC 10646 ("Information technology -- Universal Multiple-Octet Coded Character Set (UCS)"), and the contents of each version of the Unicode standard are largely determined by the committee work and balloting process for ISO/IEC 10646 by national standardization organizations (such as ANSI, BSI and DIN). The Unicode Consortium nevertheless plays a very important role in this process, as it is represented on the committee responsible for ISO/IEC 10646 directly as a liaison member and indirectly via the US national body (for more information on the relationship between the Unicode and ISO/IEC 10646 standards, see my blog post on Unicode and ISO/IEC 10646).

Unicode 6.1, released in January 2012, corresponds to ISO/IEC 10646:2012, which was published in June 2012 (freely available from the ISO web site as a set of PDF files and a set of electronic inserts). Amendment 1 to ISO/IEC 10646:2012 was published earlier this year, and one character only from Amd.1 (the Turkish Lira Sign) was added to the Unicode standard in version 6.2 released in September 2012. Amendment 2 to ISO/IEC 10646:2012 is currently in its final stage of balloting, and will be published late this year or early next year. Five characters only from Amd.2 (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate) were added to the Unicode standard in version 6.3 released at the end of September 2013. The repertoire of Unicode 7.0 will correspond to ISO/IEC 10646:2012 plus Amendments 1 and 2, and so the new characters encoded in 7.0 will correspond to those added to Amendment 1 (1,769 characters) and Amendment 2 (1,070 characters), minus the six characters already added in 6.2 and 6.3 (1,769 + 1,070 - 6 = 2,833 new characters in Unicode 7.0).
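
The arithmetic in the paragraph above is easy to check. The short Python sketch below uses only figures quoted in this post (the sub-totals come from the headings of the tables further down), so it is a sanity check on my own sums rather than anything derived from the Unicode data files; the variable names are purely illustrative.

```python
# Character counts as quoted in this post (not read from any Unicode data file).
amd1_existing_blocks = 339   # Amendment 1: additions to existing blocks
amd1_new_blocks = 1430       # Amendment 1: new blocks
amd2_existing_blocks = 248   # Amendment 2: additions to existing blocks
amd2_new_blocks = 822        # Amendment 2: new blocks

amd1_total = amd1_existing_blocks + amd1_new_blocks
amd2_total = amd2_existing_blocks + amd2_new_blocks

# One character went into Unicode 6.2 and five into 6.3 ahead of 7.0.
already_encoded = 1 + 5

new_in_7_0 = amd1_total + amd2_total - already_encoded
with_ruble_sign = new_in_7_0 + 1  # the fast-tracked Ruble sign at U+20BD

print(amd1_total, amd2_total, new_in_7_0, with_ruble_sign)
# prints: 1769 1070 2833 2834
```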



Amendment 1

Amendment 1 ("Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and other characters") has already been published, so no changes to character allocations or character names in Unicode can be made. This amendment includes 1,769 new characters, as detailed in the tables below. You can download code charts covering the new characters from here or here.


Additions to Existing Blocks (339 characters)
Block Characters Documents
Greek and Coptic
[0370..03FF]
037F: Capital letter yot N3997
Armenian
[0530..058F]
058D..058E: 2 Armenian eternity signs N3923
Arabic
[0600..06FF]
0605: Mark used with Coptic numbers N3843
N3990
Arabic Extended-A
[08A0..08FF]
08A1: 1 letter used for Fulfulde N3882
N3988
08AD..08B1: 5 letters used for Bashkir, Belarusian, Crimean Tatar, and Tatar languages N4072
08FF: 1 letter used for Palula and Shina N4072
Devanagari
[0900..097F]
0978: 1 letter used for Marwari N3970
Telugu
[0C00..0C7F]
0C00: Candrabindu N3964
Kannada
[0C80..0CFF]
0C81: Candrabindu N3964
Malayalam
[0D00..0D7F]
0D01: Candrabindu N3964
Sinhala
[0D80..0DFF]
0DE6..0DEF: 10 digits for astrological use N3888
Limbu
[1900..194F]
191D..191E: 2 consonant conjuncts N3975
Combining Diacritical Marks Supplement
[1DC0..1DFF]
1DE7..1DF4: 14 combining letters used for Teuthonista phonetic transcription N4081
N4106
Currency Symbols
[20A0..20CF]
20BA: Turkish Lira sign (Unicode 6.2) N4273
Miscellaneous Technical
[2300..23FF]
23F4..23FA: 7 wingdings and webdings symbols N4022
N4115
Dingbats
[2700..27BF]
2700: 1 Wingdings and Webdings symbol N4022
N4115
Miscellaneous Symbols and Arrows
[2B00..2BFF]
2B4D..2B4F, 2B5A..2B73, 2B76..2B95, 2B98..2BB9, 2BBD..2BC8, 2BCA..2BD1: 115 wingdings and webdings symbols N4022
N4115
Supplemental Punctuation
[2E00..2E7F]
2E3C: Stenographic full stop N3895
2E3D..2E3E: 2 marks for Lithuanian dialectology N4070
2E3F: Capitulum N4022
2E40: Double hyphen N3983
2E41..2E42: 2 marks for Old Hungarian N3664
Cyrillic Extended-B
[A640..A69F]
A698..A69B: 4 early Cyrillic letters N3974
A69C..A69D: 2 modifier letters used for Lithuanian dialectology N4070
Latin Extended-D
[A720..A7FF]
A794..A795: 2 letters used for Lithuanian dialectology N4070
A798..A79F: 8 letters used for Teuthonista phonetic transcription N4081
N4106
Combining Half Marks
[FE20..FE2F]
FE27..FE2D: 7 combining half marks N4078
Old Italic
[10300..1032F]
1031F: 1 letter used in a South Picene inscription N4046
Enclosed Alphanumeric Supplement
[1F100..1F1FF]
1F10B..1F10C: 2 wingdings and webdings symbols N4022
N4115
Miscellaneous Symbols and Pictographs
[1F300..1F5FF]
1F321..1F32C, 1F336, 1F394..1F395, 1F397, 1F39C..1F39D, 1F3F1..1F3F6, 1F441, 1F53E..1F53F, 1F544..1F54A, 1F568..1F56A, 1F56D..1F56F, 1F571, 1F573, 1F577..1F578, 1F57B, 1F57D..1F57F, 1F582..1F587, 1F589..1F593, 1F597..1F5A3, 1F5A5..1F5BB, 1F5BF..1F5C1, 1F5C4..1F5D1, 1F5D4..1F5DB, 1F5F4..1F5FA: 133 wingdings and webdings symbols N4022
N4115
N4239
Emoticons
[1F600..1F64F]
1F641..1F642: 2 wingdings and webdings symbols N4022
N4115
Transport and Map Symbols
[1F680..1F6FF]
1F6C6..1F6CA, 1F6E0: 6 wingdings and webdings symbols N4022
N4115

Linear A tablet at the Chania Archaeological Museum

{CC BY-SA 3.0 by Ursus}


New Blocks (1,430 characters)
Block Characters Documents
Combining Diacritical Marks Extended
[1AB0..1AFF]
1AB0..1ABE: 15 marks for Teuthonista phonetic transcription N4081
N4106
Myanmar Extended-B
[A9E0..A9FF]
A9E0..A9E6: 7 letters used for Shan Pali N3906
Latin Extended-E
[AB30..ABBF]
AB30..AB5F: 48 letters used for Teuthonista phonetic transcription N4081
N4106
Coptic Epact Numbers
[102E0..102FF]
102E0..102FB: 28 numbers used in Coptic-Arabic manuscripts N3843
N3990
Elbasan
[10500..1052F]
10500..10527: 40 letters used for the Elbasan script N3985
Linear A
[10600..107FF]
10600..10736, 10740..10755, 10760..10767: 341 Linear A signs N3973
Palmyrene
[10860..1087F]
10860..1087F: 32 letters used for the Palmyrene script N3867
Nabataean
[10880..108AF]
10880..1089E, 108A7..108AF: 40 letters and numbers used for the Nabataean script N3969
Old North Arabian
[10A80..10A9F]
10A80..10A9F: 32 letters and numbers used for the Old North Arabian script N3937
Manichaean
[10AC0..10AFF]
10AC0..10AE6, 10AEB..10AF6: 51 letters, numbers and punctuation marks used for the Manichaean script N4029
Sinhala Archaic Numbers
[111E0..111FF]
111E1..111F4: 20 archaic numbers N3876
N3888
Khojki
[11200..1124F]
11200..11211, 11213..1123D: 61 letters, signs and punctuation marks used for the Khojki script N3978
Khudawadi
[112B0..112FF]
112B0..112EA, 112F0..112F9: 69 letters, signs and numbers used for the Khudawadi script N3979
Tirhuta
[11480..114DF]
11480..114C7, 114D0..114D9: 82 letters, signs and numbers used for the Tirhuta script N4035
Pau Cin Hau
[11AC0..11AFF]
11AC0..11AF8: 57 letters and other characters used for the Pau Cin Hau script N4017
Mro
[16A40..16A6F]
16A40..16A5E, 16A60..16A69, 16A6E..16A6F: 43 letters, numbers and punctuation marks used for the Mro script N3589
Bassa Vah
[16AD0..16AFF]
16AD0..16AED, 16AF0..16AF5: 36 letters and other characters used for the Bassa Vah script N3941
Duployan
[1BC00..1BC9F]
1BC00..1BC6A, 1BC70..1BC7C, 1BC80..1BC88, 1BC90..1BC99, 1BC9C..1BC9F: 143 letters and other characters for Duployan shorthand N3895
Shorthand Format Controls
[1BCA0..1BCAF]
1BCA0..1BCA3: 4 shorthand format characters N3895
Ornamental Dingbats
[1F650..1F67F]
1F650..1F67F: 48 wingdings and webdings symbols N4022
N4115
Geometric Shapes Extended
[1F780..1F7FF]
1F780..1F7D4: 85 wingdings and webdings symbols N4022
N4115
Supplemental Arrows-C
[1F800..1F8FF]
1F800..1F80B, 1F810..1F847, 1F850..1F859, 1F860..1F887, 1F890..1F8AD: 148 wingdings and webdings symbols N4022
N4115
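
The range lists in these tables use the notation "first..last" with hexadecimal code points, separated by commas. As a rough sketch (the function name count_code_points is my own invention), the following Python snippet counts how many code points such a list covers, which is a convenient way to cross-check the character counts given in each row.

```python
def count_code_points(ranges: str) -> int:
    """Count the code points covered by a list such as
    '10600..10736, 10740..10755' (hexadecimal, inclusive ranges)."""
    total = 0
    for part in ranges.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("..")
        start = int(first.strip(), 16)
        # A single code point such as '1F6E0' has no '..' and counts as one.
        end = int(last.strip(), 16) if last else start
        if end < start:
            raise ValueError(f"range runs backwards: {part}")
        total += end - start + 1
    return total


# Linear A row from the table above.
print(count_code_points("10600..10736, 10740..10755, 10760..10767"))  # prints: 341
# Transport and Map Symbols row from the Amendment 1 table.
print(count_code_points("1F6C6..1F6CA, 1F6E0"))  # prints: 6
```

A range whose end precedes its start raises an error instead of being silently counted, which is handy for catching a mistyped code point.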


Amendment 2

Amendment 2 ("Caucasian Albanian, Psalter Pahlavi, Mahajani, Grantha, Modi, Pahawh Hmong, Mende Kikakui, and other characters") is currently undergoing its final round of balloting, but at this stage no changes to character allocations or character names in Unicode can be made. This amendment includes 1,070 new characters, as detailed in the tables below. You can download code charts covering the new characters from here or here.


Medieval Celtic stone inscribed SABIN{I} FIL{I} MACCODECHET{I}

{CC BY-SA 3.0 by BabelStone}


Additions to Existing Blocks (248 characters)
Block Characters Documents
Cyrillic Supplement
[0500..052F]
0528..0529: 2 letters used for Orok N4137
052A..052D: 4 letters used for Ossetian and Komi N4199
052E..052F: 2 letters used for Northern Khanty, Eastern Khanty and Forest Nenets N4219
Arabic
[0600..06FF]
061C: Arabic letter mark (Unicode 6.3) N4180
Arabic Extended-A
[08A0..08FF]
08B2: 1 letter for Berber N4271
Bengali
[0980..09FF]
0980: Anji sign N4157
Telugu
[0C00..0C7F]
0C34: Letter llla N4214
Runic
[16A0..16FF]
16F1..16F3: 3 letters used by J. R. R. Tolkien
16F4..16F8: 5 letters used on the Franks Casket
N4013
Vedic Extensions
[1CD0..1CFF]
1CF8..1CF9: 2 svara markers for the Jaiminiya Sama Veda Archika N4134
Combining Diacritical Marks Supplement
[1DC0..1DFF]
1DF5: 1 character used in American lexicography N4279
General Punctuation
[2000..206F]
2066..2069: 4 bidirectional format characters (Unicode 6.3) N4279
Currency Symbols
[20A0..20CF]
20BB: Nordic mark sign N4308
N4377
20BC: Azerbaijani Manat sign N4168
Latin Extended-D
[A720..A7FF]
A796..A797: 2 letters used for Middle Vietnamese
A7AB..A7AC: 2 letters required for casing
A7F7: 1 letter used in Celtic inscriptions
N4030
A7B0..A7B1: 2 letters used in Americanist orthographies N4297
A7AD: 1 letter used for Alabama N4228
Myanmar Extended-B
[A9E0..A9FF]
A9E7..A9FE: 24 letters and numbers used for Tai Laing N3976
Myanmar Extended-A
[AA60..AA7F]
AA7C..AA7D: 2 signs used for Tai Laing
AA7E..AA7F: 2 letters used for Shwe Palaung
N3976
Latin Extended-E
[AB30..ABBF]
AB64..AB65: 2 letters used for phonetic transcription N4307
Ancient Greek Numbers
[10140..1018F]
1018B..1018C, 101A0: 3 papyrological characters N4194
Brahmi
[11000..1107F]
1107F: Number joiner N4166
Sharada
[11180..111DF]
111CD: Sutra mark N4269
111DA: Ekam sign N4158
Cuneiform
[12000..123FF]
1236F..12398, 12463..1246E, 12474: 55 signs and numeric signs N4277
Playing Cards
[1F0A0..1F0FF]
1F0BF, 1F0E0..1F0F5: 23 playing card symbols N4089
Miscellaneous Symbols and Pictographs
[1F300..1F5FF]
1F37D, 1F396, 1F398..1F39B, 1F39E..1F39F, 1F3C5, 1F3CB..1F3CE, 1F3D4..1F3DF, 1F3F7, 1F43F, 1F4F8, 1F4FD..1F4FE, 1F56B..1F56C, 1F570, 1F572, 1F574..1F576, 1F579, 1F57C, 1F580..1F581, 1F588, 1F594..1F596, 1F5BC..1F5BE, 1F5C2..1F5C3, 1F5D2..1F5D3, 1F5DC..1F5F3: 76 wingdings and webdings symbols N4022
N4115
N4239
N4306
Transport and Map Symbols
[1F680..1F6FF]
1F6CB..1F6CF, 1F6E1..1F6EC, 1F6F0..1F6F3: 21 wingdings and webdings symbols N4022
N4115

Sanskrit Dhāraṇī in Chinese and Siddham scripts from Yarkhoto

IDP: Berlin-Brandenburgische Akademie der Wissenschaften: SHT 7175


New Blocks (822 characters)
Block Characters Documents
Old Permic
[10350..1037F]
10350..1037A: 43 letters used for the Old Permic script N4263
Caucasian Albanian
[10530..1056F]
10530..10563, 1056F: 53 letters and marks used for the Caucasian Albanian script N4131
Psalter Pahlavi
[10B80..10BAF]
10B80..10B91, 10B99..10B9C, 10BA9..10BAF: 29 letters, marks and numbers used for the Psalter Pahlavi script N4040
Mahajani
[11150..1117F]
11150..11176: 39 letters and signs used for the Mahajani script N4126
Grantha
[11300..1137F]
11301..11303, 11305..1130C, 1130F..11310, 11313..11328, 1132A..11330, 11332..11333, 11335..11339, 1133C..11344, 11347..11348, 1134B..1134D, 11357, 1135D..11363, 11366..1136C, 11370..11374: 83 letters, numbers and signs used for the Grantha script N4135
N4136
Siddham
[11580..115FF]
11580..115B5, 115B8..115C9: 72 letters, signs and marks used for the Siddham script N4294
Modi
[11600..1165F]
11600..11644, 11650..11659: 79 letters, signs and numbers used for the Modi script N4034
Warang Citi
[118A0..118FF]
118A0..118F2, 118FF: 84 letters and numbers used for the Warang Citi script N4259
Pahawh Hmong
[16B00..16B8F]
16B00..16B45, 16B50..16B59, 16B5B..16B61, 16B63..16B77, 16B7D..16B8F: 127 letters and signs used for the Pahawh Hmong script N4175
N4377
Mende Kikakui
[1E800..1E8DF]
1E800..1E8C4, 1E8C7..1E8D6: 213 syllables and numbers used for the Mende Kikakui script N4167
N4311
N4377
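
The figure of 1,849 characters in 23 new scripts quoted at the top of this post can be reproduced by tallying the per-script counts from the New Blocks tables of both amendments. The sketch below simply transcribes those counts into a Python dictionary (my own transcription, not data from the Unicode Character Database); blocks that do not introduce a new script, such as Shorthand Format Controls or Ornamental Dingbats, are deliberately left out.

```python
# Character counts per new script, transcribed from the tables in this post.
new_scripts = {
    # Amendment 1
    "Elbasan": 40, "Linear A": 341, "Palmyrene": 32, "Nabataean": 40,
    "Old North Arabian": 32, "Manichaean": 51, "Khojki": 61,
    "Khudawadi": 69, "Tirhuta": 82, "Pau Cin Hau": 57, "Mro": 43,
    "Bassa Vah": 36, "Duployan": 143,
    # Amendment 2
    "Old Permic": 43, "Caucasian Albanian": 53, "Psalter Pahlavi": 29,
    "Mahajani": 39, "Grantha": 83, "Siddham": 72, "Modi": 79,
    "Warang Citi": 84, "Pahawh Hmong": 127, "Mende Kikakui": 213,
}

script_count = len(new_scripts)
character_count = sum(new_scripts.values())

print(script_count, character_count)  # prints: 23 1849
```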


On beyond 7.0

A new (4th) edition of ISO/IEC 10646 will be published next year, and Amendment 1 to this new edition is already in progress. ISO/IEC 10646:2014 (draft code charts) will include Hatran, Old Hungarian (assuming that the Hungarian national body's ballot response is positive), Sharada, Multani, Ahom, Early Dynastic Cuneiform, Anatolian Hieroglyphs, and Sutton Signwriting, as well as 5,762 Han ideographs in a new CJK-E block. Amendment 1 (draft code charts) currently adds Nüshu (Nushu) and Tamil supplement, but more scripts may be added to it as it progresses. The character repertoire, code point allocations, and character names are not yet fixed, and the draft code charts linked to above should be treated with caution.

For the first time, in what I think is a very good move, the Unicode Consortium has publicized the ISO ballots in advance of announcing a beta version of Unicode (at which point it is too late to make changes to character allocation and character names), and requested feedback from the public on the proposed repertoires. See PRI #256 for ISO/IEC 10646:2014 and PRI #255 for ISO/IEC 10646:2014 Amd.1. New scripts and characters added to ISO/IEC 10646:2014 and its amendments will feed into Unicode 7.1 and 7.2 (these are probable version numbers, but are currently unconfirmed) during the next two or three years.

For those of you who have been following the yo-yoing progress of the middle dot letter used for Sinological transcription and 'Phags-pa transliteration (originally proposed for encoding by myself in January 2009, and subsequently put on and then taken off virtually every ballot since then), an agreement was finally reached at the last WG2 meeting in Vilnius during the summer of this year to encode the character at U+A78F under the compromise name of LATIN LETTER SINOLOGICAL DOT, and I hope to see it encoded in the version of Unicode corresponding to ISO/IEC 10646:2014 Amd.1 (it's not currently on Amd.1, but maybe it will get added there).

Tangut is a major historic script that I know that many people want to see encoded in Unicode, and as the main author of a series of proposals to encode Tangut characters and Tangut components I am at the top of this list. However, although the first proposal to encode Tangut characters (by Richard Cook) was made in 2008, it has proved very hard to reach an agreement on character repertoire, and Tangut encoding has floundered. A conference on encoding Tangut, supported by a grant from the Henry Luce Foundation, will be held in Beijing in December of this year (I will be there), and if all goes well it is possible that Tangut could be put on the ballot for ISO/IEC 10646:2014 Amd. 2, and find its way into Unicode 7.2 or 8.0.



Fonts Supporting Unicode 7.0


33 comments:

Miguel Farah said...

Thanks for the detailed report!

I'll begin working on new versions of my keyboard layout.

By the way, would it be too late to add NEW characters? I think I've found evidence of the $ (with TWO bars) used as a character semantically distinct from the regular (one-bar) one.

Andrew West said...

Too late for 7.0. Disunifying the double-barred $ sign from the single-barred $ sign would be very problematic because of the extremely widespread use of the $ sign with either one or two bars as the same currency sign. So, even if there is evidence that in a certain context the two signs are semantically distinct, I think it highly unlikely that the UTC or WG2 would accept the disunification. Of course, you won't know unless you submit a proposal.

R.S. Wihananto said...

Very interesting and detailed post, Mr. West. I don't see Marshallese invariant cedilla characters mentioned here. I'm curious what's their solution for cedilla and comma below problem in Marshallese and Latvian?

Andrew West said...

The issue was discussed this June at the WG2 meeting in Vilnius (see Latvian and Marshallese Ad Hoc Report), and as a result a proposal to encode four non-decomposable letters with cedillas for use with Marshallese was produced. Some opposition to this solution has been expressed in some quarters, but given the marginal importance of Marshallese and the necessity not to destabilize Latvian data, I believe that this is the best solution available. You will have to wait until after Unicode 7.0 to see what the eventual outcome is.

Miguel Farah said...

I know I'll face an uphill battle with the double-bar $ symbol (I'd probably have an easier time proposing the addition of any Iberian or Tartessian script). That's why I'm trying to gather as big of a corpus of evidence as I can before writing a proposal.

ievlampiev said...

So, what about

LATIN SMALL LETTER IOTIFIED E
LATIN SMALL LETTER OPEN OE
LATIN SMALL LETTER UO,

will they go to Unicode 7.0 ?

Andrew West said...

No, the four characters LATIN SMALL LETTER SAKHA YAT, LATIN SMALL LETTER IOTIFIED E, LATIN SMALL LETTER OPEN OE and LATIN SMALL LETTER UO that you proposed are scheduled for inclusion in ISO/IEC 10646:2014 (see Draft additional repertoire for ISO/IEC 10646:2014 (4th edition) page 13), and so will be in the version of Unicode after 7.0, which will probably be released in 2015.

Douglas McClean said...

To my eye, it looks like it might be a "man in business suit being an exclamation mark", but I'm not sure that I would find any more use for that in my writing than for a levitating man.

Given the resemblance to the exclamation mark, the hat, and the sunglasses I would say that "HEISENBERG!" or "HEISENBERG, B----!" would also be good alternatives. ;)

David Lasher said...

Looks like they'll need to add another Formal Alias.

For 2B81 UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW, the words "DOWNWARDS OF" should have been transposed. (Compare to 2B83)

Andrew West said...

Oh dear me.

Elliot Wallace said...

Hmm, the levitating man comes out as a really big lowercase M for me… ???

Andrew West said...

If you see an m instead of a man in a business suit levitating then you probably do not have the "Webdings" font installed on your system (which is quite possible as Webdings is a Microsoft font, so it probably won't be installed on non-Microsoft operating systems). The fact that my attempts to interchange "man in a business suit levitating" fail for some or many readers of this blog provides a good explanation of why encoding characters such as this in Unicode is a useful endeavour, and is not an indication that Unicode has jumped the shark as some have suggested. Although many people on Twitter have suggested that this is a pointless character to encode, it is in a widely used font so some people somewhere in the world will have used it, and will want to interchange it with others, and the only way they can safely do so is if it is encoded in Unicode.

Damien said...

Just a thought, but if MAN IN BUSINESS SUIT LEVITATING was added in 1997, wasn't that the year that Men in Black came out? Could be a direct MiB homage.

Andrew West said...

That sounds plausible to me. According to Microsoft's Webdings page: "Our team of iconographers traveled the world asking site designers and users which symbols, icons and pictograms they thought would be most appropriate for a font of this kind. From thousands of suggestions we had to pick just two hundred and thirty for inclusion in Webdings." So there must have been some rationale for choosing the levitating man over the thousands of rejected suggestions. Someone somewhere must know the answer.

The_Decryptor said...

Even with the Webdings/Wingdings font installed you're not likely to see the special glyph for "m" in modern browsers. Firefox and Safari (and, I assume, Chrome) special-case these fonts so that they don't get used for normal text rendering, so you have to supply the specific PUA codepoint instead if you want to use it.

<p style="font-family: webdings">&#xF06D;</p>

That works for me in Firefox on OS X.

Andrew West said...

I originally used the PUA character (U+F06D), which works fine in IE, but that did not work in Chrome, so I changed it to "m" which works OK in both IE and Chrome. Unfortunately I did not test with Firefox, and I now see that Firefox does not like the "m". I think the simplest solution will be to use a glyph image instead of either F06D or "m".

Ted Mielczarek said...

This blogger claims (via a secondhand inquiry) to have contacted one of the Webdings authors, and that their response was that it was intended to be a rude boy (of the ska variety) posing as an exclamation mark. Very odd. :)

Andrew West said...

Insanely brilliant.

Oliver Camilo said...

Has anyone been able to access the images for the new emojis? I know when Unicode 6.0 came out, people were able to get the new emojis they added then about 6 months before their actual integration into iOS/OSX. Any idea how to get these new ones, or when they might become available?

ksec said...

I am trying to find answer and I hope you could point me to the right direction.

Why are some CJK characters defined with different Unicode when they are the same words and some are defined by font types.

Andrew West said...

The original set of CJK characters defined for version 1.0 of the Unicode Standard (i.e. U+4E00 through U+9FA5 in the CJK Unified Ideographs block) were based on a unification of characters encoded in various national standards. In order to allow for the round-trip conversion between legacy encodings and Unicode (and thereby facilitate the adoption of the new Unicode Standard), a "source separation rule" was applied. This rule meant that glyph variants of CJK characters which would be expected to be unified were encoded separately if the glyph variants were encoded separately in any of the national standards used as sources for the CJK repertoire. The result of this rule is that there are quite a few separately encoded character variants in the basic CJK Unified Ideographs block (e.g. 溈 and 潙, 說 and 説). The source separation rule was not applied to later additions to the Unicode Standard (CJK-A and later), although character variants that have different components or have a different layout of the same components are encoded separately. If you want to understand Unicode encoding of CJK characters you should read Chapter 12 of the Unicode Standard.

Frédéric Grosshans-André said...

Your list misses U+A7AD LATIN CAPITAL LETTER L WITH BELT, used in the Alabama language. The proposal was here http://www.unicode.org/L2/L2012/12080-l-with-belt.pdf .

By the way, thanks for your blog, I really enjoy reading your posts.

Andrew West said...

Hi Frédéric, thanks for your kind comment. I'm pretty sure that this post includes all new Unicode 7.0 characters (except for the Ruble sign which was added later), and if you search for "A7AD" you should find it listed.

Michael Everson said...

I suppose it ought to be pointed out that those three emoji hands you mention were also proposed by me, via Irish ballot comments. Fame or infamy?

Alex said...

Latin Letter Middle Dot is in the draft additional repertoire for ISO/IEC 10646:2014.

Kokopäiväinen skeptikko said...

Where can I propose new characters for Unicode?

- Bitcoin symbol
- Litecoin symbol
- Klingon characters (the original proposal was rejected)

Andrew West said...

Please see Submitting Character Proposals on the Unicode site. Be warned that the process for getting characters or scripts encoded is long and complicated, involving both the Unicode committee and the corresponding ISO committee; and it may take several years and a personal appearance at at least one committee meeting, and then the proposed characters have to go through the ISO balloting process before they are finally accepted into Unicode. The proposal cannot just say "this is a useful character that everyone will use when it is encoded", but has to provide detailed evidence of current usage and a rationale for encoding.

Jim Monty said...

I read the original proposal to include the character that is now REVERSED HAND WITH MIDDLE FINGER EXTENDED (U+1F595) online. I've searched, but I can't find it now. As he indicated here last month, Michael Everson wrote it in an "Irish ballot." This is the document I remember reading months ago, but can't locate now. Can someone please provide me the URL of this document? Thanks.

Andrew West said...

I remember informally discussing the finger characters with Michael Everson a few years ago, in the context of emoji additions, but as far as I can see the first request to encode the character is in Irish ballot comments to PDAM 1.2 in February 2011. Ireland repeats the request in ballot comments to PDAM 2 in June 2012, this time with detailed justification and examples (in the wake of pushback from the Unicode Technical Committee). This is perhaps the document you are thinking about.

David Corbett said...

The blog post explaining the rude boy exclamation mark has disappeared, but it is archived here.

MiGrant said...

Any plans to include the Voynich Manuscript script?

Andrew West said...

There are currently no plans to encode the Voynich alphabet in Unicode. It is not on the roadmap, and no-one has submitted an encoding proposal for it -- and apart from a couple of disparaging remarks in 1999 and 2005 it has not even been discussed on the Unicode mailing list. With our current lack of understanding of the Voynich alphabet it is unlikely that it would be accepted for encoding by the committees, but if it could be proved (and that proof widely accepted) that Voynich does represent natural language and is not a cipher for an existing script, then I think it would be a good candidate for encoding. On the other hand, encoding policy has become noticeably more liberal in recent years, and it is always possible that a well-written proposal with compelling justification for why a character encoding of Voynich is required by scholars might eventually be successful. But unless someone puts the effort into writing and championing such a proposal we will never know.
