N0601 - 9573-13 comments from Unicode

ISO/IEC JTC 1/SC 34N0601

ISO/IEC JTC 1/SC 34

Information Technology --
Document Description and Processing Languages

TITLE:	Result of a review of PDTR 9573-13 - The Unicode Consortium
SOURCE:	Asmus Freytag, Technical Vice President, The Unicode Consortium
PROJECT:	PDTR 9573-13:2004 (type 3) 2nd Edition: Information technology -- SGML support facilities -- Techniques for using SGML -- Part 13: Public entity sets for SGML -- for mathematics and science
PROJECT EDITOR:	Dr. David P. Carlisle
STATUS:	Liaison contribution
ACTION:	For consideration by project team
DATE:	2005-03-15
DISTRIBUTION:	SC34 and Liaisons
REFER TO:	N0599 - 2005-02-24 - PDTR 9573-13: Maths and scientific character sets
REPLY TO:	Dr. James David Mason (ISO/IEC JTC 1/SC 34 Chairman), Y-12 National Security Complex, Bldg. 9113, M.S. 8208, Oak Ridge, TN 37831-8208, U.S.A. Telephone: +1 865 574-6973 Facsimile: +1 865 574-1896 Network: [email protected] http://www.y12.doe.gov/sgml/sc34/ ftp://ftp.y12.doe.gov/pub/sgml/sc34/
	Mr. G. Ken Holman (ISO/IEC JTC 1/SC 34 Secretariat - Standards Council of Canada), Crane Softwrights Ltd., Box 266, Kars, ON K0A-2E0, CANADA. Telephone: +1 613 489-0999 Facsimile: +1 613 489-0995 Network: [email protected] http://www.jtc1sc34.org

Background

The Unicode Consortium and NCITS/L2 have been following the work on PDTR 9573-13 with some interest for a while. As it involves mapping to Unicode characters, we are naturally interested in making sure that the mappings go to the correct characters; in other words, that the entity mappings do not inadvertently imply something about the identity of Unicode characters which contradicts the Unicode Standard (and ISO/IEC 10646).

There has been a long history of active collaboration between the Unicode Consortium and several experts interested in technical and mathematical publishing, including David Carlisle. A result of this collaboration was a large extension of the repertoire of mathematical and technical characters in the Unicode Standard, starting with Unicode versions 3.1 and 3.2. This has been followed more recently by proposals for the addition to ISO/IEC 10646 of several remaining characters needed to complete the mapping of public entity sets. The latest two sets of such character additions are slated for AMD1 and AMD2 of ISO/IEC 10646:2003 respectively. AMD1 is about to enter FDAM ballot, and AMD2 will be entering FPDAM ballot this spring. They will be synchronized to Unicode 4.1 (3/31/05) and Unicode 5.0 (2006), respectively.

I have been given the action by NCITS/L2 (action L2A-194.1) and the Unicode Technical Committee to prepare a review of the proposed mappings, which I hereby submit as a liaison contribution to SC34. We hope that SC34 will find the information and comments in this review helpful in preparing DTR 9573-13.

Comment 1: Normative References.

The draft references editions of 10646 which are no longer the most recently published ones. As character codes have not been removed, nor characters renamed, the normative reference could be immediately updated to

ISO/IEC 10646:2003

which is the most recent version published and which corresponds to Unicode 4.0.

There are two amendments to 10646 in various stages of balloting in SC2. Both of these are adding characters explicitly to allow better mapping to entity sets in 9573-13.

These amendments are:

ISO/IEC 10646:2003/Amd 1:2005 (estimated, about to enter FDAM ballot)
ISO/IEC 10646:2003/Amd 2:2006 (estimated, about to enter FPDAM ballot)

Comment 2: Section on General Considerations

a) The TR should no longer be aligned with Unicode 3.2, but at the minimum with Unicode 4.0, as that is the version corresponding to ISO/IEC 10646:2003. Unicode 4.1 (to be released 3/31/2005) will be synchronized with Amd1 and Unicode 5.0 (fall 2006) will be synchronized to Amd2 of 10646.

b) The text in this section is confusing. The word 'standard' seems to apply both to 10646 and to 9573 without clear reference. The word 'character' is used both to refer to the character names of 10646 and to the 'entity name' in section 7 of the PDTR. As TR9573 does not define characters, but entities, perhaps this section should be revised to refer to "entity names" where 9573 is concerned and "character names" where 10646 is concerned.

Comment 3: Section 4.1

There is no text for this section in this version of the draft.

Comment 4: Section 4.2

Note that the use of 5-digit notation for code positions in 10646 is non-standard. Commonly, a variable-length notation of 4 to 6 digits is used.
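
As a minimal sketch of the variable-length convention, the following Python fragment formats a code position with at least four and at most six hexadecimal digits. The helper name `format_code_position` is hypothetical and is not defined by the PDTR or by 10646.

```python
def format_code_position(cp: int) -> str:
    """Format a code position as 'U+' followed by 4 to 6 uppercase hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code position out of range")
    # Pad to a minimum of four digits; larger values naturally use five or six.
    return f"U+{cp:04X}"


print(format_code_position(0x2935))    # U+2935
print(format_code_position(0x1D6A5))   # U+1D6A5
print(format_code_position(0x10FFFF))  # U+10FFFF
```

Under this convention a BMP code position is never padded to five digits, which is the point of the comment above.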

Comment 5: Differences between MathML and STIX Data

If the mapping in section 7 agrees with what we consider the preferred mapping, or if we have no opinion on the discrepancy, we have not always commented on each discrepancy. Where we comment on a mapping in this section, we will not comment again on the same entity in section 7.

a) The mapping of [cudarrr] to U+2935 does seem inconsistent with the mappings for entities of related names. U+2939 seems more appropriate in that context. If the current mapping is retained, it would be advisable to include a rationale for that choice.

b) If the mapping of [cudarrr] is changed based on comment 5a, then U+2939 is no longer available for mapping [larrpl], and the mapping to U+2946 would automatically become the preferred one. The mapping to U+2946 also seems the more appropriate based on entities with related names. If the current mapping is retained, it would be advisable to include a rationale for that choice.

c) The note in the entry for [midast] is obviously a mistake. However, if U+2217 is not otherwise mapped, it would seem a better mapping for this particular entity.

d) The STIX mapping for [veebar] to U+2A61 seems inappropriate in the context of [barvee]. (There is no Unicode character with a line above a small v.) The current mapping to U+22BB should be retained.

e) The STIX mapping for [xcirc] seems inappropriate. The current mapping to U+25EF should be retained. U+25EF LARGE CIRCLE was added in Unicode 3.2 to allow U+25CB to retain its role as standard-size circle.

f) The mapping for [jmath] is incorrect. A new character is being added to Unicode 4.1 at U+1D6A5 MATHEMATICAL ITALIC SMALL DOTLESS J (this character is in FDAM balloting; unless the FDAM fails unexpectedly, this location cannot change).

f) The mapping for [trpezium] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+23E2 TRAPEZIUM.

g) The mapping for [bsolhsub] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+27C8 REVERSE SOLIDUS PRECEDING SUBSET

h) The mapping for [suphsol] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+27C9 SUPERSET PRECEDING SOLIDUS

i) The mapping for [benzen] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+2B21 WHITE HEXAGON

j) The mapping for [benzenr] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+23E3 BENZENE RING WITH CIRCLE

k) The mapping for [hbenzen] should be further reviewed and, if necessary, updated at a later time. The mapping to U+2394 SOFTWARE-FUNCTION SYMBOL correctly maps to a horizontal hexagon, but it isn't clear that U+2394 was intended to be unified with HORIZONTAL WHITE HEXAGON. The representative glyph in Unicode 4.0 shows a different proportion and positioning relative to a baseline than that for the generic WHITE HEXAGON.

l) Use of spacing clones of combining characters. This is a generic issue. It seems advisable to include a generic note to the TR that explains the issue and gives a rationale for the choice made.

m) [phi] and [phiv]: The mapping of the curly version should be to the character in the main sequence of the Greek alphabet in Unicode, U+03C6 GREEK SMALL LETTER PHI. A note explaining this should be added. In Unicode 3.0 the representative glyphs for SMALL PHI and PHI SYMBOL were swapped, to match the appearance of the typical glyphs used for SMALL PHI in most fonts. (The early SC2 standards used sans-serif fonts in the documentation, and unfortunately, for such fonts the opposite choice of a shape for SMALL PHI is common.) The Unicode Standard is unambiguous that the straight version should be mapped to U+03D5.

The same applies to [b.phi] and [b.phiv] respectively.

n) The discrepancies listed for [circle] through [circleft] do not correspond to mappings in section 7. If mappings for these are required, the listed STIX mappings would be appropriate, except for mapping [circle] to RING OPERATOR. That seems inappropriate given the fact that in Unicode the correct member of the set is U+25CB WHITE CIRCLE.

o) The discrepancies listed for [diamond] and [diamondf] do not correspond to mappings in section 7, which only shows [loz] and [lozf], which are correctly mapped to U+25CA LOZENGE and U+29EB BLACK LOZENGE, respectively. If mappings for [diamond] and [diamondf] are required, the appropriate mappings would be to U+25C7 WHITE DIAMOND and U+25C6 BLACK DIAMOND, respectively.

p) The mapping for [diamonfb] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+2B19 DIAMOND WITH BOTTOM HALF BLACK

q) The mapping for [diamonfl] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+2B16 DIAMOND WITH LEFT HALF BLACK

r) The mapping for [diamonfr] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+2B17 DIAMOND WITH RIGHT HALF BLACK

s) The mapping for [diamonft] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+2B18 DIAMOND WITH TOP HALF BLACK

t) There is no character in Unicode corresponding to an [fjlig]. A mapping to f followed by j would preserve the content of the text and would work well wherever a font containing an fj ligature is not available. A mapping to f <zwj> j is also possible, as <zwj> would be interpretable as a request to form the ligature by systems that do support that use of this character. However, systems not supporting <zwj> would have to correctly ignore it.
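
The two candidate expansions for [fjlig] can be compared with a short Python sketch. The constant and function names below are hypothetical, and `strip_zwj` only models a system that ignores <zwj>; it is not a description of any particular implementation.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Hypothetical candidate expansions for the [fjlig] entity.
FJLIG_PLAIN = "fj"
FJLIG_WITH_ZWJ = "f" + ZWJ + "j"


def strip_zwj(text: str) -> str:
    """Model a system that does not support ZWJ and correctly ignores it."""
    return text.replace(ZWJ, "")


# The ZWJ form carries one extra code position between the two letters.
print([f"U+{ord(c):04X}" for c in FJLIG_WITH_ZWJ])

# Once ZWJ is ignored, both expansions carry the same text content.
print(strip_zwj(FJLIG_WITH_ZWJ) == FJLIG_PLAIN)
```

Either expansion preserves the letters f and j; the difference is only whether a ligature request is carried along with them.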

u) For the series of squares and partially filled squares, as shown by the entries for [squarf] through [squarftr], care should be taken that they map to a consistent set of squares in Unicode. If that is not possible, a note indicating a rationale for the choice should be provided.

U+25A1 WHITE SQUARE and U+25A0 BLACK SQUARE form part of a set containing the following partially filled squares:

[squarfb]       *U+2B13 SQUARE WITH BOTTOM HALF BLACK
[squarfbl]      **U+2B15 SQUARE WITH LOWER LEFT DIAGONAL HALF BLACK
[squarfbr]      U+25EA SQUARE WITH LOWER RIGHT DIAGONAL HALF BLACK
[squarfl]       U+25E7 SQUARE WITH LEFT HALF BLACK
[squarfr]       U+25E8 SQUARE WITH RIGHT HALF BLACK
[squarft]       *U+2B12 SQUARE WITH TOP HALF BLACK
[squarftl]      U+25E9 SQUARE WITH UPPER LEFT DIAGONAL HALF BLACK
[squarftr]      **U+2B14 SQUARE WITH UPPER RIGHT DIAGONAL HALF BLACK

* to be added to Unicode 4.1 with name and code location as shown
** to be added to Unicode 5.0 with tentative code location as shown
All others are part of Unicode 4.0 or earlier.
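
A consistency check of the table above can be sketched in Python using the standard `unicodedata` module. This assumes a Python build whose Unicode data postdates the tentative allocations described here; the dictionary name `SQUARE_ENTITIES` is hypothetical, and its contents are copied from the table.

```python
import unicodedata

# Entity name -> code position, as listed in the table above.
SQUARE_ENTITIES = {
    "squarfb": 0x2B13,
    "squarfbl": 0x2B15,
    "squarfbr": 0x25EA,
    "squarfl": 0x25E7,
    "squarfr": 0x25E8,
    "squarft": 0x2B12,
    "squarftl": 0x25E9,
    "squarftr": 0x2B14,
}

for entity, cp in SQUARE_ENTITIES.items():
    name = unicodedata.name(chr(cp))
    # Every member of a consistent set should be a SQUARE WITH ... BLACK character.
    consistent = name.startswith("SQUARE WITH") and name.endswith("BLACK")
    print(f"[{entity}] U+{cp:04X} {name} consistent={consistent}")
```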

v) The mapping for [acd] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+23E6 AC CURRENT. A mapping to U+223F is not appropriate since [acd] contains a center line, not found in U+223F. If the mapping cannot be corrected, a note should be added, explaining the choice of mapping.

w) The mapping for [elinters] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+23E1 ELECTRICAL INTERSECTION

x) The mapping for [fltns] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+23E5 FLATNESS

y) The mapping of mathematical brackets to brackets in the CJK Punctuation section in Unicode as currently specified will lead to unexpected results. In Unicode, several sets of mathematical brackets have been explicitly disunified from East Asian mappings as many fonts will provide glyphs that are not usable in mathematical context and systems will apply special East Asian layout rules at least to the more common brackets used in East Asian text. In addition, Unicode considers the following character pairs canonical equivalents of each other, so that the first member of the pair would get substituted by the second member, anytime normalization was applied.

U+2329 LEFT-POINTING ANGLE BRACKET => U+3008 LEFT ANGLE BRACKET
U+232A RIGHT-POINTING ANGLE BRACKET => U+3009 RIGHT ANGLE BRACKET

The STIX mappings shown for [rang], [Rang], [lang], [Lang], [lobrk], and [robrk] are to the appropriate characters, which were added in Unicode 3.2 explicitly to provide for bracket characters that are appropriate for mathematical usage. The current mappings should be reviewed for possible correction.
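
The substitution under normalization can be observed with Python's standard `unicodedata` module, as in the sketch below. U+27E8 MATHEMATICAL LEFT ANGLE BRACKET is used here only as an assumed example of a mathematical bracket added in Unicode 3.2; this comment does not itself list the STIX code positions.

```python
import unicodedata

# The two canonical equivalents named above are replaced under normalization.
for cp in (0x2329, 0x232A):
    normalized = unicodedata.normalize("NFC", chr(cp))
    print(f"U+{cp:04X} -> U+{ord(normalized):04X} {unicodedata.name(normalized)}")

# Assumed example: a mathematical bracket survives normalization unchanged.
math_bracket = "\u27E8"
print(unicodedata.normalize("NFC", math_bracket) == math_bracket)
```

Because the replacement happens in every normalization form, text that maps an entity to U+2329 or U+232A cannot be relied on to keep those code positions once it passes through a normalizing process.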

z) The mapping for [strns] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+23E4 STRAIGHTNESS

aa) The mapping of [barwed] to U+2305 PROJECTIVE appears questionable, in light of U+22BC NAND. Note that [Barwed] can be mapped to U+2A5E LOGICAL AND WITH DOUBLE OVERBAR, which seems more appropriate than mapping to U+2306 PERSPECTIVE. The current mapping should be reviewed and corrected.

Comment 6: Duplicate Entries

a) The mapping of [strns] to U+00AF should be removed, see comment 5z above.

b) The mapping of [imath] to U+0131 is incorrect. A new character is being added to Unicode 4.1 at U+1D6A4 MATHEMATICAL ITALIC SMALL DOTLESS I (this character is in FDAM balloting; unless the FDAM fails unexpectedly, this location cannot change).

c) The mapping of [b.Gammad] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+1D7CA MATHEMATICAL BOLD CAPITAL DIGAMMA

d) The mapping of [b.gammad] should be updated, if possible. A new character is being added to Unicode 5.0 at tentative location U+1D7CB MATHEMATICAL BOLD SMALL DIGAMMA

e) The mapping of [perp] to U+22A5 should be removed. A new character is being added to Unicode 4.1 at U+27C2 PERPENDICULAR (this character is in FDAM balloting; unless the FDAM fails unexpectedly, this location cannot change).

f) The mapping of [elinters] to U+FFFD should be removed, see comment 5w above.

g) The mapping of [trpezium] to U+FFFD should be removed, see comment 5f above.

h) The text in section 6.2 refers to the use of SPACE as a base character for combining characters. Note that as of Unicode 4.1 the use of SPACE will be deprecated in favor of NBSP for this purpose. However, in the context of a markup language, where NBSP can be expected to be represented as an entity (&nbsp;), using NBSP would incur the same problems that the introduction of a stand-in base character was intended to resolve. If another entity, starting with a combining mark, were to be expanded while NBSP was retained as an entity, the trailing ";" would be seen as a base character in any tool that is not actually parsing the markup language itself, but is merely used to view or edit the source code.
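
The problem described in comment 6h can be illustrated with a small Python sketch. It assumes, purely for illustration, that the expanded entity begins with U+0302 COMBINING CIRCUMFLEX ACCENT; the variable names are hypothetical.

```python
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"  # assumed example of a combining mark

# Source text as a plain viewer or editor would see it when the NBSP entity is
# left unexpanded but the following entity has been expanded.
source_view = "&nbsp;" + COMBINING_CIRCUMFLEX

# A tool that does not parse the markup treats the preceding character as the base.
apparent_base = source_view[-2]
print(repr(apparent_base))

# A non-zero combining class confirms that the last character is a combining mark.
print(unicodedata.combining(source_view[-1]) != 0)
```

Here the apparent base is the semicolon that terminates the entity reference, not a no-break space.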

i) Section 6.3 mentions characters in process of being added to ISO/IEC 10646 and Unicode. In our comments we have pointed out in detail where these might apply.

Comment 7: Character Listings

a) Perhaps this section should be titled "Entity Listings" instead.

b) There is a problem with the text of the fourth bullet (on the second line).

c) Some of the glyph images shown may be misleading. For example the glyph images for [loz] and [lozf] would be expected to match, except for one being filled. In the Unicode Standard, pairs of symbols that have the same name except for WHITE and BLACK tend to correspond in all aspects, except for the former being outlined and the latter being filled.

d) The word 'image' in the header for the third column should be capitalized.

e) The header for the last column should be "ISO/IEC 10646/Unicode Name"

All mapping issues in this section are already covered by other comments.

[End of Comments]