ISO/IEC JTC 1/SC 34N0392


ISO/IEC JTC 1/SC 34

Information Technology --
Document Description and Processing Languages

TITLE: Comments on XML Schema Datatype made by ISO/IEC JTC 1/SC 34/WG1
SOURCE: Mr. Martin Bryan
PROJECT: WD 19757-5: Document Schema Definition Language (DSDL) Part 5 - Datatypes
PROJECT EDITOR: Mr. Martin Bryan
STATUS: Comments submitted to W3C Schema Working Group 27/5/2003
ACTION: For information
DATE: 2003-05-08
DISTRIBUTION: SC34 and Liaisons
REPLY TO:

Dr. James David Mason
(ISO/IEC JTC 1/SC 34 Chairman)
Y-12 National Security Complex
Bldg. 9113, M.S. 8208
Oak Ridge, TN 37831-8208 U.S.A.
Telephone: +1 865 574-6973
Facsimile: +1 865 574-1896
Network: [email protected]
http://www.y12.doe.gov/sgml/sc34/
ftp://ftp.y12.doe.gov/pub/sgml/sc34/

Mr. G. Ken Holman
(ISO/IEC JTC 1/SC 34 Secretariat - Standards Council of Canada)
Crane Softwrights Ltd.
Box 266,
Kars, ON K0A-2E0 CANADA
Telephone: +1 613 489-0999
Facsimile: +1 613 489-0995
Network: [email protected]
http://www.jtc1sc34.org



Comments on XML Schema Datatype made by ISO/IEC JTC 1/SC 34/WG 1

As part of its review of the requirements for Part 5 of the ISO 19757 Document Schema Definition Language (DSDL), which will introduce datatype validation into document validation pipelines, Working Group 1 of ISO/IEC Subcommittee SC34 have identified the problems listed below with the application of the World Wide Web Consortium's XML Schema, Part 2: Datatypes recommendation. The order in which the comments are made is not in any way significant: where possible, related subjects have been grouped together, irrespective of the strength of feeling on specific issues. Items have been numbered simply for ease of reference in subsequent discussions: the numbers in no way indicate the relevance of the comment.

  • The W3C XML Schema Datatypes recommendation (XSD) defines too many primitives, many of which could be derived from a smaller set of basic primitives.
  • Some XSD primitives essentially duplicate each other (e.g. xs:float and xs:double).
  • XSD does not cover a wide enough range of derived primitives for mathematical purposes, such as complex numbers, rational numbers and imaginary numbers, or provide adequate facets for defining precision (including ones that allow you to identify values as being exact or approximate).
  • Users are not allowed to extend the set of primitive types from which complex types can be derived.
  • Primitive types should be very few, and geared around the operations to be performed on them for comparison: e.g. number, Boolean, string and binary.
  • Notations and other forms of externally defined processes cannot be used to validate the contents of an element or attribute. In particular there needs to be support for those notations that support XML, such as CSS, JavaScript, and XQuery.
  • XSD does not support XPath-conformant strings as a datatype.
  • The xs:base64Binary format does not require the sequence to be split by linebreaks within limits that ensure safe transmission of data. The privilege assigned to this binary format over others, based on its use within WWW applications, should require that it conforms to WWW practice (e.g. linebreaks every 76 characters).
  • There is no octet-based binary notation.
  • XSD is English biased. It does not support internationalized Boolean values. Some commentators suggest that binary encoding should be considered a facet rather than a datatype, which would assist its internationalisation.
  • The mechanisms provided for referencing between data objects are inadequate.
  • Most of the subtypes of String should be redefined as applications of the pattern facet.
  • Different types of string pre-processing are used in XSD to distinguish names, tokenised strings and normalized strings, complicating processes such as pattern matching within subsequent datatype validation processes, particularly those related to localization efforts. It should be possible to recognize validation patterns within strings in their raw state so that, for example, the use of line feeds within an element's contents can be flagged as an error.
  • Users should be able to define a transformation between the parsed and the lexical space. For example, a date expressed as "24 mai 2002" could be transformed into the corresponding ISO format ("2002-05-24") or into "<year>2002</year><month>05</month><day>24</day>", while a number expressed as 1.200,50 could be transformed into "1200.5".
  • XSD dates force you to use hyphens and colons as separators, rather than slashes, etc. This militates against natural entry of dates by users as part of the contents of elements whose dates need to be validated.
  • XSD does not support localized datatypes, particularly for dates. In publishing, data is often a given fact that cannot be tampered with. Using XML Schemas, you cannot, for instance, say <USdate>12/31/02</USdate> but have to use the form <USdate value="20021231">12/31/02</USdate>. It should be possible to validate dates and other values in their localized form without having to create a normalized form.
  • XSD does not support all date formats allowed by ISO 8601. In particular it does not provide proper support for the definition of recurring periods, which are required for many business processes.
  • Dates should not be restricted to the Gregorian calendar. It should be possible to define dates using major international calendar formats, including those based on lunar cycles, such as the Islamic calendar. XSD has too many primitives that only relate to the Gregorian form of dates.
  • XSD timezone support does not recognize commonly used timezone identification strings (GMT, EST, etc.).
  • There are no facilities for the validation of structured datatypes other than URIs and dates. Of importance to the publishing community are the validation of measurements in multiple dimensions, prices, fractions, identifiers with check digits (e.g. ISBN, EAN and credit card numbers), and arrays, e.g. of X/Y points as used in SVG.
  • All datatypes should have fallback symbols which are tokens (e.g. #undefined) that can be used to override datatype validation within specific instances of an element or attribute.
  • It should be possible to multiply a number by another rational number to change its unit of measurement before checking it falls within a range defined using that unit of measurement. (You may also need to allow addition and subtraction, e.g. to convert from Centigrade to Fahrenheit. One possibility is to define a subset of MathML functionality that could be used to define mathematical processes that need to be performed within validation rules.)
  • XSD does not allow an externally defined list to be used in place of a locally defined enumeration. For data validation purposes the list of valid values needs to be dynamically updatable without having to redefine the underlying schema each time a change is required in one table.
  • XSD does not allow positional validation of tokens against lists. For example, checking that the first token is a number and the second is a unit value stored in a particular list.
  • XSD does not allow you to state that part of a name or string must be the name of a currently valid namespace followed by a colon, or that local names are valid in the specified context (e.g. to validate that the a in an attribute value of the form a:b is valid in the current context, and that b is valid within a). As metadata schemes are using namespaces to identify the provenance of values, a mechanism to formally validate such values is now required.
  • XSD does not allow you to make the set of valid values for one element or attribute dependent on the content of another element or attribute. For example, if the content of one element is Male then the list of valid diseases accepted in another element should be one that does not include those related to gynaecology.
  • Regular expression definition is too complicated, and hides internal structure rather than exposing it. It needs to support alternative ways of tokenising strings into primitives, e.g. pictures, freeform quantities.
  • The set of codes used to identify valid values of the XSD language datatype (the set defined in RFC 1766, which is derived from ISO 639-2) is inadequate for full identification of the language coverage of texts. An extensible mechanism is needed for the benefit of publishers working with minority languages, or targeting texts for specific sub-communities within a linguistic area (e.g. children or those with limited skills in a language).
  • There needs to be a way to distinguish between a deliberately nulled value (e.g. nil) and the absence of a specified value.
  • Type derivation by union across datatypes where string is one of the datatypes is potentially dangerous.
  • xs:enumeration and xs:pattern are not coherent: xs:enumeration allows extension of the value space while xs:pattern restricts the lexical space. These are, moreover, the only facets which may be repeated in a restriction step.
  • Many facets (for instance length, minLength and maxLength) are defined independently while they, in fact, are dependent. Defining them as intervals would be better.
  • The obfuscatory descriptions used to formally define XML Schema datatypes make it difficult for users to ensure the rules are applied correctly. Definitions should be kept as simple as possible without prejudicing validation.
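
Two of the requests above can be illustrated with a short sketch: a user-defined transformation from a localized lexical form to a canonical one, and validation of an identifier that carries a check digit. The Python below is only an illustration under stated assumptions, not a proposal for DSDL Part 5 or a description of any XSD processor. The function names, the French month table and the choice of ISBN-10 as the check-digit example are all hypothetical, and a real implementation would presumably draw on locale data rather than a hard-coded table.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical month table; an assumption made for this sketch only.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}


def french_date_to_iso(text):
    """Map a lexical form such as '24 mai 2002' to the ISO form '2002-05-24'."""
    match = re.fullmatch(r"\s*(\d{1,2})\s+(\w+)\s+(\d{4})\s*", text)
    if match is None:
        raise ValueError(f"not a recognised date: {text!r}")
    day, month_name, year = match.groups()
    month = FRENCH_MONTHS.get(month_name.lower())
    if month is None:
        raise ValueError(f"unknown month name: {month_name!r}")
    # date() rejects impossible dates such as 31 February.
    return date(int(year), month, int(day)).isoformat()


def continental_number_to_canonical(text):
    """Map '1.200,50' (dot as group separator, comma as decimal) to '1200.5'."""
    if not re.fullmatch(r"[0-9]{1,3}(\.[0-9]{3})*(,[0-9]+)?", text):
        raise ValueError(f"not a recognised number: {text!r}")
    value = Decimal(text.replace(".", "").replace(",", "."))
    # normalize() drops trailing zeros; the 'f' format avoids exponent notation.
    return format(value.normalize(), "f")


def isbn10_is_valid(text):
    """Check an ISBN-10: weights 10..1, and the weighted sum must divide by 11."""
    chars = text.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        return False
    digits = [10 if c == "X" else int(c) for c in chars]
    total = sum(w * d for w, d in zip(range(10, 0, -1), digits))
    return total % 11 == 0


if __name__ == "__main__":
    print(french_date_to_iso("24 mai 2002"))            # 2002-05-24
    print(continental_number_to_canonical("1.200,50"))  # 1200.5
    print(isbn10_is_valid("0-306-40615-2"))             # True
```

In this sketch the raw lexical form is checked before any normalization takes place, which is the behaviour the comments on localized values ask for: the instance keeps its published form, and the canonical value is derived only for comparison.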