ISO/IEC JTC 1/SC34 N323

ISO/IEC JTC 1/SC34

Information Technology --

Document Description and Processing Languages

TITLE: Guide to the topic map standards
SOURCE: SC34
PROJECT: ISO 13250
PROJECT EDITOR: M. Biezunski, S. Newcomb, and M. Bryan
STATUS:
ACTION: For information
DATE: 2002-06-23
DISTRIBUTION: JTC1, SC34 and Liaisons
REFER TO:
SUPERSEDES: N278
REPLY TO: Dr. James David Mason
(ISO/IEC JTC1/SC34 Chairman)
Y-12 National Security Complex
Bldg. 9113, M.S. 8208
Oak Ridge, TN 37831-8208 U.S.A.
Telephone: +1 865 574-6973
Facsimile: +1 865 574-18964
Network: [email protected]
http://www.y12.doe.gov/sgml/sc34/
ftp://ftp.y12.doe.gov/pub/sgml/sc34/

Ms. Sara Hafele, ISO/IEC JTC 1/SC 34 Secretariat
American National Standards Institute
11 West 42nd Street
New York, NY 10036
Tel: +1 212 642 4976
Fax: +1 212 840 2298
Email: [email protected]

Guide to topic map standardization

This document is a guide to the current topic map standardization activities. It describes what is currently being done, the problems that need to be solved, and how those problems came to be, presented in the opposite order for ease of understanding.

It is hoped that this guide will enable outsiders to the process to understand what is happening, and make it easier for them to contribute to the process.

The past

The topic maps work started within the International Organization for Standardization (ISO), in a subcommittee today known as SC34 (SC is short for subcommittee). This subcommittee works on SGML, DSSSL, HyTime, font standards, topic maps, the new XML schema language framework called DSDL, and other things. SC34 is divided into three working groups (WGs), and the topic maps work is done by WG3.

The first substantial result of the topic maps effort was ISO 13250:2000, an ISO standard that defined a syntax for topic maps. This syntax was an SGML DTD, which used the ISO 10744 HyTime standard for linking and addressing, and so the syntax is known as HyTM (short for HyTime Topic Maps). When HyTM was completed, there were three known issues with the syntax.

It is not an XML syntax
HyTM was specified using SGML, and as most people use XML, it was recognized that an XML version would be needed.
It is not a complete DTD
The HyTM syntax does not specify how references to external documents are to be represented, nor what syntax is to be used for internal references. The result was that each topic map software developer made its own HyTM version, derived from the standard HyTM DTD.
It does not use URIs for external references
A syntax which does not use URIs does not integrate well into a web context, and it was generally recognized that it was an important requirement that topic maps be able to do so.

To resolve these issues and adapt topic maps to the web, the TopicMaps.Org organization was set up to create a new topic map syntax based on XML and URIs. The syntax TopicMaps.Org created is known as XTM (XML Topic Maps), and it solves the problems described above. Today the HyTM syntax is rarely used; most people use XTM, precisely for these three reasons.

In October 2001 the XTM DTD was accepted into ISO 13250, and so the second edition of ISO 13250 now contains two syntaxes: HyTM and XTM.

The present

Some problems remain, however. The current ISO 13250 defines two interchange syntaxes (XTM and HyTM), but does not explain how they relate to one another. There are a number of non-trivial differences between the syntaxes, which is what makes this a problem. For example, the structure of topic names is different in the two syntaxes. In HyTM the structure of names is as shown below.

  <topname scope="...">
    <basename scope="...">...</basename>
    <dispname scope="...">...</dispname>
    <sortname scope="...">...</sortname>
  </topname>

In XTM, however, the structure is as shown below.

  <baseName>
    <scope>...</scope>
    <baseNameString>...</baseNameString>
    <variant>
      <parameters>...</parameters>
      <variantName>...</variantName>
    </variant>
  </baseName>

The problem is how to relate display and sort names to variant names, and also how the different ways to specify scope match up. This is just one example of the differences between the two syntaxes, and given these differences, it is not obvious how to map between them. This is a problem, since implementors are likely to choose different approaches, and this is likely to cause interoperability problems.

Another problem is that both syntax specifications in the current ISO 13250 fail to specify what implementations are to do in a number of situations. The basics are clear, but in a number of places the specifications do not say what is supposed to happen. In some of these cases developers have interpreted the specification text differently, and this causes interoperability problems. If different implementations interpret the same topic map differently, topic map applications may only work with a single implementation, which defeats the purpose of having a standard in the first place.

ISO SC34 has also resolved to create two new topic map standards:

ISO 18048: Topic Maps Query Language (TMQL)
A query language for topic maps. This language is intended to be a kind of SQL (or XML Query) for topic maps, and will greatly simplify topic map application development by making it much easier to extract information from topic maps. A requirements specification has been created.
ISO 19756: Topic Maps Constraint Language (TMCL)
A schema or constraint language for topic maps. Using TMCL one can write schemas that constrain what may be said in a topic map, such as "a person must be born in a place," "a person must have at least one name," and so on. A requirements draft has been created.
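As a rough illustration of the kind of rule TMCL is meant to express, the sketch below checks the "a person must have at least one name" constraint in Python. The `Topic` class and the `check_min_names` function are hypothetical names invented for this example. TMCL itself had only a requirements draft at the time of writing, so this is not TMCL syntax or semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical in-memory topic; not a structure defined by ISO 13250."""
    topic_id: str
    types: set = field(default_factory=set)    # ids of the topic's types
    names: list = field(default_factory=list)  # base name strings


def check_min_names(topics, topic_type, minimum=1):
    """Return the ids of topics of `topic_type` with fewer than `minimum` names."""
    return [t.topic_id for t in topics
            if topic_type in t.types and len(t.names) < minimum]


topics = [
    Topic("puccini", {"person"}, ["Giacomo Puccini"]),
    Topic("anonymous", {"person"}, []),  # violates the constraint
    Topic("lucca", {"place"}, []),       # not a person, so not checked
]
violations = check_min_names(topics, "person")
print(violations)
```

A real constraint language would state such rules declaratively rather than as procedural checks; the code only shows what "constraining a topic map" amounts to.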

Both of these standards need to explain how the constructs in them are evaluated, but the existing ISO 13250 does not provide a suitable basis for such definitions. For example, when TMQL defines the "find all base names of topic X in scope Y" operator, it needs to explain carefully and formally what that operator does. This could be done in terms of the XTM syntax, but it would then be difficult to see how to apply it to the HyTM syntax. The explanation would also become very involved, as XTM provides many different ways to express the same thing, and merging of topics within the topic map must be performed before queries can be done.

So while the community is generally satisfied with the two syntaxes, their specifications are in need of improvement on three counts:

  • Not all developers interpret them the same way.
  • They do not clearly relate the two syntaxes to one another.
  • They do not provide suitable foundations for the TMQL and TMCL standards.

ISO SC34's solution to this is the topic map data model work that was started in May 2001, and is now beginning to produce tangible results, in the form of N0298R1 and N0299. (See also the SAM home page.) The TMQL and TMCL work is currently waiting for the data model work before continuing, as both depend on the outcome of that work.

The future

ISO SC34's current plan is to revise ISO 13250 into a multi-part standard that resolves the problems described in the previous section. A key part of this new edition of the standard will be what is known as the Standard Application Model (SAM), a formal data model for topic maps. This model will be based on the same formalism as the XML Information Set. It will define the allowed structure of topic maps, as well as how to perform key operations such as merging and duplicate removal. The SAM is what will allow SC34 to solve the problems with the interpretations of the specifications, relate HyTM and XTM to one another, and create a foundation for TMQL and TMCL.

The problem with the interpretation of the ISO 13250:2000 and XTM 1.0 specifications will be solved by writing new specifications for the HyTM and XTM syntaxes based on the SAM. The new versions of the syntax specifications will describe how to build an instance of the SAM model from a document in a given syntax, but will not change the syntaxes themselves. That is, they will say such things as "for each <topic> element in the document, create a topic item", "for each <baseName> child of the <topic> element, create a base name item and add it to the [base names] property of the corresponding topic item," and so on.

This will be done more formally than in the examples above, and in a way that leaves much less room for interpretation. Rewriting the syntax specifications in this way will also solve the problem of how to relate the XTM syntax to HyTM, and vice versa. The SAM will then serve as a common point of reference for the two syntaxes, and parts of the syntaxes can be compared by comparing the SAM instances they produce.
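The two quoted rules can be sketched as a small Python routine that reads an XTM 1.0 document and builds topic items with a [base names] property. This is only an informal illustration: the dict-based items and the `build_model` function are assumptions made for the example, and the real SAM defines its items and properties far more precisely.

```python
import xml.etree.ElementTree as ET

XTM = "{http://www.topicmaps.org/xtm/1.0/}"        # XTM 1.0 namespace
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"  # xlink:href attribute

SAMPLE = """
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <topic id="puccini">
    <baseName>
      <scope><topicRef xlink:href="#italian"/></scope>
      <baseNameString>Giacomo Puccini</baseNameString>
    </baseName>
  </topic>
  <topic id="italian"/>
</topicMap>
"""


def build_model(xtm_text):
    """Build a dict-based stand-in for a data model instance (hypothetical shape)."""
    root = ET.fromstring(xtm_text)
    topic_items = []
    # "for each <topic> element in the document, create a topic item"
    for topic_el in root.iter(XTM + "topic"):
        item = {"id": topic_el.get("id"), "base names": []}
        # "for each <baseName> child of the <topic> element, create a base
        # name item and add it to the [base names] property"
        for name_el in topic_el.findall(XTM + "baseName"):
            value = name_el.findtext(XTM + "baseNameString", default="")
            scope = [ref.get(XLINK_HREF)
                     for ref in name_el.findall(XTM + "scope/" + XTM + "topicRef")]
            item["base names"].append({"value": value, "scope": scope})
        topic_items.append(item)
    return topic_items


model = build_model(SAMPLE)
print(model)
```

A HyTM reader written against the same item structure would make the two syntaxes directly comparable, which is the point of the common model.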

This solution will work even for new topic map syntaxes, should any new syntaxes be created in the future, and it provides a way to relate non-standard topic map syntaxes (such as LTM and AsTMa) to the standard ones. It also provides a way to make mappings from syntaxes that do not directly represent topic maps, but closely related information, such as NewsML and XFML.

The SAM provides a much more suitable basis for TMQL and TMCL, since it unites the different syntaxes and is a much more convenient basis for operator definitions. Defined using the SAM, the "find all base names of topic X in scope Y" operator would become something like "traverse the [base names] property of topic item X and return all base name items whose [scope] property contains topic item Y". (In practice the definition is likely to be somewhat different, but this is the basic idea.) TMQL and TMCL will then also be applicable to any topic map syntax that has a mapping to the SAM.
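The operator described above can be sketched in a few lines of Python over a toy model. `TopicItem`, `BaseNameItem`, and `base_names_in_scope` are hypothetical names chosen for this example, and topic items in a scope are represented simply by their ids. Neither the SAM nor TMQL defines these names.

```python
from dataclasses import dataclass, field


@dataclass
class BaseNameItem:
    """Toy base name item with a value and a [scope] property."""
    value: str
    scope: frozenset = frozenset()  # ids of the topic items in the scope


@dataclass
class TopicItem:
    """Toy topic item with a [base names] property."""
    item_id: str
    base_names: list = field(default_factory=list)


def base_names_in_scope(topic, scope_topic_id):
    """Traverse [base names] of `topic`; keep items whose [scope] contains the id."""
    return [name for name in topic.base_names if scope_topic_id in name.scope]


puccini = TopicItem("puccini", [
    BaseNameItem("Giacomo Puccini", frozenset({"italian"})),
    BaseNameItem("Puccini", frozenset({"english", "short"})),
])
result = [name.value for name in base_names_in_scope(puccini, "english")]
print(result)
```

Because the function works on the model rather than on XTM or HyTM markup, it applies unchanged to topic maps loaded from either syntax.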

Canonicalization

Although the new specifications will be clearer than the previous versions, it will still be necessary to verify that implementations actually conform to them. This is best done by creating a conformance test suite, much like those already created for XML and XSLT. It is easy to create a set of topic map documents in the XTM and HyTM syntaxes, but harder to define what their correct interpretation is.

One way to do it is to create a so-called canonical syntax. In this syntax, every logically equivalent topic map would be represented as exactly the same sequence of bytes. This means that in order to see how a topic map engine interprets an XTM file, one could import that file into the engine, and then export it using the canonical syntax. The test suite could then consist of a set of XTM and HyTM documents with their corresponding canonical representations, and conformance testing could be automated.
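The idea of a canonical form can be illustrated with a deliberately tiny Python sketch. It assumes a topic map reduced to a mapping from topic ids to base name strings and uses sorted, compact JSON as the byte representation. Both choices are assumptions made for the example and are not the Canonical Topic Map syntax. A real canonical syntax must also deal with identifiers that differ between engines, which this sketch ignores.

```python
import json


def canonicalize(topics):
    """Serialize `topics` so that logically equal input gives identical bytes.

    `topics` maps a topic id to a collection of base name strings
    (a hypothetical, heavily simplified topic map).
    """
    # Sort topics and names, and drop duplicate names, so that neither
    # ordering nor repetition in the input affects the output.
    normalized = [[topic_id, sorted(set(names))]
                  for topic_id, names in sorted(topics.items())]
    return json.dumps(normalized, ensure_ascii=False,
                      separators=(",", ":")).encode("utf-8")


# Two exports of the same topic map, differing in order and duplication.
export_a = {"puccini": ["Puccini", "Giacomo Puccini"], "lucca": ["Lucca"]}
export_b = {"lucca": ["Lucca"],
            "puccini": ["Giacomo Puccini", "Puccini", "Puccini"]}

same = canonicalize(export_a) == canonicalize(export_b)
print(same)
```

A conformance test then reduces to a byte comparison between an engine's canonical export and the expected file in the test suite.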

The new ISO 13250 standard is going to contain just such a Canonical Topic Map syntax. It is expected that a conformance test suite will be developed, either within OASIS or within ISO, once the necessary infrastructure is in place. There also exists an early proposal for such a canonical syntax.

The Reference Model

The new ISO 13250 will also include a model known as the Reference Model, which is a more abstract graph model of topic maps. In this model, names and occurrence resources turn into nodes on the same level as topics, and they are related to their topics using an association-like structure of nodes and arcs. The result is a model that uses fewer constructs than the SAM, and which can be extended without changing the metamodel.

The Reference Model provides a mechanism for explaining the relationships between different knowledge representations, such as topic maps, RDF, and KIF. This will make it easier for topic maps to interoperate with these other knowledge representations.

It is planned that the SAM part of the standard will include a normative mapping of the SAM to the Reference Model. The TMQL and TMCL standards will thus relate to the Reference Model through the SAM. Obviously, it is very important that the SAM and the RM are consistent, and much work will go into ensuring that this is the case.

Overview

The conceptual diagram below shows the relationships between the different parts of the new ISO 13250, as well as TMQL and TMCL:

[Diagram of new TM standards]

The parts of the new ISO 13250 standard will be:

  • Part 0: A guide to the structure of the standard (Currently unknown)
  • Part X: The Standard Application Model (Lars Marius Garshol and Graham Moore)
  • Part X: The Reference Model (Steven R. Newcomb and Michel Biezunski)
  • Part X: The XML Topic Maps syntax (XTM) (Lars Marius Garshol and Graham Moore)
  • Part X: The HyTime Topic Maps syntax (HyTM) (Lars Marius Garshol and Graham Moore)
  • Part X: Canonicalization of topic maps (Currently unknown)

There is currently no clear timeframe for the finalization of these specifications.

Meanwhile, at OASIS...

For topic maps created by different parties to merge correctly, it is crucial that these parties use the same identifiers for their topics. This is unlikely to happen by itself, however, and therefore three Technical Committees (TCs) have been formed within OASIS to work on something called published subjects. These are URIs and descriptions for concepts considered important by some publisher.
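Why shared identifiers matter for merging can be shown with a minimal Python sketch. It assumes each topic is a dict with a "subject" URI and a list of names, and the URIs under psi.example.org are made up for the example; they are not real published subjects. Actual merging rules in the standard cover more than this.

```python
def merge_by_subject(*topic_maps):
    """Merge topics from several topic maps that share a subject identifier.

    Each topic is a dict with a "subject" URI and a list of "names";
    this shape is an assumption made for the example.
    """
    merged = {}
    for topic_map in topic_maps:
        for topic in topic_map:
            # Topics with the same subject URI collapse into one entry.
            names = merged.setdefault(topic["subject"], set())
            names.update(topic["names"])
    return merged


# Hypothetical identifiers under example.org, not real published subjects.
NORWAY = "http://psi.example.org/country/NO"
SWEDEN = "http://psi.example.org/country/SE"
map_a = [{"subject": NORWAY, "names": ["Norway"]}]
map_b = [{"subject": NORWAY, "names": ["Norge"]},
         {"subject": SWEDEN, "names": ["Sverige"]}]

merged = merge_by_subject(map_a, map_b)
print(sorted(merged[NORWAY]))
```

Had the two parties used different URIs for Norway, the result would contain two separate topics for the same country, which is exactly the outcome published subjects are meant to prevent.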

The three OASIS TCs are:

Published subjects TC
Creates guidelines and recommendations for how to create, publish, and maintain published subject sets.
XML Vocabulary TC
Creates a vocabulary (or ontology) consisting of published subjects for the domain of core XML standards and technologies.
Geography and languages TC
Creates published subject sets for geographical and linguistic concepts. These published subject sets will be based on existing code sets such as ISO 639 and ISO 3166.

The published subjects activity within OASIS will layer on top of specifications produced by ISO SC34, and will not in any way interfere with what SC34 is doing.