Document Description and Processing Languages


Proposal for a Restatement of ISO 8879 with Web SGML Adaptations and Extended Naming Rules


Dr. Charles F. Goldfarb, Convenor of SC34/WG1


All SC34 projects

Status of Document:

Personal contribution

Requested action:

For information

Summary of major points:

A "restatement" of portions of ISO 8879 is proposed as a means of developing and testing presentation techniques for the ISO 8879 revision.


23 April 1999


SC34 and liaisons

Proposal for a Restatement of ISO 8879 with Web SGML Adaptations and Extended Naming Rules

Copyright (C)1999 Charles F. Goldfarb. I hereby grant the ISO, its member bodies, liaisons, affiliates, committees, their subdivisions, individual members and employees of all of the above, and all other participants in International Standards development, the nonexclusive perpetual right to use this contribution for the purpose of International Standards development, including incorporation into International Standards that may be copyrighted. I retain my copyright and all other rights in the content of this document as contributed, including publication rights, and I expressly acknowledge that making this contribution does not entitle me to any rights in International Standards that may be derived from it.

The author proposes that a document be developed informally to test what will hopefully be an improved and simplified style of expressing the revised ISO 8879. The subject of the test would be the body and normative annexes of ISO 8879 as modified when Annexes J and K are supported, and portions of the SGML Extended Facilities. Certain optional features (e.g., CONCUR, LINK, DATATAG, RANK) could be ignored.

The document, when completed, would be a restatement of the above-described portion of ISO 8879. Errors in the standard will have been corrected in it and ambiguities will have been clarified, which means (by definition) that some systems that claim conformance to the International Standard would not conform to the restatement while others (hopefully the majority) would. These changes should be documented in the restatement.

The author offers the following "strawman" as a starting point for this work. He extends his thanks to Lynne Price, Yasuhiro Okui, and Martin Bryan for their many helpful suggestions.

1 Introduction

This Restatement is the specification for the Standard Generalized Markup Language (SGML), a standardized means of representing information in digital form.

SGML deals with the representation of documents: collections of information that are intended for human perception. Documents are potentially compound datatypes, also called data structures. They can contain character text, static graphics, and/or multimedia data objects.

Document representation involves three domains: notational, conceptual, and physical.

The notational domain is where the syntax of a representation resides. In the case of SGML, that means text strings containing markup (tags, markup declarations, delimiters, etc.) and data characters.

The conceptual domain is that of the objects and properties that the notation describes. For SGML, that means documents, elements, element types, attributes, document types, etc., as well as references to objects (such as graphics and multimedia) that may be represented in notations other than SGML.

The physical domain relates to storage of the objects containing the notation strings. For SGML documents, that means SGML text strings and non-SGML text and bit strings. Physical domain constructs include characters, entities, storage objects, etc.

This Restatement attempts to distinguish the three domains clearly at all times. (This was not an objective for ISO 8879.)

1.1 Conceptual Domain

Conceptual objects exist in people's minds; we don't know how they are represented there. Computers can only deal with representations of conceptual objects.

Two forms of computer representation have proven to be useful: a "native" or directly processable form, which is described in this section (1.1), and a portable or "string" form, which is described in the next section (1.2).

When computers purport to process conceptual objects, the programs actually construct internal representations of them. Such representations allow the programs to address the conceptual objects and their properties natively, and to navigate among them. A native representation may be temporary or persistent, and either proprietary to the program that creates it or common to a set of programs.

This Restatement provides a standardized common model for native representation of conceptual objects, known as a "grove". A grove is a graph structure with multiple classes of arcs, where only one class indicates the parent-child relationship and the other classes of arcs indicate other relationships.

As a result, a grove can be viewed as a set of disjoint trees, the nodes of which can be connected to nodes in other trees by arcs indicating relationships other than parent-child. For example, a grove representing an SGML document typically has one tree (the "content tree") representing the parent-child relationships of the element hierarchy and additional trees for the attributes of each element.

The arc connecting an element node to its attributes is not a parent-child arc, and therefore the attribute nodes are not considered siblings of the subelement nodes (as they would be in the traditional "parse tree" view of documents). A content tree in a grove can therefore be addressed and traversed without fear of stumbling into the attribute trees, and vice versa.
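The grove structure described above can be sketched in a few lines of Python. This is an illustration only, not a definition: the class name Node, the arc class name "attributes", and the function walk_content are hypothetical names chosen for this sketch. It shows parent-child arcs kept apart from the other classes of arcs, so that a traversal of the content tree never reaches an attribute tree.

```python
class Node:
    """One grove node (hypothetical sketch, not a standardized interface)."""

    def __init__(self, kind, name="", value=""):
        self.kind = kind        # e.g. "element", "attribute", "data"
        self.name = name
        self.value = value
        self.parent = None      # set only by a parent-child arc
        self.children = []      # the single class of parent-child arcs
        self.related = {}       # every other arc class, keyed by class name

    def add_child(self, child):
        """Connect child to this node with a parent-child arc."""
        child.parent = self
        self.children.append(child)
        return child

    def relate(self, arc_class, target):
        """Connect target with a non-parent-child arc of the given class."""
        self.related.setdefault(arc_class, []).append(target)
        return target


def walk_content(node):
    """Depth-first traversal that follows parent-child arcs only."""
    yield node
    for child in node.children:
        yield from walk_content(child)


# A content tree: doc > para > data
doc = Node("element", "doc")
para = doc.add_child(Node("element", "para"))
para.add_child(Node("data", value="Hello"))

# The attribute node is the root of its own tree, reached by a different arc class.
attr = para.relate("attributes", Node("attribute", "id", "p1"))

content_kinds = [n.kind for n in walk_content(doc)]
```

Here content_kinds lists only the two elements and the data node. The attribute node is reachable solely through para.related["attributes"], and its parent remains unset because no parent-child arc leads to it.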

1.2 Notational Domain

Representations of conceptual objects that permit native addressing are typically too complex for persistent storage or interchange (particularly if humans are to interact directly with the representation). For those purposes, a notation that can be stored, transmitted, and processed as a simple bit string or character string is more appropriate.

Bit string ("binary") notations are usually used for graphics and multimedia data objects. Character-based notations are usually used for documents as a whole and, of course, for character data.

The raw character string that conforms to a notation is called the "text". It is parsed in order to distinguish markup from data and to interpret the markup to glean information about the conceptual objects.
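As a rough illustration of parsing text to distinguish markup from data, the following Python sketch splits a string into markup and data pieces. It is deliberately naive and rests on assumptions not made by this Restatement: it treats anything between "<" and ">" as markup, and it ignores markup declarations, entity references, markup minimization, and alternative delimiters. The names TAG and split_markup are hypothetical.

```python
import re

# Naive assumption: a run from "<" to the next ">" is markup.
TAG = re.compile(r"(<[^>]+>)")


def split_markup(text):
    """Return (kind, string) pairs, where kind is "markup" or "data"."""
    tokens = []
    for piece in TAG.split(text):
        if not piece:
            continue  # re.split yields empty strings around adjacent tags
        kind = "markup" if TAG.fullmatch(piece) else "data"
        tokens.append((kind, piece))
    return tokens


sample = "<p>Hello <em>world</em></p>"
tokens = split_markup(sample)
```

Joining the string parts of the tokens in order reproduces the original text, which reflects the point that the text is a single character string in which markup and data are interleaved.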

One parser's data can be another parser's text. For example, the data content of an element in SGML can be in a data notation. This may be indicated by use of a "notation" attribute and the notation declaration associated with its value. Where a data notation is not indicated, the characters are considered to have the meaning assigned by the applicable character set.

Moreover, the notation that is the SGML language is actually made up of several sub-notations: markup declarations that represent DTDs, formal public identifier and URN notations that represent public identifiers, etc.

A character string that describes a complete conceptual document in the manner specified in this Restatement is known as an "SGML document string". The conceptual object that it describes is known as an "SGML document".

1.2.1 Characters

A character is an idea that is represented in a computer by first associating it with a number ("character number"). The number is unique within a set of such numbers, each of which is mapped to a single character, the whole set of such mappings forming a "character set".

A character number is represented in a computer by a bit string ("bit combination"). The way in which the character numbers of a character set are represented is known as the "encoding" of the character set. A large character set may allow several possible encodings, in order to optimize storage and transmission when only a subset of the character set is in use.

Characters are in the conceptual domain, character numbers are part of the notational domain, and encodings are in the physical domain. However, it is customary to speak of "characters" in the notational domain even though character numbers are normally what is meant.
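The distinction among character, character number, and encoding can be seen in a short Python sketch. The choice of Unicode as the character set, and of UTF-8 and UTF-16 as two of its encodings, is an assumption made for illustration; this Restatement does not prescribe them.

```python
# The character (conceptual domain): LATIN SMALL LETTER E WITH ACUTE.
ch = "\u00e9"

# Its character number in the Unicode character set (notational domain).
number = ord(ch)

# Two different bit combinations for the same character number (physical domain).
utf8 = ch.encode("utf-8")        # two bytes
utf16 = ch.encode("utf-16-be")   # two bytes, but different ones

# Decoding either encoding recovers the same character.
same = utf8.decode("utf-8") == utf16.decode("utf-16-be") == ch
```

The character number is 233 in both cases; only the bit combinations differ (C3 A9 in UTF-8, 00 E9 in big-endian UTF-16).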

1.3 Physical Domain

Notation strings are ultimately stored at real storage locations, either persistent (e.g., a file in an OS file system or object in a database) or transitory (e.g., the feed from a network source or the generated output of a program).

As real locations are subject to change, and it is desirable for persistent data to be free of system dependencies, SGML interposes a virtual storage object -- the entity -- between the notation strings and real storage. The entity exists in the conceptual domain as well as the physical, which allows properties to be associated with storage and accessed from the grove.

Entities are named and mapped onto real storage by a form of markup declaration known as an "entity declaration". The entity names are used within the SGML document string to refer to the entities.

1.3.1 SGML entities

An SGML entity is a sequence of characters (character string).

The entity in which an SGML document string begins and ends is known as an "SGML document entity". Other entities that contain portions of that SGML document string are known as "SGML text entities". The point at which the parser should access an SGML text entity is indicated in the SGML document string by an "entity reference" containing the entity's name.

The SGML document string is the sequence of characters in the order presented to the parser, beginning with the first character of the SGML document entity. Each entity reference to an SGML text entity is followed in the string by that entity's replacement text and then by the "entity-end" (Ee) signal, which marks the end of the replacement text.
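The construction of an SGML document string from its entities can be sketched as follows. This is a minimal illustration under stated assumptions: references are written as "&name;" (as in the reference concrete syntax), the dictionary named entities stands in for entity declarations that have already been resolved to their replacement text, and the names EE, REFERENCE, and document_string are hypothetical.

```python
import re

EE = object()  # stand-in for the entity-end (Ee) signal, which is not a character

# Assumed reference form: "&", a name, then ";".
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def document_string(text, entities, open_names=()):
    """Yield the pieces of the document string in the order a parser sees them."""
    pos = 0
    for match in REFERENCE.finditer(text):
        if match.start() > pos:
            yield text[pos:match.start()]      # characters before the reference
        name = match.group(1)
        if name in open_names:
            raise ValueError(f"entity {name!r} refers to itself")
        yield match.group(0)                   # the entity reference itself
        # The replacement text follows, and may contain further references.
        yield from document_string(entities[name], entities, open_names + (name,))
        yield EE                               # end of the replacement text
        pos = match.end()
    if pos < len(text):
        yield text[pos:]                       # characters after the last reference


entities = {"co": "Example Corp", "sig": "Regards, &co;"}
pieces = list(document_string("Dear reader. &sig; Bye.", entities))
```

In this example the two Ee signals appear back to back, because the replacement text of "co" ends at the same point as the replacement text of "sig" that contains it. A reference to an undeclared name raises KeyError in this sketch; a real parser would report an error in its own way.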

1.3.2 Data entities

Entities that contain representations of other kinds of conceptual objects that are part of an SGML document are "data entities". The representations are either in character-based notations other than SGML, or in binary notations. Data entities are not parsed as part of an SGML document string; however, their names can be used in the values of attributes so that the objects they represent can be processed as part of the conceptual document.

1.3.3 Subdocuments

A conceptual SGML document can also contain other SGML documents ("subdocuments"). Like all SGML documents, subdocuments are represented as SGML document strings and stored in SGML document entities (and possibly also in SGML text entities). In other respects they are treated like data entities. That is, they are not parsed as part of the SGML document string in which they are declared; however, their names can be used in the values of attributes so that the documents that they represent can be processed as part of the conceptual document.