Generated text, white space handling and the <x> element

The x element was originally introduced in the Blackwell Publishing Group DTD 3. The concept of retaining inter-element punctuation in an SGML/XML document is relatively unusual and we are often asked why we do it. Here's why...

Generated text

Many SGML and XML DTDs consider punctuation to be generated text, i.e. the required punctuation is generated by stylesheet rules and is not stored in the document. The disadvantages of relying on stylesheet rules to create generated text are:

  1. The XML document is no longer a 'standalone' document
  2. The generation rules need to be stored along with the document throughout its life
  3. The document cannot be read until a process has applied the punctuation rules to it
  4. It can be quite inefficient if different rules/templates have to be created to reflect differing punctuation styles across a store of documents

Storing the generated text in an x element in the XML document means that the XML fragment can easily be converted to, for example, simple text, a typesetting format or HTML without the need for complicated templates or rules. If existing rules for generating punctuation are already in place, or if more 'abstract' XML is required, then the contents of the <x> element can be ignored and the rules applied.
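
The two ways of reading such a fragment can be sketched in Python with the standard library's xml.etree.ElementTree. This is a minimal illustration, not part of the DTD: the element names keywords and keyword and the sample content are assumptions made for the example, and only the x element is taken from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "keywords" and "keyword" are illustrative names;
# only <x> comes from the text above.
FRAGMENT = (
    "<keywords><keyword>alpha genes</keyword><x>, </x>"
    "<keyword>gene expression</keyword><x>.</x></keywords>"
)


def to_plain_text(elem):
    """Concatenate every text node, including the punctuation stored in <x>."""
    return "".join(elem.itertext())


def text_without_x(elem):
    """Concatenate text nodes but skip the contents of every <x> element."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "x":
            parts.append(text_without_x(child))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
print(to_plain_text(root))   # alpha genes, gene expression.
print(text_without_x(root))  # alpha genesgene expression
```

The first function needs no punctuation rules at all. The second leaves the content bare, so that existing rules can supply their own separators.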

Certain elements, e.g. keyword, abstract and bibliography, have headings ("Keywords", "Abstract", "References" or "Bibliography") which are implied by their parent element and are usually generated text. If a heading is implied like this, its implicit attribute value should be set to yes so that the heading can be displayed or suppressed as necessary.

White space handling

XML processors (e.g. XSL processors like MSXML) typically discard white space that stands on its own between elements, which can cause spacing problems in running text. For example, an XSL processor would not preserve the white space between the forenames and surname elements below:

<name type="author"><forenames>Norman</forenames> <surname>Jones</surname></name>

An XSL processor would consider the parsed contents of name to be:

NormanJones

To ensure that this significant white space is preserved, the space can be tagged in DTD 4 as <x> </x>, as follows:

<name type="author"><forenames>Norman</forenames><x> </x><surname>Jones</surname></name>

Because the x element in DTD 4 has been given an xml:space attribute whose default value is 'preserve', XML processors should retain this space in accordance with the definition of xml:space given at http://www.w3.org/TR/2000/REC-xml-20001006#sec-white-space.
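
The effect can be imitated in Python. The sketch below is a simulation of a white-space-stripping processor, not the behaviour of MSXML or of any particular XSL engine. The helper names are hypothetical, and xml:space="preserve" is written out explicitly in the sample because this sketch does not read the DTD that would normally supply the default.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def strip_ignorable_space(elem, preserve=False):
    """Drop white-space-only text unless xml:space='preserve' is in scope."""
    setting = elem.get(XML_SPACE)
    if setting is not None:
        preserve = setting == "preserve"
    if not preserve and elem.text is not None and not elem.text.strip():
        elem.text = None
    for child in elem:
        strip_ignorable_space(child, preserve)
        # A child's tail is content of elem, so elem's setting applies to it.
        if not preserve and child.tail is not None and not child.tail.strip():
            child.tail = None


def flattened_text(xml_fragment):
    """Parse a fragment, strip ignorable space, and return the running text."""
    root = ET.fromstring(xml_fragment)
    strip_ignorable_space(root)
    return "".join(root.itertext())


UNTAGGED = (
    '<name type="author"><forenames>Norman</forenames> '
    "<surname>Jones</surname></name>"
)
TAGGED = (
    '<name type="author"><forenames>Norman</forenames>'
    '<x xml:space="preserve"> </x><surname>Jones</surname></name>'
)

print(flattened_text(UNTAGGED))  # NormanJones
print(flattened_text(TAGGED))    # Norman Jones
```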

White space around entities

Character entities such as &alpha; in DTD 4 are resolved to symbol elements by bpg4-0entities.mod:

<!ENTITY alpha      "<symbol name='alpha'>&#945;</symbol>"   >

In practice, this means that the following text

... the conclusions were</b> &alpha; genes were ...

would resolve to

... the conclusions were</b> <symbol name='alpha'>&#945;</symbol> genes were ...

The space between </b> and <symbol> would not be preserved by an XML processor, because after resolution it stands alone between two elements. To prevent this, significant spaces around entities should be tagged as <x>...</x>, i.e.

... the conclusions were</b><x> </x>&alpha; genes were ...
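
Spaces of this kind can be located mechanically. The following Python sketch, offered as one possible check rather than an established tool, reports every pair of sibling elements that are separated by white space only. The function name is hypothetical, the samples use the resolved form of the entity, and the enclosing p element and the opening <b> tag are assumptions added to make the fragments well-formed.

```python
import xml.etree.ElementTree as ET


def untagged_spaces(elem):
    """Return (before, after) tag pairs separated only by white space."""
    found = []
    children = list(elem)
    for child, following in zip(children, children[1:]):
        # A white-space-only tail followed by a sibling element is at risk.
        if child.tail and not child.tail.strip():
            found.append((child.tag, following.tag))
    for child in children:
        found.extend(untagged_spaces(child))
    return found


# The <p> wrapper and the opening <b> tag are assumed for illustration.
RESOLVED = (
    "<p>... <b>the conclusions were</b> "
    '<symbol name="alpha">&#945;</symbol> genes were ...</p>'
)
TAGGED = (
    "<p>... <b>the conclusions were</b><x> </x>"
    '<symbol name="alpha">&#945;</symbol> genes were ...</p>'
)

print(untagged_spaces(ET.fromstring(RESOLVED)))  # [('b', 'symbol')]
print(untagged_spaces(ET.fromstring(TAGGED)))    # []
```

A check like this would also report indentation in pretty-printed files, so it is best suited to running text that has no layout white space.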
