SAX parsing and serialization

This chapter describes CXML's SAX-like parser interface.

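The SAX layer is a central concept in CXML that users will encounter in various situations: parsing (for example into DOM, using a SAX-based DOM builder), serialization through sinks, and chains of handlers that transform events before passing them on.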

However, SAX events are easier to generate than to process. That is why CXML offers Klacks, a "pull-based" API, in addition to SAX. Klacks events are generally easier to process than to generate. Please refer to the Klacks documentation for details.

Parsing and Validating

Old-style convenience functions:

Function CXML:PARSE-FILE (pathname handler &key ...)

Same as cxml:parse with a pathname argument. (But note that cxml:parse-file interprets string arguments as namestrings, while cxml:parse expects literal XML documents.)

Function CXML:PARSE-STREAM (stream handler &key ...)

Same as cxml:parse with a stream argument.

Function CXML:PARSE-OCTETS (octets handler &key ...)

Same as cxml:parse with an octet vector argument.

Function CXML:PARSE-ROD (rod handler &key ...)

Same as cxml:parse with a string argument.

New all-in-one parser interface:

Function CXML:PARSE (input handler &key ...)

Parse an XML document, where input is a string, pathname, octet vector, or stream. Return values from this function depend on the SAX handler used.
Arguments:

Common keyword arguments:

Note: parse-rod assumes that the input has already been decoded into Unicode runes and ignores the encoding specified in the XML declaration, if any.
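
The all-in-one interface can be sketched as follows. This is a minimal example, assuming CXML is already loaded and using cxml-dom:make-dom-builder (which also appears in later examples in this chapter) as the handler. The function name root-element-name and the sample document are illustrative only.

```lisp
;; Assumes the CXML system is already loaded.
(defun root-element-name (xml-string)
  "Parse XML-STRING (a literal document, not a namestring) into a DOM
tree and return the tag name of its document element."
  (let ((document (cxml:parse xml-string (cxml-dom:make-dom-builder))))
    ;; With a DOM builder as the handler, cxml:parse returns the
    ;; DOM document object.
    (dom:tag-name (dom:document-element document))))

(root-element-name "<greeting>hello</greeting>")
```

With a different handler, such as a sink, the same call to cxml:parse would return whatever that handler produces at the end of the document.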

Function CXML:PARSE-EMPTY-DOCUMENT (uri qname handler &key public-id system-id entity-resolver recode)

Simulate parsing a document with a document element qname having no attributes except for an optional namespace declaration to uri. If an external ID is specified (system-id, public-id), find, parse, and report this DTD as if with parse-file, using the specified entity resolver.

Function CXML:PARSE-DTD-FILE (pathname)
Function CXML:PARSE-DTD-STREAM (stream)
Parse declarations from a stand-alone file and return an object representing the DTD, suitable as an argument to validate.

Function CXML:MAKE-EXTID (publicid systemid)
Create an object representing the External ID composed of the specified Public ID (a rod or nil) and System ID (a URI object).

Condition class CXML:XML-PARSE-ERROR ()
Superclass of all conditions signalled by the CXML parser.

Condition class CXML:WELL-FORMEDNESS-VIOLATION (cxml:xml-parse-error)
This condition is signalled for all well-formedness violations. (Note that, when parsing a document that is not well-formed in validating mode, the parser might encounter validity errors before detecting well-formedness problems, so also be prepared for validity-error in that situation.)

Condition class CXML:VALIDITY-ERROR (cxml:xml-parse-error)
Reports the violation of a validity constraint.
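
As a hedged illustration of how these conditions might be handled, the following sketch turns a well-formedness violation into a boolean result. It assumes CXML is already loaded and passes nil as the handler, as the catalog example in this chapter does. The function name well-formed-p is hypothetical.

```lisp
;; Assumes the CXML system is already loaded.
(defun well-formed-p (xml-string)
  "Return true if XML-STRING parses without a well-formedness violation."
  (handler-case
      (progn
        ;; A nil handler discards all events; only errors matter here.
        (cxml:parse xml-string nil)
        t)
    (cxml:well-formedness-violation ()
      nil)))

(well-formed-p "<a><b/></a>")
(well-formed-p "<a><b></a>")
```

Code that parses in validating mode would typically handle cxml:xml-parse-error instead, so that validity errors are covered as well.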

Serialization

Serialization is performed using sink objects. There are different kinds of sinks for output to Lisp streams and vectors in various flavours.

Technically, sinks are SAX handlers that write XML output for SAX events sent to them. In practice, user code would normally not generate those SAX events manually, and instead use a function like dom:map-document or xmls-compat:map-node to serialize an in-memory document.

In addition to map-document, cxml has a set of convenience macros for serialization (see below for with-xml-output, with-element, etc).
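
A minimal sketch of serializing an in-memory document, assuming CXML is already loaded and running on a Lisp with Unicode support (so that cxml:make-string-sink is available). The function name reserialize is illustrative only.

```lisp
;; Assumes the CXML system is already loaded, on a Lisp with
;; Unicode support.
(defun reserialize (xml-string)
  "Parse XML-STRING into a DOM document and serialize it back to a string."
  (let ((document (cxml:parse xml-string (cxml-dom:make-dom-builder))))
    ;; The sink receives the SAX events generated from the DOM tree,
    ;; and the serialized string is returned as the result.
    (dom:map-document (cxml:make-string-sink) document)))

(reserialize "<a><b/></a>")
```

The exact output depends on the keyword arguments given to the sink, so no particular result is shown here.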

Portable sinks:
Function CXML:MAKE-OCTET-VECTOR-SINK (&rest keys) => sink
Function CXML:MAKE-OCTET-STREAM-SINK (stream &rest keys) => sink
Function CXML:MAKE-ROD-SINK (&rest keys) => sink

Only on Lisps with Unicode support:
Function CXML:MAKE-STRING-SINK -- alias for cxml:make-rod-sink
Function CXML:MAKE-CHARACTER-STREAM-SINK (stream &rest keys) => sink

Only on Lisps without Unicode support:
Function CXML:MAKE-STRING-SINK/UTF8 (&rest keys) => sink
Function CXML:MAKE-CHARACTER-STREAM-SINK/UTF8 (stream &rest keys) => sink

Return a SAX serialization handler.

Keyword arguments:

The following canonical values are allowed:

An internal subset will be included in the result regardless of the canonical setting. It is the responsibility of the caller to not report an internal subset for canonical <= 1, or only notations as required for canonical = 2. For example, the include-doctype argument to dom:map-document should be set to nil for the former behaviour and :canonical-notations for the latter.

With an indentation level, pretty-print the XML by inserting additional whitespace.  Note that indentation changes the document model and should only be used if whitespace does not matter to the application.

Macro CXML:WITH-XML-OUTPUT (sink &body body) => sink-specific result
Macro CXML:WITH-NAMESPACE ((prefix uri) &body body) => result
Macro CXML:WITH-ELEMENT (qname &body body) => result
Macro CXML:WITH-ELEMENT* ((prefix lname) &body body) => result
Function CXML:ATTRIBUTE (qname value) => value
Generic Function CXML:UNPARSE-ATTRIBUTE (value) => string
Function CXML:ATTRIBUTE* (prefix lname value) => value
Function CXML:TEXT (data) => data
Function CXML:CDATA (data) => data
Function CXML:DOCTYPE (name public-id system-id &optional internal-subset)
Convenience syntax for event-based serialization.

Example:

(with-xml-output (make-octet-stream-sink stream :indentation 2 :canonical nil)
  (with-element "foo"
    (attribute "xyz" "abc")
    (with-element "bar"
      (attribute "blub" "bla"))
    (text "Hi there.")))

Prints this to stream:

<foo xyz="abc">
  <bar blub="bla"></bar>
  Hi there.
</foo>
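
The namespace-aware variants can be combined in the same way. The following is a sketch under the signatures listed above, assuming CXML is already loaded on a Lisp with Unicode support. The prefix "ex", the URI, and the function name namespaced-greeting are made up for illustration.

```lisp
;; Assumes the CXML system is already loaded, on a Lisp with
;; Unicode support.
(defun namespaced-greeting ()
  "Serialize a single namespaced element to a string."
  (cxml:with-xml-output (cxml:make-string-sink)
    ;; Declare the (hypothetical) prefix "ex" for the body.
    (cxml:with-namespace ("ex" "http://example.org/ns")
      ;; with-element* takes the prefix and local name separately.
      (cxml:with-element* ("ex" "greeting")
        (cxml:attribute "lang" "en")
        (cxml:text "Hi there.")))))

(namespaced-greeting)
```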

Macro XHTML-GENERATOR:WITH-HTML (sink &rest forms)
Macro XHTML-GENERATOR:WRITE-DOCTYPE (sink)
Macro with-html is a modified version of Franz' htmlgen that works as a SAX driver for XHTML. It aims to be a plug-in replacement for the html macro.

xhtmlgen is included as contrib/xhtmlgen.lisp in the cxml distribution. Example:

(let ((sink (cxml:make-character-stream-sink *standard-output*)))
  (sax:start-document sink)
  (xhtml-generator:write-doctype sink)
  (xhtml-generator:with-html sink
    (:html
     (:head
      (:title "Titel"))
     (:body
      ((:p "style" "font-weight: bold")
       "Inhalt")
      (:ul
       (:li "Eins")
       (:li "Zwei")
       (:li "Drei")))))
  (sax:end-document sink))

Miscellaneous SAX handlers

Function CXML:MAKE-VALIDATOR (dtd root)
Create a SAX handler which validates against a DTD instance.  The document's root element must be named root.  Used with dom:map-document, this validates a document object as if by re-reading it with a validating parser, except that declarations recorded in the document instance are completely ignored.
Example:

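(let ((d (cxml:parse-file "~/test.xml" (cxml-dom:make-dom-builder)))
      (x (cxml:parse-dtd-file "~/test.dtd")))
  ;; "foo" is the expected name of the root element.  On Lisps without
  ;; Unicode support, pass a rod instead of a string.
  (dom:map-document (cxml:make-validator x "foo") d))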

Class CXML:BROADCAST-HANDLER ()
Accessor CXML:BROADCAST-HANDLER-HANDLERS
Function CXML:MAKE-BROADCAST-HANDLER (&rest handlers)
broadcast-handler is a SAX handler which passes every event it receives on to each of several chained handlers, somewhat similar to the way a broadcast-stream works.

You can subclass broadcast-handler to modify the events before they are passed on. Define methods on your handler class for the events to be modified. All other events will pass through to the chained handlers unmodified.

Broadcast handler functions return the result of calling the event function on the last handler in the list. In particular, the overall result from sax:end-document will be ignored for all other handlers.
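
As a sketch of the subclassing approach, the following handler drops comments and forwards everything else to a string sink. It assumes CXML is already loaded on a Lisp with Unicode support. The class name comment-dropper and the function strip-comments are hypothetical.

```lisp
;; Assumes the CXML system is already loaded, on a Lisp with
;; Unicode support.
(defclass comment-dropper (cxml:broadcast-handler)
  ()
  (:documentation "Forwards all SAX events except comments."))

(defmethod sax:comment ((handler comment-dropper) data)
  ;; Overriding the event without forwarding it removes comments
  ;; from the event stream seen by the chained handlers.
  (declare (ignore data))
  nil)

(defun strip-comments (xml-string)
  "Return XML-STRING reserialized without its comments."
  (let ((filter (make-instance 'comment-dropper)))
    (setf (cxml:broadcast-handler-handlers filter)
          (list (cxml:make-string-sink)))
    ;; The result comes from the last (here: only) chained handler.
    (cxml:parse xml-string filter)))

(strip-comments "<a><!-- note -->text</a>")
```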

Class CXML:SAX-PROXY (broadcast-handler)
Accessor CXML:PROXY-CHAINED-HANDLER
sax-proxy is a subclass of broadcast-handler which sends events to exactly one chained handler. This class is still included for compatibility with older versions of CXML which did not include the more general broadcast-handler yet, but has been retrofitted as a subclass of the latter.

Function CXML:MAKE-NAMESPACE-NORMALIZER (next-handler)

Return a SAX handler that performs DOM 3-style namespace normalization on attribute lists in start-element events before passing them on to the next handler.

Function CXML:MAKE-WHITESPACE-NORMALIZER (chained-handler &optional dtd)
Return a SAX handler which removes whitespace from elements that have element content and have not been declared to preserve space using an xml:space attribute.

Example:

(cxml:parse-file "example.xml"
                 (cxml:make-whitespace-normalizer (cxml-dom:make-dom-builder))
                 :validate t)

Example input:

<!DOCTYPE test [
<!ELEMENT test (foo,bar*)>
<!ATTLIST test a CDATA #IMPLIED>
<!ELEMENT foo (#PCDATA)>
<!ELEMENT bar (foo?)>
<!ATTLIST bar xml:space (default|preserve) "default">
]>
<test a='b'>
  <foo>   </foo>
  <bar>   </bar>
  <bar xml:space="preserve">   </bar>
</test>

Example result:

<test a="b"><foo>   </foo><bar></bar><bar xml:space="preserve">   </bar></test>

Recoders

Recoders are a mechanism used by CXML internally on Lisp implementations without Unicode support to recode UTF-16 vectors (rods) of integers (runes) into UTF-8 strings.

User code does not usually need to deal with recoders in current versions of CXML.

Function CXML:MAKE-RECODER (chained-handler recoder-fn)
Return a SAX handler which passes all events on to chained-handler after converting all strings and rods using recoder-fn, a function of one argument.

Caching of DTD Objects

To avoid spending time parsing the same DTD over and over again, CXML can cache DTD objects. Whenever the parser looks for an external subset in a document that has no internal subset, it consults cxml:*dtd-cache* and uses the cached DTD instance if one is present for the System ID in question.

Note that DTDs do not expire from the cache automatically. (Future versions of CXML might introduce automatic checks for outdated DTDs.)

Variable CXML:*DTD-CACHE*
The DTD cache object consulted by the parser when it needs a DTD.

Function CXML:MAKE-DTD-CACHE ()
Return a new, empty DTD cache object.

Variable CXML:*CACHE-ALL-DTDS*
If true, instructs the parser to enter all DTDs that could have been cached into *dtd-cache* if they were not cached already. Defaults to nil.

Reader CXML:GETDTD (uri dtd-cache)
Return a cached instance of the DTD at uri, if present in the cache, or nil.

Writer CXML:GETDTD (uri dtd-cache)
Enter a new value for uri into dtd-cache.

Function CXML:REMDTD (uri dtd-cache)
Ensure that no DTD is recorded for uri in the cache and return true if such a DTD was present.

Function CXML:CLEAR-DTD-CACHE (dtd-cache)
Remove all entries from dtd-cache.
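
One way the cache variables might be used is sketched below: a batch of files is parsed in validating mode with a fresh cache bound for the duration of the call. This assumes CXML is already loaded. The function name parse-files-with-dtd-cache and the file names in the usage line are hypothetical.

```lisp
;; Assumes the CXML system is already loaded.
(defun parse-files-with-dtd-cache (pathnames make-handler)
  "Parse each file in PATHNAMES in validating mode, sharing one fresh
DTD cache.  MAKE-HANDLER is called once per file to create its handler.
Return the list of parse results."
  (let ((cxml:*dtd-cache* (cxml:make-dtd-cache))
        ;; Enter every cacheable DTD into the cache on first use.
        (cxml:*cache-all-dtds* t))
    (mapcar (lambda (pathname)
              (cxml:parse-file pathname (funcall make-handler) :validate t))
            pathnames)))

;; Hypothetical usage:
;; (parse-files-with-dtd-cache '(#p"a.xml" #p"b.xml")
;;                             #'cxml-dom:make-dom-builder)
```

Because the cache is bound dynamically, its entries are discarded when the function returns, which sidesteps the lack of automatic expiry for this use case.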

fixme: thread-safety

Location information

Class SAX:SAX-PARSER ()
A class providing location information through an implementation-specific subclass. Parsers will use sax:register-sax-parser to pass their parser instance to the handler. The easiest way to receive sax-parser instances is to inherit from sax-parser-mixin when defining a sax handler.

Class SAX:SAX-PARSER-MIXIN ()
A mixin for sax handler classes that records the sax-parser object for use with the following functions. Trampoline methods are provided that allow those functions to be called directly on the sax-parser-mixin.

Function SAX:SAX-PARSER (sax-parser-mixin) => sax-parser
Return the sax-parser instance recorded by this handler, or NIL.

Function SAX:LINE-NUMBER (sax-parser)
Return an approximation of the current line number, or NIL.

Function SAX:COLUMN-NUMBER (sax-parser)
Return an approximation of the current column number, or NIL.

Function SAX:SYSTEM-ID (sax-parser)
Return the URI of the document being parsed. This is either the main document, or the entity's system ID while contents of a parsed general external entity are being processed.

Function SAX:XML-BASE (sax-parser)
Return the [Base URI] of the current element. This URI can differ from the value returned by sax:system-id if xml:base attributes are present.
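
A sketch of a handler that records where each element starts, assuming CXML is already loaded. It inherits from sax-parser-mixin and calls the location functions directly on the handler through the trampoline methods. The class name element-locator and its positions slot are hypothetical.

```lisp
;; Assumes the CXML system is already loaded.
(defclass element-locator (sax:sax-parser-mixin)
  ((positions :initform '() :accessor positions))
  (:documentation "Collects (qname line column) for each start tag."))

(defmethod sax:start-element ((handler element-locator)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name attributes))
  ;; Line and column are approximations and may be NIL.
  (push (list qname
              (sax:line-number handler)
              (sax:column-number handler))
        (positions handler)))

(defmethod sax:end-document ((handler element-locator))
  ;; The value returned here becomes the result of the parse call.
  (reverse (positions handler)))

(cxml:parse (format nil "<a>~%  <b/>~%</a>")
            (make-instance 'element-locator))
```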

XML Catalogs

External entities (for example, DTDs) are referred to using their Public and System IDs. Usually the System ID, a URI, is used to locate the entity. CXML itself handles only file://-URIs, but many System IDs in practical use are http://-URIs. There are two different mechanisms applications can use to allow CXML to locate entities using arbitrary Public IDs or System IDs: user-defined entity resolvers (see the entity-resolver keyword argument of the parser functions) and XML Catalogs.

This section describes XML Catalogs, the second solution. CXML implements OASIS XML Catalogs.

Variable CXML:*CATALOG*
The XML Catalog object consulted by the parser before trying to open an entity. Initially nil.

Variable CXML:*PREFER*
The default "prefer" mode from the Catalog specification, one of :public or :system. Defaults to :public.

Function CXML:MAKE-CATALOG (&optional uris)
Return a catalog object for the catalog files specified.

Function CXML:RESOLVE-URI (uri catalog)
Look up uri in catalog and return the resulting URI, or nil if no match was found.

Function CXML:RESOLVE-EXTID (publicid systemid catalog)
Look up the External ID (publicid, systemid) in catalog and return the resulting URI, or nil if no match was found.

Example:

* (setf cxml:*catalog* nil)
* (cxml:parse-file "test.xhtml" nil)
=> Error: URI scheme :HTTP not supported

* (setf cxml:*catalog* (cxml:make-catalog))
* (cxml:parse-file "test.xhtml" nil)
;; no error!
NIL

Note that parsed catalog files are cached in the catalog object. Cached catalog files do not expire automatically. To ensure that all catalog files are parsed again, create a new catalog object.

SAX Interface

A SAX handler is an arbitrary object that implements some of the generic functions in the SAX package.  Note that no default handler class is necessary, because all generic functions have default methods which do nothing.  SAX functions are:

Function SAX:START-DOCUMENT (handler)
Function SAX:END-DOCUMENT (handler)

Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:END-ELEMENT (handler namespace-uri local-name qname)
Function SAX:START-PREFIX-MAPPING (handler prefix uri)
Function SAX:END-PREFIX-MAPPING (handler prefix)
Function SAX:PROCESSING-INSTRUCTION (handler target data)
Function SAX:COMMENT (handler data)
Function SAX:START-CDATA (handler)
Function SAX:END-CDATA (handler)
Function SAX:CHARACTERS (handler data)

Function SAX:START-DTD (handler name public-id system-id)
Function SAX:END-DTD (handler)
Function SAX:START-INTERNAL-SUBSET (handler)
Function SAX:END-INTERNAL-SUBSET (handler)
Function SAX:UNPARSED-ENTITY-DECLARATION (handler name public-id system-id notation-name)
Function SAX:EXTERNAL-ENTITY-DECLARATION (handler kind name public-id system-id)
Function SAX:INTERNAL-ENTITY-DECLARATION (handler kind name value)
Function SAX:NOTATION-DECLARATION (handler name public-id system-id)
Function SAX:ELEMENT-DECLARATION (handler name model)
Function SAX:ATTRIBUTE-DECLARATION (handler ename aname type default)

Accessor SAX:ATTRIBUTE-PREFIX (attribute)
Accessor SAX:ATTRIBUTE-NAMESPACE-URI (attribute)
Accessor SAX:ATTRIBUTE-LOCAL-NAME (attribute)
Accessor SAX:ATTRIBUTE-QNAME (attribute)
Accessor SAX:ATTRIBUTE-SPECIFIED-P (attribute)
Accessor SAX:ATTRIBUTE-VALUE (attribute)

Function SAX:FIND-ATTRIBUTE (qname attributes)
Function SAX:FIND-ATTRIBUTE-NS (uri lname attributes)

The entity declaration methods are similar to Java SAX definitions, but parameter entities are distinguished from general entities not by a % prefix to the name, but by the kind argument, either :parameter or :general.

The arguments to sax:element-declaration and sax:attribute-declaration differ significantly from their Java counterparts.
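
To illustrate the interface, here is a minimal sketch of a custom handler that collects the values of href attributes. It assumes CXML is already loaded on a Lisp with Unicode support and relies on the default do-nothing methods for all events it does not implement. The names href-collector and collect-hrefs are hypothetical.

```lisp
;; Assumes the CXML system is already loaded, on a Lisp with
;; Unicode support.
(defclass href-collector ()
  ((hrefs :initform '() :accessor hrefs))
  (:documentation "Collects the values of all href attributes."))

(defmethod sax:start-element ((handler href-collector)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname))
  ;; find-attribute looks up an attribute by its qualified name.
  (let ((attribute (sax:find-attribute "href" attributes)))
    (when attribute
      (push (sax:attribute-value attribute) (hrefs handler)))))

(defmethod sax:end-document ((handler href-collector))
  ;; Return the collected values in document order.
  (reverse (hrefs handler)))

(defun collect-hrefs (xml-string)
  "Return the href attribute values found in XML-STRING."
  (cxml:parse xml-string (make-instance 'href-collector)))

(collect-hrefs "<a href='x'><b href='y'/><c/></a>")
```

Since the parse functions return the value of sax:end-document, the handler can hand its accumulated state back to the caller without any extra plumbing.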

For more information on these functions, refer to their docstrings.