SAX parsing and serialization
This chapter describes CXML's SAX-like parser interface.
The SAX layer is an important concept in CXML that users will
encounter in various situations:
-
To parse into DOM, use the SAX parser as described below with
a DOM builder as the SAX handler. (Refer to make-dom-builder for information about
DOM.)
-
Serialization is done using SAX, too. SAX handlers that
process and consume events without sending them to another
handler are called sinks in CXML. Serialization sinks
write XML output for the events they receive. For example, to
serialize DOM, use map-document to turn the DOM
document into SAX events together with a sink for
serialization.
-
SAX handlers can be chained together. Various SAX handlers
are offered that can be used in this way, transforming SAX
events before handing them to the next handler. This includes
handlers for whitespace removal, namespace
normalization, and rod-to-string recoding.
However, SAX events are easier to generate than to process. That
is why CXML also offers Klacks, a "pull-based" API, in addition to SAX.
Klacks events are generally easier to process than to generate.
Please refer to the Klacks documentation
for details.
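The first two points above can be sketched as a single round trip: the parser feeds a DOM builder, and dom:map-document replays the resulting document into a serialization sink. This is a minimal sketch, not a definitive recipe. It assumes a Lisp with Unicode support (so that cxml:make-string-sink is available); the function name round-trip is hypothetical.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defun round-trip (xml-string)
  "Parse XML-STRING into a DOM document, then serialize it back to a string.
ROUND-TRIP is an illustrative name, not part of CXML."
  (let ((document (cxml:parse xml-string (cxml-dom:make-dom-builder))))
    ;; dom:map-document walks the DOM and replays it as SAX events;
    ;; the string sink consumes those events and returns the XML text.
    (dom:map-document (cxml:make-string-sink) document)))

(round-trip "<greeting kind=\"plain\">hello</greeting>")
```

The same sink could be passed to cxml:parse directly when no in-memory document is needed.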
Parsing and Validating
Old-style convenience functions:
Function CXML:PARSE-FILE (pathname handler &key ...)
Same as cxml:parse with a pathname argument.
(But note that cxml:parse-file interprets string
arguments as namestrings, while cxml:parse expects
literal XML documents.)
Function CXML:PARSE-STREAM (stream handler &key ...)
Same as cxml:parse with a stream argument.
Function CXML:PARSE-OCTETS (octets handler &key ...)
Same as cxml:parse with an octet vector argument.
Function CXML:PARSE-ROD (rod handler &key ...)
Same as cxml:parse with a string argument.
New all-in-one parser interface:
Function CXML:PARSE (input handler &key ...)
Parse an XML document, where input is a string, pathname, octet
vector, or stream.
Return values from this function depend on the SAX handler used.
Arguments:
-
input -- one of:
-
pathname -- a Common Lisp pathname.
Open the file specified by the pathname and parse its
contents as an XML document.
-
stream -- a Common Lisp stream with element-type
(unsigned-byte 8).
-
octets -- an (unsigned-byte 8) array.
The array is parsed directly, and interpreted according to the
encoding it specifies.
-
string/rod -- a rod (or string on
unicode-capable implementations).
Parses an XML document from the input string that has already
undergone external-format decoding.
-
handler -- a SAX handler
Common keyword arguments:
-
validate -- A boolean. Defaults to
nil. If true, parse in validating mode, i.e. assert that
the document contains a DOCTYPE declaration and conforms to the
DTD declared.
-
dtd -- unless nil, an extid instance
specifying the external subset to load. This option overrides
the extid specified in the document type declaration, if any.
See below for make-extid. This option is useful
for verification purposes together with the root
and disallow-internal-subset arguments.
-
root -- the expected root element
name, or nil (the default).
-
entity-resolver -- nil or a function of two
arguments which is invoked for every entity referenced by the
document with the entity's Public ID (a rod) and System ID (a
URI object) as arguments. The function may return
nil, in which case CXML will try to resolve the entity as usual.
Alternatively it may return a Common Lisp stream specialized on
(unsigned-byte 8) which will be used instead. (It may
also signal an error, of course, which can be useful to prohibit
parsed XML documents from including arbitrary files readable by
the parser.)
-
disallow-internal-subset -- a boolean. If true, signal
an error if the document contains an internal subset.
-
recode -- a boolean. (Ignored on Lisps with Unicode
support.) Recode rods to UTF-8 strings. Defaults to true.
Make sure to use utf8-dom:make-dom-builder if this
option is enabled and rune-dom:make-dom-builder
otherwise.
Note: parse-rod assumes that the input has already been
decoded into Unicode runes and ignores the encoding
specified in the XML declaration, if any.
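As a hedged illustration of the entity-resolver argument described above, the following sketch uses a resolver that signals an error for every external entity, which is one way to keep untrusted documents from pulling in arbitrary files. The names forbid-external-entities and parse-untrusted are hypothetical, and the test of the result assumes a Lisp with Unicode support, where rods are strings.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defun forbid-external-entities (public-id system-id)
  "Entity resolver that refuses to open any external entity."
  (declare (ignore public-id))
  (error "Refusing to load external entity ~A" system-id))

(defun parse-untrusted (xml-string)
  "Parse XML-STRING into a DOM document without resolving external entities."
  (cxml:parse xml-string
              (cxml-dom:make-dom-builder)
              :entity-resolver #'forbid-external-entities))

;; A document without external entities never triggers the resolver.
(parse-untrusted "<doc>text</doc>")
```

A resolver that returns nil instead would fall back to CXML's usual lookup, and one that returns an (unsigned-byte 8) stream would substitute its contents.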
Function CXML:PARSE-EMPTY-DOCUMENT (uri qname handler &key public-id system-id entity-resolver recode)
Simulate parsing a document with a document element qname
having no attributes except for an optional namespace
declaration to uri. If an external ID is specified
(system-id, public-id), find, parse, and report
this DTD as if with parse-file, using the specified
entity resolver.
Function CXML:PARSE-DTD-FILE (pathname)
Function CXML:PARSE-DTD-STREAM (stream)
Parse declarations
from a stand-alone file and return an object representing the DTD,
suitable as an argument to validate.
-
pathname -- a Common Lisp pathname
-
stream -- a Common Lisp stream with element-type
(unsigned-byte 8)
Function CXML:MAKE-EXTID (publicid systemid)
Create an object representing the External ID composed
of the specified Public ID, a rod or nil, and System ID
(a URI object).
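The dtd, root, and disallow-internal-subset arguments are described above as useful together for verification. A rough sketch of that combination follows. It assumes the System ID can be built with puri:parse-uri (an assumption about the URI library in use); the function name and its parameters are hypothetical.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").
;; Assumption: URI objects are created with puri:parse-uri.

(defun parse-against-local-dtd (xml-pathname dtd-uri-string root-name)
  "Validate the file XML-PATHNAME against the DTD at DTD-URI-STRING,
ignoring whatever DOCTYPE the document itself declares."
  (cxml:parse-file xml-pathname
                   (cxml-dom:make-dom-builder)
                   :validate t
                   ;; Override the external subset named in the document.
                   :dtd (cxml:make-extid nil (puri:parse-uri dtd-uri-string))
                   ;; Require a specific root element name.
                   :root root-name
                   ;; Reject documents that try to supply their own declarations.
                   :disallow-internal-subset t))
```

The caller supplies the file name, the DTD location, and the expected root element; no particular paths are implied here.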
Condition class CXML:XML-PARSE-ERROR ()
Superclass of all conditions signalled by the CXML parser.
Condition class CXML:WELL-FORMEDNESS-VIOLATION (cxml:xml-parse-error)
This condition is signalled for all well-formedness violations.
(Note that, when parsing a document that is not well-formed in validating
mode, the parser might encounter validity errors before detecting
well-formedness problems, so also be prepared for validity-error
in that situation.)
Condition class CXML:VALIDITY-ERROR (cxml:xml-parse-error)
Reports the violation of a validity constraint.
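One possible way to distinguish these conditions is a handler-case with the more specific classes listed before their common superclass. This is only a sketch; try-parse and the keywords it returns are hypothetical.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defun try-parse (xml-string &key validate)
  "Return (values document :ok) on success, or (values nil reason condition)."
  (handler-case
      (values (cxml:parse xml-string
                          (cxml-dom:make-dom-builder)
                          :validate validate)
              :ok)
    ;; Specific subclasses come first; the superclass clause catches the rest.
    (cxml:well-formedness-violation (c)
      (values nil :not-well-formed c))
    (cxml:validity-error (c)
      (values nil :invalid c))
    (cxml:xml-parse-error (c)
      (values nil :other-parse-error c))))

;; The end tag does not match the open element, so this is not well-formed.
(try-parse "<a><b></a>")
```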
Serialization
Serialization is performed using sink objects. There are
different kinds of sinks for output to Lisp streams and vectors in
various flavours.
Technically, sinks are SAX handlers that write XML output for SAX
events sent to them. In practice, user code would normally not
generate those SAX events manually, and instead use a function
like dom:map-document or xmls-compat:map-node to serialize an
in-memory document.
In addition to map-document, cxml has a set of
convenience macros for serialization (see below for
with-xml-output, with-element, etc).
Portable sinks:
Function CXML:MAKE-OCTET-VECTOR-SINK (&rest keys) => sink
Function CXML:MAKE-OCTET-STREAM-SINK (stream &rest keys) => sink
Function CXML:MAKE-ROD-SINK (&rest keys) => sink
Only on Lisps with Unicode support:
Function CXML:MAKE-STRING-SINK -- alias for cxml:make-rod-sink
Function CXML:MAKE-CHARACTER-STREAM-SINK (stream &rest keys) => sink
Only on Lisps without Unicode support:
Function CXML:MAKE-STRING-SINK/UTF8 (&rest keys) => sink
Function CXML:MAKE-CHARACTER-STREAM-SINK/UTF8 (stream &rest keys) => sink
Return a SAX serialization handler.
-
The -octet- functions write the document encoded into
UTF-8.
make-octet-stream-sink works with Lisp streams of
element-type (unsigned-byte 8).
make-octet-vector-sink returns a vector of
(unsigned-byte 8).
-
make-character-stream-sink works with character
streams. It serializes the document into characters without
encoding it into an external format. When using these
functions, take care to avoid encoding the result into
an incorrect external format. (Note that characters undergo
external format conversion when written to a character stream.
If the document's XML declaration specifies an encoding, make
sure to specify this encoding as the external format if and when
writing the serialized document to a character stream. If the
document does not specify an encoding, either UTF-8 or UTF-16
must be used.) This function is available only on Lisps with
unicode support.
-
make-rod-sink serializes the document into a vector of
runes without encoding it into an external format.
(On Lisps with Unicode support, the result will be a string;
otherwise, a vector of character codes will be returned.)
The warnings given for make-character-stream-sink
apply to this function as well.
-
The /utf8 functions write the document encoded into
characters representing a UTF-8 encoding.
When using these functions, take care to avoid encoding the
result into an external format for a second time. (Note
that characters undergo external format conversion when written
to a character stream. Since these functions already perform
external format conversion, make sure to specify an external
format that does "nothing" if and when writing the serialized document
to a character stream. ISO-8859-1 external formats usually
achieve the desired effect.)
make-character-stream-sink/utf8 works with character streams.
make-string-sink/utf8 returns a string.
These functions are available only on Lisps without unicode support.
Keyword arguments:
-
canonical -- canonical form, one of NIL, T, 1, 2
-
indentation -- indentation level. An integer or nil.
-
encoding -- the character encoding to use. A string or
keyword. nil is also allowed and means UTF-8.
-
omit-xml-declaration-p -- Boolean. Don't write an XML
declaration.
The following canonical values are allowed:
An internal subset will be included in the result regardless of
the canonical setting. It is the responsibility of the
caller to not report an internal subset for
canonical <= 1, or only notations as required for
canonical = 2. For example, the
include-doctype argument to dom:map-document
should be set to nil for the former behaviour and
:canonical-notations for the latter.
With an indentation level, pretty-print the XML by
inserting additional whitespace. Note that indentation
changes the document model and should only be used if whitespace
does not matter to the application.
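To illustrate how the keyword arguments are passed, here is a small sketch in which the parser's SAX events go straight into a string sink, with no DOM in between. It assumes a Lisp with Unicode support (for cxml:make-string-sink); reserialize-directly is a hypothetical helper name.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defun reserialize-directly (xml-string &rest sink-keys)
  "Parse XML-STRING and serialize it again, passing SINK-KEYS to the sink."
  ;; The sink is the SAX handler, so cxml:parse returns the sink's result.
  (cxml:parse xml-string (apply #'cxml:make-string-sink sink-keys)))

;; No XML declaration in the output.
(reserialize-directly "<a><b/></a>" :omit-xml-declaration-p t)

;; Pretty-printed with two spaces per level; this inserts whitespace.
(reserialize-directly "<a><b/></a>" :indentation 2 :canonical nil)
```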
Macro CXML:WITH-XML-OUTPUT (sink &body body) => sink-specific result
Macro CXML:WITH-NAMESPACE ((prefix uri) &body body) => result
Macro CXML:WITH-ELEMENT (qname &body body) => result
Macro CXML:WITH-ELEMENT* ((prefix lname) &body body) => result
Function CXML:ATTRIBUTE (qname value) => value
Generic Function CXML:UNPARSE-ATTRIBUTE (value) => string
Function CXML:ATTRIBUTE* (prefix lname value) => value
Function CXML:TEXT (data) => data
Function CXML:CDATA (data) => data
Function CXML:DOCTYPE (name public-id system-id &optional internal-subset)
Convenience syntax for event-based serialization.
Example:
(with-xml-output (make-octet-stream-sink stream :indentation 2 :canonical nil)
  (with-element "foo"
    (attribute "xyz" "abc")
    (with-element "bar"
      (attribute "blub" "bla"))
    (text "Hi there.")))
Prints this to stream:
<foo xyz="abc">
  <bar blub="bla"></bar>
  Hi there.
</foo>
Macro XHTML-GENERATOR:WITH-XHTML (sink &rest forms)
Macro XHTML-GENERATOR:WRITE-DOCTYPE (sink)
This macro is a modified version of
Franz' htmlgen that works as a SAX driver for XHTML.
It aims to be a plug-in replacement for the html macro.
xhtmlgen is included as contrib/xhtmlgen.lisp in
the cxml distribution. Example:
(let ((sink (cxml:make-character-stream-sink *standard-output*)))
  (sax:start-document sink)
  (xhtml-generator:write-doctype sink)
  (xhtml-generator:with-html sink
    (:html
     (:head
      (:title "Titel"))
     (:body
      ((:p "style" "font-weight: bold")
       "Inhalt")
      (:ul
       (:li "Eins")
       (:li "Zwei")
       (:li "Drei")))))
  (sax:end-document sink))
Miscellaneous SAX handlers
Function CXML:MAKE-VALIDATOR (dtd root)
Create a SAX handler which validates against a DTD instance.
The document's root element must be named root.
Used with dom:map-document, this validates a document
object as if by re-reading it with a validating parser, except
that declarations recorded in the document instance are completely
ignored.
Example:
(let ((d (parse-file "~/test.xml" (cxml-dom:make-dom-builder)))
      (x (parse-dtd-file "~/test.dtd")))
  (dom:map-document (cxml:make-validator x #"foo") d))
Class CXML:BROADCAST-HANDLER ()
Accessor CXML:BROADCAST-HANDLER-HANDLERS
Function CXML:MAKE-BROADCAST-HANDLER (&rest handlers)
broadcast-handler is a SAX handler which passes every event it
receives on to each of several chained handlers, somewhat similar
to the way a broadcast-stream works.
You can subclass broadcast-handler to modify the events
before they are passed on. Define methods on your handler
class for the events to be modified. All other events will pass
through to the chained handlers unmodified.
Broadcast handler functions return the result of calling the event
function on the last handler in the list. In particular,
the overall result from sax:end-document will be ignored
for all other handlers.
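The subclassing approach described above might look like the following sketch, which upcases character data before it reaches the chained handler. It assumes that character data arrives as a Lisp string (true on Lisps with Unicode support); upcasing-handler and upcase-text are hypothetical names.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defclass upcasing-handler (cxml:broadcast-handler)
  ()
  (:documentation "Illustrative handler that upcases text on its way through."))

(defmethod sax:characters ((handler upcasing-handler) data)
  ;; Assumption: DATA is a string, so STRING-UPCASE applies.
  ;; The next method is the broadcast-handler one, which forwards the event.
  (call-next-method handler (string-upcase data)))

(defun upcase-text (xml-string)
  "Serialize XML-STRING with all character data converted to upper case."
  (let ((handler (make-instance 'upcasing-handler)))
    (setf (cxml:broadcast-handler-handlers handler)
          (list (cxml:make-string-sink :omit-xml-declaration-p t)))
    ;; The result comes from the last (here: only) chained handler.
    (cxml:parse xml-string handler)))

(upcase-text "<a>hello</a>")
```

Events without a method on the subclass, such as start-element here, pass through unchanged.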
Class CXML:SAX-PROXY (broadcast-handler)
Accessor CXML:PROXY-CHAINED-HANDLER
sax-proxy is a subclass of broadcast-handler
which sends events to exactly one chained handler. This class is
still included for compatibility with older versions of
CXML which did not include the more
general broadcast-handler yet, but has been retrofitted
as a subclass of the latter.
Function CXML:MAKE-NAMESPACE-NORMALIZER (next-handler)
Return a SAX handler that performs DOM
3-style namespace normalization on attribute lists in
start-element events before passing them on to the next
handler.
Function CXML:MAKE-WHITESPACE-NORMALIZER (chained-handler &optional dtd)
Return a SAX handler which removes whitespace from elements that
have element content and have not been declared to
preserve space using an xml:space attribute.
Example:
(cxml:parse-file "example.xml"
                 (cxml:make-whitespace-normalizer (cxml-dom:make-dom-builder))
                 :validate t)
Example input:
<!DOCTYPE test [
<!ELEMENT test (foo,bar*)>
<!ATTLIST test a CDATA #IMPLIED>
<!ELEMENT foo (#PCDATA)>
<!ELEMENT bar (foo?)>
<!ATTLIST bar xml:space (default|preserve) "default">
]>
<test a='b'>
<foo> </foo>
<bar> </bar>
<bar xml:space="preserve"> </bar>
</test>
Example result:
<test a="b"><foo> </foo><bar></bar><bar xml:space="preserve"> </bar></test>
Recoders
Recoders are a mechanism used by CXML internally on Lisp implementations
without Unicode support to recode UTF-16 vectors (rods) of
integers (runes) into UTF-8 strings.
User code does not usually need to deal with recoders in current
versions of CXML.
Function CXML:MAKE-RECODER (chained-handler recoder-fn)
Return a SAX handler which passes all events on to
chained-handler after converting all strings and rods
using recoder-fn, a function of one argument.
Caching of DTD Objects
To avoid spending time parsing the same DTD over and over again,
CXML can cache DTD objects. The parser consults
cxml:*dtd-cache* whenever it is looking for an external
subset in a document which does not have an internal subset and
uses the cached DTD instance if one is present in the cache for
the System ID in question.
Note that DTDs do not expire from the cache automatically.
(Future versions of CXML might introduce automatic checks for
outdated DTDs.)
Variable CXML:*DTD-CACHE*
The DTD cache object consulted by the parser when it needs a DTD.
Function CXML:MAKE-DTD-CACHE ()
Return a new, empty DTD cache object.
Variable CXML:*CACHE-ALL-DTDS*
If true, instructs the parser to enter all DTDs that could have
been cached into *dtd-cache* if they were not cached
already. Defaults to nil.
Reader CXML:GETDTD (uri dtd-cache)
Return a cached instance of the DTD at uri, if present in
the cache, or nil.
Writer CXML:GETDTD (uri dtd-cache)
Enter a new value for uri into dtd-cache.
Function CXML:REMDTD (uri dtd-cache)
Ensure that no DTD is recorded for uri in the cache and
return true if such a DTD was present.
Function CXML:CLEAR-DTD-CACHE (dtd-cache)
Remove all entries from dtd-cache.
fixme: thread-safety
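A minimal sketch of how the cache variables might be combined: a fresh cache is bound for the duration of a batch so that documents sharing an external DTD only cause it to be parsed once. The function name is hypothetical, and the sketch says nothing about thread safety.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defun parse-all-with-shared-cache (pathnames)
  "Parse every file in PATHNAMES in validating mode, sharing one DTD cache."
  (let ((cxml:*dtd-cache* (cxml:make-dtd-cache)) ; private cache for this batch
        (cxml:*cache-all-dtds* t))               ; enter every cacheable DTD
    (mapcar (lambda (pathname)
              (cxml:parse-file pathname
                               (cxml-dom:make-dom-builder)
                               :validate t))
            pathnames)))
```

Because entries never expire on their own, binding a fresh cache per batch is one way to avoid stale DTDs; cxml:clear-dtd-cache on a long-lived cache would be another.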
Location information
Class SAX:SAX-PARSER ()
A class providing location information through an
implementation-specific subclass. Parsers will use
sax:register-sax-parser to pass their parser instance to
the handler. The easiest way to receive sax-parser instances is
to inherit from sax-parser-mixin when defining a sax handler.
Class SAX:SAX-PARSER-MIXIN ()
A mixin for sax handler classes that records the sax parser
object for use with the following functions. Trampoline methods
are provided that allow those functions to be called directly on
the sax-parser-mixin.
Function SAX:SAX-PARSER (sax-parser-mixin) => sax-parser
Return the sax-parser instance recorded by this handler, or NIL.
Function SAX:LINE-NUMBER (sax-parser)
Return an approximation of the current line number, or NIL.
Function SAX:COLUMN-NUMBER (sax-parser)
Return an approximation of the current column number, or NIL.
Function SAX:SYSTEM-ID (sax-parser)
Return the URI of the document being parsed. This is either the
main document, or the entity's system ID while contents of a parsed
general external entity are being processed.
Function SAX:XML-BASE (sax-parser)
Return the [Base URI] of the current element. This URI can differ from
the value returned by sax:system-id if xml:base
attributes are present.
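As a rough sketch of the mixin in use, the handler below records the approximate position of each start tag by calling the location functions on itself, relying on the trampoline methods mentioned above. The class element-locator and the function locate-elements are hypothetical; the positions are approximations, as noted above. Depending on the CXML version, handlers may also be expected to inherit from a base handler class, which this sketch does not do.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defclass element-locator (sax:sax-parser-mixin)
  ((positions :initform '() :accessor positions))
  (:documentation "Illustrative handler collecting (qname line column) entries."))

(defmethod sax:start-element ((handler element-locator)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name attributes))
  ;; The mixin lets the location functions be called on the handler itself.
  (push (list qname
              (sax:line-number handler)
              (sax:column-number handler))
        (positions handler)))

(defmethod sax:end-document ((handler element-locator))
  ;; Entries were pushed in reverse document order.
  (reverse (positions handler)))

(defun locate-elements (xml-string)
  "Return a list of (qname line column) for each element in XML-STRING."
  (cxml:parse xml-string (make-instance 'element-locator)))
```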
XML Catalogs
External entities (for example, DTDs) are referred to using their
Public and System IDs. Usually the System ID, a URI, is used to
locate the entity. CXML itself handles only file://-URIs, but
many System IDs in practical use are http://-URIs. There are two
different mechanisms applications can use to allow CXML to locate
entities using arbitrary Public IDs or System IDs:
-
User-defined entity resolvers can be used to open entities using
arbitrary protocols. For example, an entity resolver could
handle all System-IDs with the http scheme using some
HTTP library. Refer to the description of the
entity-resolver keyword argument to parser functions (see cxml:parse-file) for more
information on entity resolvers.
-
XML Catalogs are (local) tables in XML syntax which map External
IDs to alternative System IDs. If, say, the xhtml DTD is
present in the local file system and the local copy has been
registered with the XML catalog, CXML will use the local copy of
the DTD instead of trying to open the version available using HTTP.
This section describes XML Catalogs, the second solution. CXML
implements OASIS
XML Catalogs.
Variable CXML:*CATALOG*
The XML Catalog object consulted by the parser before trying to
open an entity. Initially nil.
Variable CXML:*PREFER*
The default "prefer" mode from the Catalog specification, one
of :public or :system. Defaults
to :public.
Function CXML:MAKE-CATALOG (&optional uris)
Return a catalog object for the catalog files specified.
Function CXML:RESOLVE-URI (uri catalog)
Look up uri in catalog and return the
resulting URI, or nil if no match was found.
Function CXML:RESOLVE-EXTID (publicid systemid catalog)
Look up the External ID (publicid, systemid)
in catalog and return the resulting URI, or nil
if no match was found.
Example:
* (setf cxml:*catalog* nil)
* (cxml:parse-file "test.xhtml" nil)
=> Error: URI scheme :HTTP not supported
* (setf cxml:*catalog* (cxml:make-catalog))
* (cxml:parse-file "test.xhtml" nil)
;; no error!
NIL
Note that parsed catalog files are cached in the catalog object.
Catalog files cached do not expire automatically. To ensure that
all catalog files are parsed again, create a new catalog object.
SAX Interface
A SAX handler is an arbitrary object that implements some of the
generic functions in the SAX package. Note that no default
handler class is necessary, because all generic functions have default
methods which do nothing. SAX functions are:
Function SAX:START-DOCUMENT (handler)
Function SAX:END-DOCUMENT (handler)
Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:END-ELEMENT (handler namespace-uri local-name qname)
Function SAX:START-PREFIX-MAPPING (handler prefix uri)
Function SAX:END-PREFIX-MAPPING (handler prefix)
Function SAX:PROCESSING-INSTRUCTION (handler target data)
Function SAX:COMMENT (handler data)
Function SAX:START-CDATA (handler)
Function SAX:END-CDATA (handler)
Function SAX:CHARACTERS (handler data)
Function SAX:START-DTD (handler name public-id system-id)
Function SAX:END-DTD (handler)
Function SAX:START-INTERNAL-SUBSET (handler)
Function SAX:END-INTERNAL-SUBSET (handler)
Function SAX:UNPARSED-ENTITY-DECLARATION (handler name public-id system-id notation-name)
Function SAX:EXTERNAL-ENTITY-DECLARATION (handler kind name public-id system-id)
Function SAX:INTERNAL-ENTITY-DECLARATION (handler kind name value)
Function SAX:NOTATION-DECLARATION (handler name public-id system-id)
Function SAX:ELEMENT-DECLARATION (handler name model)
Function SAX:ATTRIBUTE-DECLARATION (handler ename aname type default)
Accessor SAX:ATTRIBUTE-PREFIX (attribute)
Accessor SAX:ATTRIBUTE-NAMESPACE-URI (attribute)
Accessor SAX:ATTRIBUTE-LOCAL-NAME (attribute)
Accessor SAX:ATTRIBUTE-QNAME (attribute)
Accessor SAX:ATTRIBUTE-SPECIFIED-P (attribute)
Accessor SAX:ATTRIBUTE-VALUE (attribute)
Function SAX:FIND-ATTRIBUTE (qname attributes)
Function SAX:FIND-ATTRIBUTE-NS (uri lname attributes)
The entity declaration methods are similar to Java SAX
definitions, but parameter entities are distinguished from
general entities not by a % prefix to the name, but by
the kind argument, either :parameter or
:general.
The arguments to sax:element-declaration and
sax:attribute-declaration differ significantly from their
Java counterparts.
fixme: For more information on these functions refer to the docstrings.
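To make the interface concrete, here is a minimal sketch of a handler that implements only two of the generic functions listed above and collects the values of href attributes. Following the text, it does not inherit from any handler class; some CXML versions may prefer a base class and warn otherwise. The names href-collector and collect-hrefs are hypothetical, and comparing the result as strings assumes a Lisp with Unicode support.

```lisp
;; Assumes CXML is already loaded, for example via (ql:quickload "cxml").

(defclass href-collector ()
  ((hrefs :initform '() :accessor hrefs))
  (:documentation "Illustrative handler gathering href attribute values."))

(defmethod sax:start-element ((handler href-collector)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname))
  ;; sax:find-attribute returns the attribute object, or NIL if absent.
  (let ((attribute (sax:find-attribute "href" attributes)))
    (when attribute
      (push (sax:attribute-value attribute) (hrefs handler)))))

(defmethod sax:end-document ((handler href-collector))
  ;; The value returned here becomes the result of the parse call.
  (reverse (hrefs handler)))

(defun collect-hrefs (xml-string)
  "Return the href attribute values in XML-STRING, in document order."
  (cxml:parse xml-string (make-instance 'href-collector)))

(collect-hrefs "<r><a href=\"x\"/><a href=\"y\"/></r>")
```

All events without a method on href-collector fall through to the default methods, which do nothing.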