XOM Release Notes

XOM is a new XML object model. It is an open source (LGPL), tree-based API for processing XML with Java that strives for correctness and simplicity.

1.2.10

Support the built-in Android parser.

1.2.9

Exclude org.w3c.dom from Jaxen files we copy in to avoid problems with some application servers.

Upgrade Jaxen to 1.1.6 to fix some IEEE-754 bugs involving -0.

1.2.8

Upgraded to Jaxen 1.1.4 to fix several XPath bugs involving function resolution and Java 7 compatibility.

1.2.7

Canonical XML 1.1

1.2.6

Fixes a bug that doubled query strings in base URLs.

Upgraded to Jaxen 1.1.3 to fix an XPath bug evaluating relational operators when one of the operands was a text, comment, or processing instruction node.

1.2.5

Throws NullPointerException instead of MalformedUriException when a null Reader is passed to Builder.build().

Maven 2 support

1.2.4

More automatic deploy process.

Fixed maven targets.

Slight optimization to XPath by combining two loops.

1.2.3

Bug fix for some obscure corner cases.

1.2.2

This release focuses on improved packaging with Maven and OSGI. Otherwise, no visible changes.

1.2.1

A very minor release that now prints the correct version number when you execute the JAR archive by typing java -jar xom.jar

1.2

The 1.2 release fixes a number of bugs, especially in canonicalization and XPath. However there's at least one bug fix in the core so I recommend all users upgrade. XOM 1.2 should be fully backwards compatible with code written to 1.0 and 1.1 APIs. 1.2 should also be somewhat easier to compile and edit due to various changes with UnicodeUtil and Jaxen. Actual new features in this release are fairly minor and include:

Latest Unicode normalization tables.
Upgraded to Jaxen 1.1.2
xml:id attributes no longer checked for NCNames
Upgraded to Xerces 2.8.0, DTD-only version
DOMConverter can accept a NodeFactory to be used in creating the XOM document
A lookup method in XPathContext that finds the namespace URI for a prefix.

1.1

New features implemented since 1.0 include:

XPath
a setInternalDTDSubset method in DocType
Document subset canonicalization
Exclusive XML canonicalization
xml:id support
Parameters can be passed to XSL transforms
Entity declarations are preserved in the internal DTD subset.

Memory usage has been reduced, and performance has improved by a factor of two to four for some common operations. In addition, some bugs have been fixed in XOMTestCase and in the handling of a few edge conditions in the internal DTD subset. Furthermore, 1.1 works around quite a few more bugs in Crimson.

1.0

Essentially the same as Beta 11. The README file was improved slightly and all version numbers in the JavaDoc have been upgraded to 1.0. A number of small edits have been made to the API documentation. The only API-level change is that the deprecated setNodeFactory method in XSLTransform has been removed.

1.0b11/RC5

Beta 11 is the fifth release candidate. It restores the three servlet samples (FibonacciServlet, FibonacciSOAPServlet, and FibonacciXMLRPCServlet) but uses Ant conditions to only compile these files if the servlet classes are present. It also adds README, LICENSE, and LGPL files to the core distribution rather than simply placing these on the web site. Finally, http://www.cafeconleche.org/XOM/ has been replaced by http://www.xom.nu/ in the source code and documentation. The core API has not changed at all.

1.0b10/RC4

Beta 10 is the fourth release candidate. It removes three samples (FibonacciServlet, FibonacciSOAPServlet, and FibonacciXMLRPCServlet) to avoid having to distribute servlet.jar with XOM. It also modifies the Ant build file so the tools package is not compiled except when generating the betterdoc target. This makes the complete distribution more self-contained and easier to build. The core API has not changed at all.

1.0b9/RC3

Beta 9 is the third release candidate. It adds a few more unit tests and fixes some packaging issues that were bedeviling Windows systems. (The zip and tar files no longer contain any test files whose names are legal on Unix but illegal on Windows.) Barring discovery of any last-minute bugs, this will be XOM 1.0. No further optimizations or fixes are planned before 1.0. All the changes are restricted to the tests package. The core API has not changed at all.

1.0b8/RC2

Beta 8 is the second release candidate. Barring discovery of any last-minute bugs, this will be XOM 1.0. No further optimizations or fixes are planned before 1.0. Changes in this release include:

The TagSoup and servlet JARs are no longer bundled. They're not needed to run XOM, just for one of the samples and for the JavaDoc
A few more optimizations to speed up the checking of namespace URIs, and a variety of other operations.

1.0b7/RC1

Beta 7 is the first release candidate. There are still a few open issues with regard to error handling in XInclude that require clarification from the XInclude working group. If they decide that how XOM currently behaves is correct, then XOM 1.0 is essentially complete. If they decide to require different behavior a few changes may yet need to be made.

Changes in this release include:

Builder is considerably more robust against buggy parsers. It converts all runtime exceptions thrown by such a parser (including XOM XMLExceptions thrown by a NodeFactory) into ParsingExceptions. It uses a verifying factory for Saxon 7's AElfred derivative.
Comment data is now allowed to begin with a hyphen.
XIncluder treats bad encoding attributes as fatal errors
Various optimizations have sped up a lot of common operations including getValue(), toXML(), DOM and SAX conversion, canonicalization, and XSL transformation by roughly a factor of two.
The zip archives and CVS no longer contain files with names that are problematic on Windows.
The manifest file is now versioned.
In keeping with the recommendation in RFC2396bis that "For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings", XOM now uses uppercase percent encodings for base URIs. There may still be a few places where lower case escapes are used. Holler if you spot any.
Fixed bug where base URIs were not encoded in UTF-8 on all platforms. Mac OS X 10.3 was the particular offender here. Surprisingly the problem did not manifest on Mac OS X 10.2.
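To make the two base URI items above concrete, here is a minimal sketch of percent-encoding with uppercase hexadecimal digits and an explicit UTF-8 conversion. It is not XOM's actual code. The PercentEncoder class and its encode method are hypothetical names, and the rule for which characters get escaped is deliberately simplified (everything outside printable ASCII, plus the space).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XOM.
final class PercentEncoder {

    // Uppercase digits, as RFC2396bis recommends for percent-encodings.
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Simplified rule (an assumption): escape the space and every byte
    // outside printable ASCII; pass everything else through unchanged.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        // Name the charset explicitly so the result does not depend on
        // the platform default encoding.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v > 0x20 && v < 0x7F) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("caf\u00E9 menu.xml"));
    }
}
```

Relying on the platform default charset instead of naming UTF-8 is the kind of thing that can produce different escapes on different operating systems.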

1.0b6

Beta 6 is primarily a bug fix release. It also polishes off some rough edges in various corners of the API. Changes in this release include:

The deprecated setNodeFactory() method in XSLTransform has been removed. This is the only API-level change in this release.
The strings returned by toString in Comment, ProcessingInstruction, Attribute, and Text are all now truncated if they get too long. Furthermore any embedded line breaks and tabs are escaped as \n, \r, and \t. This makes the objects easier to inspect in various debuggers and loggers.
SAXConverter no longer converts XOM xml:base attributes into SAX attributes. Instead the xml:base attributes are used to determine the URI information the Locator reports. Providing xml:base attributes as well would risk double counting some relative URLs.
Fixed bug where carriage returns in internal entity replacement text in the internal DTD subset were not properly escaped on reserialization.
Fixed bug where carriage returns, less than signs, double quotes, and ampersands in attribute default values in the internal DTD subset were not properly escaped on reserialization.
Fixed a number of bugs in converting file names to base URIs
Improved compatibility with Turkish locales that do not see I as the upper case form of i or vice versa.
Fixed a bug in Serializer that did not always properly trim whitespace
Hid the error messages logged by Xerces and Xalan on System.err when deliberately testing error conditions. Therefore, there should be no output from the test cases when all tests pass.
Added a junithtml build target to convert JUnit results to HTML.
The Ant build file now specifies that the input encoding of all .java files is UTF-8. Most files are pure ASCII, but there are a couple of places where non-ASCII characters are used.
Unit test coverage has been improved.
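For anyone who hasn't run into the Turkish locale issue mentioned above, the following sketch shows the underlying JDK behavior. It says nothing about where XOM was affected or how it was fixed. The CaseFolding class and its method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical demo class, not part of XOM.
final class CaseFolding {

    // Locale-sensitive case mapping. In a Turkish locale, "I" lower-cases
    // to the dotless i (U+0131) rather than to "i".
    static String lowerIn(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Locale-independent case mapping, suitable for protocol keywords
    // such as encoding names.
    static String lowerStable(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(lowerIn("ISO", turkish));
        System.out.println(lowerStable("ISO"));
    }
}
```

Code that compares names case-insensitively using the default locale can therefore behave differently on a machine configured for Turkish.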

1.0b5

Beta 5 primarily focuses on fixing bugs in XInclude and improving performance of builders when reading from files. It also deprecates the setNodeFactory() method in XSLTransform which will be removed in the next release. In its place, there's a new constructor:

public XSLTransform(Document stylesheet, NodeFactory factory)

Finally, the four XSLTransform constructors deprecated in the last release have been removed.

1.0b4

1.0b4 primarily focuses on fixing bugs and improving performance in the converters and XSLT package. XSLT transformation can now work directly from a XOM Document without an intermediate step that serializes the Document as a string. Consequently, these four constructors in XSLTransform have been deprecated and will be removed in the next release:

public XSLTransform(InputStream stylesheet)
public XSLTransform(Reader stylesheet)
public XSLTransform(String URL)
public XSLTransform(File stylesheet)

Other changes include:

SAXConverter can now convert Nodes lists as well as Documents.
SAXConverter now sets a Locator that provides system IDs for individual elements.
The toXML methods now use \n as the line separator, since this is more likely to match the contents of text nodes created by parsing an XML document. The goal is to minimize the number of documents with mixed line break strings.
Fixed a bug in DOMConverter that threw a NullPointerException when converting XOM documents with only a single element to DOM.
Line breaks in the internal DTD subset are handled more reliably.

1.0b3

The primary impetus for beta 3 is fixing a few bugs in the DOMConverter. Also, Java encoding names like "8859_1" are now recognized when using the repackaged Xerces bundled with Java 1.5. I also spell checked the comments. :-)

1.0b2

The primary impetus for beta 2 is fixing some bugs that prevented the XOM-specific parsers from being loaded in Java 1.5 when the standard Xerces (as opposed to the Java 1.5 bundled Xerces) was not in the classpath.

This release also makes the JavaDoc well-formed (and possibly valid, I haven't checked) XHTML.

1.0b1

Beta 1 is feature and code complete. There are no known bugs in XOM. All that remains to be done is finishing the documentation and doing some minor code clean-ups. These include such housekeeping tasks as splitting long lines, spell checking the comments, and making sure the Javadoc is all valid XHTML. None of this should have any effect on client code. XOM is now believed to be ready for serious, production use.

Unless new bugs are uncovered, this may be the one and only beta release. Possibly I'll do some profiling runs to see if there are any more areas where I can save some memory or speed up some operations. Barring that, all that's needed before the final 1.0 release is finished documentation.

Beta 1 makes no backwards incompatible changes to the published API. Changes since the final alpha include:

The XInclude test suite is loaded and run from the W3C CVS server if it's not installed locally. Mistakes in the test suite (mostly involving document type declarations) are corrected on the fly.
Work-arounds for various JDK bugs that prevent round-tripping of some characters in Japanese encodings
Work-arounds for bugs in some versions of Xalan, as well as for bugs in the OASIS XSLT conformance test suite.
Improved compatibility with Java 1.5

1.0a5

1.0a5 makes no backwards incompatible changes to the published API. Changes since the previous release include:

The ParsingException and ValidityException classes now have a getURI() method that returns the URI of the document whose error caused the exception.
Test suite now runs OASIS Microsoft and Xalan XSLT tests
Improved compatibility with Java 1.2
Improved compatibility with recent releases of Xalan, including those bundled with JDK 1.4.2_03 and later

1.0a4

1.0a4 makes no backwards incompatible changes to the published API. Changes since the previous release include:

Nodes.remove(int) now returns the node removed.
The IBM virtual machine 1.4.1 is no longer special cased.
The API documentation has undergone extensive editing.
The unpublished nu.xom.xerces package has been removed.

1.0a3

1.0a3 makes no backwards incompatible changes to the published API. It adds one new protected method. Changes since the previous release include:

The Element copy constructor and copy methods are no longer recursive, so they shouldn't cause stack overflows in deep documents. This necessitated adding a protected shallowCopy() method that can be used to create an instance of a subclass of Element. Overriding this is preferred to overriding copy() when one wishes to maintain the objects' types after a copy.
The getBaseURI() method is also no longer recursive.
The W3C XML Schema Language and WML and HTML DOMs have been removed from the bundled version of Xerces to save space.
XOM now uses character references only when necessary for all encodings supported by the local virtual machine. However, this may be quite a bit slower than the explicitly supported encodings like UTF-8 and the ISO-8859 character sets. Measurements remain to be performed.
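The idea of escaping only what the output encoding cannot represent can be sketched with the JDK's CharsetEncoder. This is an illustration under my own assumptions, not the Serializer's real implementation. EncodingAwareEscaper and escapeFor are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Locale;

// Hypothetical helper, not part of XOM.
final class EncodingAwareEscaper {

    // Writes each character literally if the named encoding can represent
    // it, and as a hexadecimal character reference otherwise.
    static String escapeFor(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            String ch = text.substring(i, i + len);
            if (encoder.canEncode(ch)) {
                sb.append(ch);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Latin small e with acute fits in ISO-8859-1; Greek alpha does not.
        System.out.println(escapeFor("caf\u00E9 \u03B1", "ISO-8859-1"));
    }
}
```

Asking the encoder about every character is general but costs a call per character, which is one plausible reason a generic path like this would be slower than a hand-written check for UTF-8 or the ISO-8859 sets.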

1.0a2

1.0a2 makes no changes to the published API. Behavioral changes since the previous release include:

URI verification and base URI resolution are now performed according to the RFC2396bis algorithm, rather than by using the Xerces and java.net URI classes.
The Builder no longer sets any Java system properties for improved compatibility with applets and multiclassloader environments.
A bug in DOMConverter was fixed

1.0a1

1.0a1 is the first alpha release of XOM. The API is now considered to be reasonably stable and frozen. I may add to the API in the future, but the current API will not change without a very good reason. Most features should work pretty much as intended. There are no API changes since 1.0d25. Behavioral changes since the previous release include:

XOM now fully supports the 2nd candidate recommendation syntax for XInclude, including preservation of xml:lang values.
The base URI handling has been modified as follows:
1. getBaseURI() always returns an absolute URI or the empty string if the base URI is not known. Other than the empty string it never returns a relative URI. It never returns null.
2. The base URI of an element does not change when it is detached or copied.
3. The setBaseURI() method only accepts an absolute URI. It throws a MalformedURIException if you attempt to pass it a relative URI, or a URI with a fragment identifier. (Relative URIs are still allowed in xml:base attributes.)
XOM will not double verify when being fed data through Norm Walsh's catalog filter, provided that the underlying parser is good.
Constraints on parentage are not checked when building with NonVerifyingFactory.
DOMConverter and several methods have been rewritten with non-recursive algorithms. Some work remains to be done in this area, however.
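The setBaseURI() rule listed above (absolute URIs only, no fragment identifier) can be approximated as follows. Note that XOM does its own RFC2396bis-based checking, so java.net.URI here is only a stand-in, and the BaseUriRules class and its method are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of XOM. java.net.URI is used as a
// stand-in for XOM's own URI verification.
final class BaseUriRules {

    // True if the string parses as a URI, has a scheme (is absolute),
    // and carries no fragment identifier.
    static boolean isAcceptableBaseUri(String candidate) {
        try {
            URI uri = new URI(candidate);
            return uri.isAbsolute() && uri.getRawFragment() == null;
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableBaseUri("http://www.example.com/a.xml"));
        System.out.println(isAcceptableBaseUri("a.xml"));
        System.out.println(isAcceptableBaseUri("http://www.example.com/a.xml#part"));
    }
}
```

In XOM itself a rejected value results in a MalformedURIException rather than a boolean.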

There appear to be some bugs in Sun's JDK 1.4.2_03 that break about 5 or 6 of the unit tests. All tests pass with JDK 1.4.2_02 and JDK 1.5.0a1. Ant 1.5.x is required to build XOM. I have been unable to get the tests to run with Ant 1.6, and the Ant developers seem actively hostile to any reports about this issue.

1.0d25

1.0d25 is the second last call release of XOM. I had planned for this to be alpha 1 and API freeze. However, enough changes proved necessary since the last release that I decided to make this 1.0d25 instead. Anything that didn't change since the last release is probably pretty stable. However, there have been some new changes in this release that are worth reviewing and may change again:

All 21 protected checkFoo methods have been removed. Instead the various mutator methods (setters and other methods that change the state of an object) are now non-final so they can be overridden. The getter methods are still final and the fields are all private. Thus to change the state of an object, overridden setter methods will still need to call the constraint-verifying superclass methods. This should give subclasses a lot more flexibility while not compromising on well-formedness.
The Serializer now throws UnavailableCharacterException, a subclass of XMLException, instead of a raw XMLException when it encounters a character it can neither write nor escape in the current encoding.
NodeFactory.makeDocument has been renamed startMakingDocument. NodeFactory.endDocument has been renamed finishMakingDocument.
Added a method to DOMConverter that converts a DocumentFragment to a Nodes.
Added XSLTransform.toDocument() method that converts a Nodes to a Document.
Element.removeChildren() now returns a Nodes object containing the children removed.
The LeafNode class has been removed. DocType, Text, Comment, and ProcessingInstruction now directly extend Node.
Removed the hasChildren method from Element, Node, ParentNode, Attribute and Document.
Element.addAttribute is declared to throw the more specific MultipleParentException instead of IllegalAddException

There are also several changes that do not affect the API:

ParentNode.replaceChild() will not remove the old child unless it can insert the new child. It can no longer do one but not the other.
Document.replaceChild now allows replacing of the DocType by another DocType or the root element by another element
Many methods including getValue and toXML have been rewritten using non-recursive algorithms so they are no longer limited by Java's stack size. The samples package includes an example of a non-recursive serializer.
Much better testing of canonicalizer. I am now fairly convinced it is correct in all or almost all cases.
Line breaks are now used between declarations in internal DTD subset
The JAR is compiled without debugging symbols to save space. (These can be turned on again easily enough in build.xml if anyone needs them.)
Added a XOMSamples.jar archive that includes all the sample code
The core JAR archive is sealed.
The API documentation has been thoroughly proof-read from start to finish.
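The general technique behind the non-recursive rewrites mentioned above is to replace the call stack with an explicit stack on the heap. The sketch below shows this for a getValue-style traversal over a deliberately tiny, hypothetical Node class. It is not XOM's code, and the samples package has the real non-recursive serializer example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not part of XOM.
final class IterativeValue {

    // Minimal stand-in for a tree node: a text leaf or a container.
    static final class Node {
        final String text;                       // null for containers
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Concatenates all text in document order without recursion.
    static String getValue(Node root) {
        StringBuilder sb = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.text != null) sb.append(n.text);
            // Push children in reverse so the first child is popped first.
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        return sb.toString();
    }

    // Builds a chain of nested containers ending in one text leaf.
    static Node deepChain(int depth, String leaf) {
        Node root = new Node(null);
        Node current = root;
        for (int i = 0; i < depth; i++) {
            Node next = new Node(null);
            current.add(next);
            current = next;
        }
        current.add(new Node(leaf));
        return root;
    }

    public static void main(String[] args) {
        // Deep enough that a naive recursive version would likely overflow.
        System.out.println(getValue(deepChain(200000, "deep")));
    }
}
```

The depth the traversal can handle is then bounded by heap size rather than by the thread's stack size.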

1.0d24

1.0d24 is a very fast release to fix a bug that prevented 1.0d23 from being used in multi-classloader environments like Tomcat. A couple of bugs that prevented some of the test cases from successfully completing on Windows have also been fixed, a bug in the FibonacciServlet sample was corrected, and some of the documentation has been improved. The API has not changed at all. XOM is still in "last call".

1.0d23

This is the last call, pre-alpha release of XOM. My plan is that the next release will be the official API freeze for 1.0. While nothing is written in stone, I do plan to strenuously resist any backwards incompatible changes in the API after the next release (1.0a1). If you have any concerns about the API, now is the time to get them in.

There are several backwards incompatible changes in this release. Most notably, the various makeNode() methods in the NodeFactory class all return Nodes objects. This means a factory can replace one node type with a different node type (e.g. changing elements into attributes and vice versa) or replace a single node with several nodes.

Other changes that may require code modifications include:

Attribute.Type.toXML is now Attribute.Type.getName(). This was necessary to be consistent with handling attributes of type ENUMERATION, which is not a DTD keyword though it is referenced in the Infoset.
Support for the November 2003 Working Draft syntax of XInclude, including the xpointer, accept, accept-charset, and accept-language attributes. Documents will need to be rewritten to use the new syntax. In keeping with the terminology in the new working draft, MissingHrefException has been renamed NoIncludeLocationException. CircularIncludeException has been renamed InclusionLoopException. The methods that resolve Nodes objects have been marked private.
NamespaceException has been broken up. IllegalNameException is used for problems with a namespace prefix. MalformedURIException is used for problems with a namespace URI. NamespaceConflictException, a subclass of WellformednessException, is used for cases where attributes, elements, and/or additional namespace declarations have conflicting bindings for the same prefix.
Removed NodeFactory's makeWhiteSpaceInElementContent() method
Removed no-args constructors from the various exception classes.

More or less backwards compatible changes in 1.0d23 include

IllegalDataException and its subclasses have getData and setData methods to get and set the exact text that caused the exception. Subclasses include IllegalNameException, IllegalTargetException, and IllegalCharacterDataException. IllegalCharacterDataException is now used where IllegalDataException was used previously.
XOMTestCase is part of the published API.
Factory methods are now invoked in document order. Previously this wasn't true for text nodes, which weren't flushed until after the next tag, processing instruction, etc. This was necessary to enable text nodes to be maximally contiguous, though in fact they might not be if the factory returned several text nodes in a row for non-text nodes. In any case, with the default factory, or with a custom factory that does not remove any nodes or change their base types (e.g. Comment to Text), text nodes still hold the maximum possible contiguous run of text after a build.
Added support for GB18030 (Chinese) and ISO-8859-11/TIS-620 (Thai) encoding on output (requires Java 1.4)
Verifier is now based on table lookup.
All JDOM code has been removed.
Serialization speed-ups for Non-Unicode, non-Latin-1 encodings
It is now possible to supply a NodeFactory to XSLTransform to be used for constructing nodes in the result tree
Improved support for IBM JVM 1.4.1
The Nodes class now has insert and remove methods, in addition to append.
Added NoSuchAttributeException for parallelism with NoSuchChildException
Unit tests have been dramatically expanded. There are now over 700 separate test methods, many of which perform several tests.
No longer allow the namespace URI http://www.w3.org/XML/1998/namespace to have any prefix other than xml, per conformance with the namespaces erratum
Allow the xml: prefix (with the right URI) to be used on elements per conformance with the namespaces recommendation
Better exception messages when name and namespace arguments are swapped
getBaseURI returns null if the base URI can't be determined due to a malformed xml:base attribute.
Upgraded to Xerces 2.6.1 and Xalan 2.5.2
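The two xml prefix rules in the list above boil down to one symmetric check: the prefix xml and the XML namespace URI may only be bound to each other. A minimal sketch, with a hypothetical XmlPrefixRule class that is not part of XOM:

```java
// Hypothetical helper, not part of XOM.
final class XmlPrefixRule {

    static final String XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace";

    // A binding is legal when both sides are the reserved pair or when
    // neither side is. Other well-formedness checks are out of scope here.
    static boolean isLegalBinding(String prefix, String uri) {
        boolean reservedPrefix = "xml".equals(prefix);
        boolean reservedUri = XML_NAMESPACE.equals(uri);
        return reservedPrefix == reservedUri;
    }

    public static void main(String[] args) {
        System.out.println(isLegalBinding("xml", XML_NAMESPACE));
        System.out.println(isLegalBinding("pre", XML_NAMESPACE));
    }
}
```

XOM reports violations with exceptions rather than a boolean. The boolean form is used here only to keep the sketch short.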

And of course numerous bugs have been fixed, especially in XInclude.

1.0d22

This release collects numerous small new features, refactorings, renamings, unit tests, sample programs, and bug fixes. Many programs will need minor modifications and recompilation to work against this release. Visible changes include:

NodeList has been renamed Nodes.
ParseException has been renamed ParsingException to avoid a conflict with java.text.ParseException
The preserveBaseURI() method in Serializer has been renamed setPreserveBaseURI() in keeping with JavaBeans naming conventions.
Carriage returns are no longer allowed in comment and processing instruction data because they can't be roundtripped. (Character references aren't resolved inside comment and processing instruction data, and the parser will normalize literal carriage returns into linefeeds.)
Initial white space is no longer allowed in processing instruction data because this cannot be roundtripped.
The translate methods in DOMConverter have been renamed convert()
DOMConverter can now convert individual DOM nodes into XOM objects. It is no longer limited to converting entire documents.
ValidityException now has a getDocument() method which returns the complete well-formed but invalid document. It also has getValidityError(int n), getLineNumber(int n), and getColumnNumber(int n) methods which return information about the successive validity errors in the document.
Numeric character references now use upper case.
In Serializer, writeMarkup has been renamed writeRaw and writeText has been renamed writeEscaped since in subclasses these may not actually be writing markup.
Much more fine-grained control of serialization from subclasses using several new methods including writeXMLDeclaration(), writeStartTag(), and writeEmptyElementTag().
Added an option to serialize using Unicode normalization form C.
Added a protected getColumnNumber() method to Serializer to assist subclasses that want to implement their own line breaking strategies.
Can now specify a Builder to be used when XIncluding
More XPointer syntax errors are detected when XIncluding
Java encoding names such as ISO8859_1 are now recognized on input if Xerces is the parser.
XIncludeException (and its subclasses) can now report the URI of the document where the problem was detected
Upgraded to Xerces 2.6 nightly build to fix bug involving relative URL resolution in documents loaded from redirected URLs
Added unit tests for SAXConverter
Added DatabaseBuilder sample based on Example 8-13 from Processing XML with Java
Silently preserve CDATA sections from parse to output when possible.
Added SourceCodeGenerator sample program that converts a well-formed XML document into the XOM statements necessary to create the document.
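The reason carriage returns in comments cannot be round-tripped is easy to see with any conforming parser. The sketch below uses the JDK's built-in DOM parser rather than XOM, purely as a demonstration of XML line-end normalization. The LineEndDemo class is a hypothetical name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of XOM.
final class LineEndDemo {

    // Parses the document and returns the data of the first child of the
    // root element, which is assumed to be a comment.
    static String firstCommentData(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getFirstChild().getNodeValue();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse sample document", ex);
        }
    }

    public static void main(String[] args) {
        String data = firstCommentData("<r><!--a\rb--></r>");
        // The literal carriage return comes back as a line feed.
        System.out.println(data.indexOf('\r') < 0 && data.indexOf('\n') >= 0);
    }
}
```

In text and attribute values a carriage return can be protected with a character reference, but character references are not recognized inside comments or processing instructions, so there is no way to write one that survives parsing.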

This is probably the last version that will support the old, XInclude 2002 Candidate Recommendation syntax. The next release will likely support the new 2003 Working Draft syntax.

1.0d21

This release collects a number of small changes, refactorings, and bug fixes. Most programs should continue to work as they did previously without modification or recompilation. Visible changes include:

Added protected checkDetach method in Node which subclasses can override to prevent or track nodes being detached.
The copy method is no longer final in the various node classes such as Element. Subclasses should override this method to return an instance of the specific subclass.
Cycles (an element acting as its own parent or ancestor) are no longer allowed. Attempting to create one throws a CycleException.
NodeFactory.makeDocument() no longer takes an Element as an argument. It is the responsibility of the NodeFactory to construct a suitable root element. However, when parsing this will quickly be replaced by the actual root element.
Serializer.setIndent throws an IllegalArgumentException for negative values
Fixed bug where line breaks would be added if indenting, even in elements for which xml:space="preserve"
XInclude now consistently treats XPointers that don't match any subresource as resource errors, rather than including nothing.
xml:base attributes added to XIncluded elements no longer have fragment IDs
A couple more XPointer syntax errors are now detected when XIncluding
In XIncludeException the getRootCause and setRootCause() methods have been replaced by initCause() and getCause().
The initCause method in the various exception classes now behaves much more consistently with its definition in Java 1.4.
XSLException no longer extends XMLException. This means it is now a checked exception instead of a runtime exception.
Xalan 2.5.1 has replaced Saxon 6.5.2 as the bundled XSLT processor due to a bug in Saxon that incorrectly reported document fragments resulting from XSL transforms.
Minor usability improvements and code cleanups in the build.xml file
Added an overview page to the API docs

1.0d20

This release adds a workaround for Java's broken, non-conformant handling of file: URLs on Windows. The problem manifested itself as an inability to resolve relative URLs in documents built with the Builder.build(File) method. This caused the failure of a couple of dozen unit tests. Unix users were not affected (which is why I didn't notice the problem sooner). There are no API-level changes in this release.

The JAR archive is no longer compressed, which means a larger JAR archive but faster class loading on initial startup.

1.0d19

The major API level change in XOM 1.0d19 is in NodeFactory. makeElement has been renamed startMakingElement and endElement has been renamed finishMakingElement. startMakingElement behaves the same as the old makeElement. However, finishMakingElement now has a slightly different contract. If it returns null, the entire element is deleted from the tree. It is no longer necessary to explicitly call detach. If it returns a different element than the one passed to it, then the old element is deleted from the tree and the new one is inserted in its place. This is more consistent with the other methods in this class. Return the node you want added to the tree, or null for no node at all.

The second big change has no API-level impact. By default, the Serializer and toXML methods now use numeric character references to escape all tabs, carriage returns, and line feeds in attribute values and all carriage returns in text nodes. This helps make round tripping more reliable and robust. However, if the user indicates that white space is not significant by calling either setMaxLength or setIndent, then these characters may not be preserved. If the client calls setLineSeparator, then tabs will still be preserved but carriage returns and line feeds may not be.
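To see why this matters: a parser normalizes literal tabs, carriage returns, and line feeds in attribute values to spaces, whereas the same characters written as character references survive. A minimal sketch of the escaping step for attribute values follows. AttributeEscaper is a hypothetical class, and the exact form of the references (hexadecimal, two digits) is my assumption rather than a statement of what the Serializer emits.

```java
// Hypothetical helper, not part of XOM.
final class AttributeEscaper {

    // Escapes white space that attribute value normalization would
    // otherwise turn into spaces, plus the usual markup characters.
    static String escapeAttributeValue(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\t': sb.append("&#x09;"); break;
                case '\n': sb.append("&#x0A;"); break;
                case '\r': sb.append("&#x0D;"); break;
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '"':  sb.append("&quot;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAttributeValue("col1\tcol2\r\n"));
    }
}
```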

There are also several minor improvements and bug fixes:

The unit test code is cleaner, but still needs a lot of work.
The Node.equals() method now executes in about half the time it took in previous releases.
Characters from Planes 1 to 15 are now escaped correctly by the serializer
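Characters outside the Basic Multilingual Plane are stored in a Java String as two surrogate chars, so an escaper has to work on code points rather than chars to produce a single correct character reference. The sketch below shows the code point approach. It is not the serializer's actual code, and CodePointEscaper is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical helper, not part of XOM.
final class CodePointEscaper {

    // Escapes every non-ASCII character as one hexadecimal character
    // reference, treating a surrogate pair as a single code point.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
            // Advance by two chars for a surrogate pair, one otherwise.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, a Plane 1 character.
        System.out.println(escapeNonAscii("clef: \uD834\uDD1E"));
    }
}
```

Escaping each surrogate char separately would instead yield two references to code points that are not legal XML characters.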

1.0d18

1.0d18 adds one minor new feature and one major new feature. The minor feature is that nu.xom.tests.XOMTestCase is now public. This class is very useful for comparing two documents or pieces thereof for deep equality. For example, I use it to compare the actual output of the XInclude test cases to the expected outputs. I'm still working on the API and detailed behavior, but I think it's solid enough to be useful for other people's unit testing.

Now the major feature, and this one's way cool: It is now possible to subclass NodeFactory in order to filter and/or stream your processing. XOM can now handle documents of effectively arbitrary size with only slightly more memory use than the underlying SAX parser! I really need to write an article about this style of mixed tree/stream processing, but in the meantime here are the key things you need to know:

To enable filtering or streaming, install your own NodeFactory subclass with the Builder. I've added a couple of constructors to Builder to make this easier.
NodeFactory has one makeNode method for each of XOM's node types. You must return a node of the requested type, but you can change its name, namespace, value, or other characteristics before doing so.
You can eliminate a node from the document simply by returning null from the makeNode method. This saves both the memory needed to store the node and the time required to build it.
To process one element at a time, override endElement() in NodeFactory. This supports streaming. Before the builder calls this method, it has completely built the element with all its content. The usual XOM methods all work on it. You do not have to process every element in order to process some. You can do a quick check on the name and namespace of the element (or other characteristics) to figure out what you want to do with it. If you don't want to process the element, just return. For example, an XHTML spider could easily look at each a element and ignore all the other elements in the document. Indeed it wouldn't even have had to build them or any of their content in the first place.
If you only need to process an element once, put your processing in the endElement() method and detach() it when you're done. As long as you haven't stored a reference to it somewhere, the element can then be garbage collected as needed. This is how XOM processes documents larger than available memory. This is sort of like SAX callbacks, except it's much more convenient because you have the entire element to work with. You do not need to build a custom data structure to hold onto the content until you're ready to work with it. The element is its own data structure.
Most importantly, if you don't care about all this, you can ignore it. It has no impact on the rest of the API. Adding this functionality just required two new protected methods in NodeFactory and two new constructors in Builder. The rest of the API is unchanged. You can forget about it until you need it.
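
For contrast, here is roughly what the XHTML spider mentioned above looks like when written directly against SAX, using only the parser bundled with the JDK. This is a sketch for comparison, not XOM code: the SaxLinkSpider and LinkHandler names are made up for illustration. Note how the handler has to carry its own state (the current href and a text buffer) from one callback to the next. That bookkeeping is what you avoid when the builder hands you a fully built element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxLinkSpider {

    // Collects "href -> link text" for each a element. With plain SAX the
    // handler only sees one event at a time, so it must remember where it is.
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();
        private StringBuilder text;   // non-null only while inside an a element
        private String href;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equals(qName)) {
                href = atts.getValue("href");
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("a".equals(qName)) {
                links.add(href + " -> " + text);
                text = null;
            }
        }
    }

    static List<String> findLinks(String xhtml) throws Exception {
        LinkHandler handler = new LinkHandler();
        // The default factory is not namespace aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><p>See <a href=\"http://www.cafeconleche.org/\">Cafe con Leche</a>"
                   + " and <a href=\"http://www.example.com/\">an example</a>.</p></body></html>";
        for (String link : findLinks(doc)) System.out.println(link);
    }
}
```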

More details are in the JavaDoc for NodeFactory, and I've written lots of new sample programs that you'll find in the nu.xom.samples package. Many of them are streaming versions of earlier, less memory efficient samples.

This developed from an idea proposed by John Cowan, based on Simon St. Laurent's work with MOE. There have been things like this before (DOMBuilderFilter in DOM3, MOE, ElementScanner in JDOM, and of course SAX filters), but I don't think any API has done quite as neat a job as XOM now does. This is really powerful stuff. Not only does it make programs faster and much, much smaller, it also makes them much easier to write. For instance, you can easily throw away all white-space-only nodes on build so you're left with only the real content of the document, with no more white space nodes getting in the way of your navigation. I urge you to check this out. It will radically change how you think about processing XML.

This release is API compatible with 1.0d17. All programs that compiled in 1.0d17 should still compile in 1.0d18 without any edits.

1.0d17

This is primarily a bug fix release. There are only very minor API changes, the most significant of which is that XSLTransform is final. Other fixes and improvements in this release include:

Added unit tests for toString methods and fixed various bugs thereby uncovered
IPv6 URIs of the form described in RFC 2732 are now allowed
Fixed various bugs in XInclude. It can now process all the test cases that do not use the xpointer() scheme or unparsed entities. In a couple of cases, it's actually conformant to the as yet unpublished XInclude proposed recommendation rather than the published candidate recommendation.
The correct exception is now thrown when validating with Crimson.
You can now build with Crimson in Java 1.4.1 and earlier.
Removed numerous unused local variables thanks to PMD
Removed some duplicate code in Builder and Verifier thanks to Same

1.0d16

The primary focus of this release is adding unit tests for XSLT, and fixing the bugs they uncovered:

More accurate exception messages from the XSLTransform constructors
XSLT unit tests
The distribution now includes the SAXON jar archive so that XSLT works with Java 1.2 and 1.3 VMs.
Fixed a nasty bug in Element.toXML that was making XSLT transforms fail when elements were in the default namespace
You can now transform a NodeList as well as a complete document

Other assorted improvements in this release include:

The standard jar file no longer includes the samples, tests, and benchmarks packages. You can compile these from source if you need them, but omitting them makes the jar file smaller for developers who want to bundle XOM with their own applications.
The jar file is indexed to improve class loading speed.
I moved SAXConverter and DOMConverter out of the core package into a new nu.xom.converters package. They're fairly special purpose.
Improved compatibility with Java 1.2.
SAX filters can no longer bypass well-formedness checks
Worked around a Xerces and Crimson bug that inhibits relative URL resolution from pathless base URLs such as http://www.cafeconleche.org
The FibonacciSOAPClient sample program works now
Document.insertChild(DocType, position) now throws an IllegalAddException if the Document already has a DocType, rather than silently replacing it.

1.0d15

The primary focus of this release is XInclude. To my knowledge, XOM is now completely conformant with the XInclude candidate recommendation, including:

Fallback support
Support for the XPointer bare name and element() schemes
xml:base attributes are added to included elements as necessary to preserve base URI information

I've also written 24 unit tests for XInclude and fixed numerous bugs including one in the Document and Element copy constructors that failed to preserve base URI.

Other changes in this release include:

The Element.getChildElements(String name, String namespaceURI) method now allows a null or empty string local name to stand for any local name, so you can use this method to get all elements in a certain namespace.
Serializer no longer wraps and indents text when xml:space="preserve", regardless of the setting of indents and maxlength.

This release should be completely compatible with code written against 1.0d14. You should not even need to recompile existing programs.

1.0d14

The primary focus of this release is speed. I've done extensive profiling of the CPU times used by XOM, and rearchitected classes to run faster by both macro and micro optimizations. One of the things I discovered was that parsing and serialization are dramatically slower than in-memory manipulations, typically by three orders of magnitude, even when all I/O is performed between byte arrays. Right now my belief is that any program that does any parsing or serialization (and it's hard to imagine what program wouldn't do at least one of those two) is going to spend so much time doing that, that nothing else is worth optimizing.

That said, I have optimized parsing/document building extensively in this release. It is much, much faster than in previous releases. It should now be competitive with any other tree-based API written in Java, though naturally it's still slower than a straightforward SAX parse because it sits on top of SAX. The biggest effects on speed now are I/O (don't forget to buffer your streams) and the speed of the underlying parser. I'm still recommending Xerces because it's the only one I've found that's almost correct, but you can speed XOM up by about a third by switching to Crimson, and possibly more by switching to Piccolo. However, both of those have nasty bugs that prevent the XOM unit tests from completing successfully. Xerces has a couple of bugs too, but fortunately nothing I couldn't work around.

Contrary to popular belief, most of the optimizations improved both speed and memory use. There were few trade-offs between them. However, there was one notable exception. The Text class is now storing its data internally in UTF-8. This cuts memory usage for mostly ASCII text by about 10-20%. However, it has a noticeable 10% speed penalty. I'm not sure if I'm going to keep this strategy or not. Ideally, I'd like to provide some sort of runtime switch to select this behavior (or not) but I haven't yet figured out the right design to make this happen. The constraints on the design are:

There must be a simple public constructor in the Text class
There must be a simple setValue() method in the Text class
Subclasses from outside the nu.xom package cannot bypass verification
Each Text object should not carry around unnecessary fields (a four byte reference per object adds up fast.)
Frequently called methods such as getValue() and setValue() should not use instanceof. (Profiling has shown this is a performance killer.)
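
To make the trade-off concrete, here is a minimal sketch of a text holder that keeps its value as UTF-8 bytes. It is not XOM's Text class; Utf8TextSketch and storedBytes() are hypothetical names used only for illustration, and the sketch skips verification entirely. It does show where the costs land: one encode per setValue(), one decode per getValue(), and one byte per ASCII character instead of the two a char array needs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a text node that stores its value as UTF-8 bytes.
public class Utf8TextSketch {

    private byte[] data;

    public Utf8TextSketch(String value) {
        setValue(value);
    }

    // Encoding happens once per set.
    public void setValue(String value) {
        this.data = value.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding happens on every get, which is where the speed penalty comes from.
    public String getValue() {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Bytes held for the character data itself (ignores object and array headers).
    public int storedBytes() {
        return data.length;
    }

    public static void main(String[] args) {
        String ascii = "Hello, XOM!";
        Utf8TextSketch text = new Utf8TextSketch(ascii);
        System.out.println("UTF-8 bytes:  " + text.storedBytes());
        System.out.println("UTF-16 bytes: " + ascii.length() * 2);
        System.out.println("Round trip:   " + text.getValue());
    }
}
```

The saving only applies to text that is mostly ASCII. Characters outside that range take two or more bytes in UTF-8, so the advantage shrinks or disappears for other scripts.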

There are no public API level changes in this release. However, the unit tests have been expanded dramatically, which resulted in the discovery and elimination of a number of bugs. Internal changes in 1.0d14 include:

There are the beginnings of a new nu.xom.benchmark package, though it's not even close to stable yet. None of the classes in here are public, but you can run the programs to get some rough timing measurements.
Verification is not performed on build. Instead XOM relies on the parser to perform well-formedness checks. This is a significant speed-up.
Worked around some bugs in Xerces that caused the wrong exception to be thrown when validating.
Improved compatibility with Crimson, the default parser in Java 1.4. However, Crimson bugs still prevent the canonicalization unit tests from completing successfully. (Specifically, it normalizes numeric character references such as &#x0A; in attributes to spaces.) There's no easy way to work around this.
Element.insertChild(String, int) now throws a NullPointerException if the first argument is null
Fixed a nasty bug in Document's copy constructor that caused the prolog and epilog of a document not to be copied.
Disallowed fragment identifiers in system literal URI references, per conformance with the XML spec
Fixed several bugs involving the handling of notation and unparsed entity declarations in the internal DTD subset
Added unit tests for the internal DTD subset.
Fixed build.xml to point at Xerces properly. Consequently "build testui" now works
The API documentation for the nu.xom.samples package is no longer bundled with the main JavaDoc, to indicate that this is not really a part of the public API. For the moment, if you want this you'll have to build it yourself, though it's not very useful. I have not spent a lot of effort on the comments in the samples package.

1.0d13

The primary focus of this release is memory. I've done extensive profiling of the memory used by XOM, plugged memory leaks, and rearchitected classes to use less memory. The Element class has fewer fields than before and uses lazy initialization so many complex fields are null until and unless they're actually used. With this release XOM programs should use less than half the memory they used previously. I now have a rough estimate that for large (a hundred kilobytes or more), primarily ASCII-range XML documents encoded in UTF-8, the corresponding XOM Document object is five to six times the size of the input XML. Less complex documents without attributes or namespaces are likely to be smaller than documents of the same physical size with attributes and namespaces. If the original document is encoded in UTF-16, the size difference is likely to be more like 2 to 3 times.

Measurements are currently showing that almost all the space is taken up by strings and char arrays (mostly inside strings and string buffers). There might be a few places where I can make a nip here or a tuck there, but further large-scale memory optimization would have to look at using UTF-8 internally instead of UTF-16. (Possibly I can get away with doing this in just a couple of places like the Text class.) One area I can still explore is whether it might make sense to intern strings. Generally, the parser does this for anything read from a document, and the compiler does it for string literals; but there might still be a few opportunities here.

I've also done a little work on speed, though not nearly as extensive. Mostly I just picked off some low-hanging fruit the profiler made obvious. More serious work remains to be done. My initial measurements focused on document building. About 25-35% of the time was eaten by the parser. Another 25-35% went into verification, the biggest chunk of which was text content. The rest was divided up into dribs and drabs of actual document building. The single biggest time waster was this method:

    private static boolean isXMLCharacter(int c) {
        
        if (c <= 0xD7FF)  {
            if (c >= 0x20) return true;
            else {
                 if (c == '\n') return true;
                 if (c == '\r') return true;
                 if (c == '\t') return true;
                 return false;
            }
        }

        if (c < 0xE000) return false;  if (c <= 0xFFFD) return true;
        if (c < 0x10000) return false;  if (c <= 0x10FFFF) return true;
        
        return false;
    }

Even small optimizations here could have a large effect, so let me know if you see any. However, I'm probably going to redesign the XOMHandler class in 1.0d14 so it bypasses verification. The assumption is the parser will have already checked all this.
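
One candidate optimization, offered purely as a sketch and not as anything XOM adopted, is to precompute the answer for every code point below 0x10000 in a lookup table, so the common case becomes a single array read. The XmlCharTable class and isXMLCharacterFast name are hypothetical. Whether this actually beats the branchy version depends on the VM and on cache behavior, so it would need to be measured with a profiler before anyone relied on it.

```java
public class XmlCharTable {

    // The method from the release notes, unchanged in behavior.
    static boolean isXMLCharacter(int c) {
        if (c <= 0xD7FF) {
            if (c >= 0x20) return true;
            return c == '\n' || c == '\r' || c == '\t';
        }
        if (c < 0xE000) return false;
        if (c <= 0xFFFD) return true;
        if (c < 0x10000) return false;
        return c <= 0x10FFFF;
    }

    // One entry per code point in 0x0000..0xFFFF, filled in once at class load.
    private static final boolean[] BMP = new boolean[0x10000];
    static {
        for (int c = 0; c < BMP.length; c++) {
            BMP[c] = isXMLCharacter(c);
        }
    }

    // Table lookup for the low range, a simple range check above it.
    static boolean isXMLCharacterFast(int c) {
        if (c >= 0 && c < 0x10000) return BMP[c];
        return c >= 0x10000 && c <= 0x10FFFF;
    }

    public static void main(String[] args) {
        // Check that the two versions agree everywhere, including out-of-range input.
        for (int c = -1; c <= 0x110000; c++) {
            if (isXMLCharacter(c) != isXMLCharacterFast(c)) {
                throw new IllegalStateException("Mismatch at 0x" + Integer.toHexString(c));
            }
        }
        System.out.println("Both versions agree");
    }
}
```

The table trades memory for branches: a boolean array of this size typically occupies about 64 KB.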

There are a few API level changes in this release:

The arguments to insertChild (and checkInsertChild and checkRemoveChild) have been reversed. These methods are now:

public void insertChild(Node child, int position)
protected void checkInsertChild(Node child, int position)
protected void checkRemoveChild(Node child, int position)

The previous order just didn't feel natural to me.

The removeChild methods now return the Node they remove:

public Node removeChild(int position)
public Node removeChild(Node child)

The Builder method
```
public Document build(String document, String baseURI)
```
is now declared to throw an IOException like the other build() methods because an IOException can occur while parsing the external DTD subset.
The equals() and hashCode() methods were removed from the XSLTransform class. They're probably not necessary, and their behavior was underspecified.
Several additional methods in Element were marked final: getAttributeCount(), getNamespacePrefix(int index), removeChildren(), and getAttribute(int). Their previous non-finality was an oversight.

In addition, there are a number of small changes in behavior that don't change the API:

The serializer now recognizes the IBM037 encoding (a.k.a. CP037, EBCDIC-CP-US, EBCDIC-CP-CA, EBCDIC-CP-WA, EBCDIC-CP-NL, and CSIBM037). Note that EBCDIC is a real pain in the ass, and Java's encoders do not handle it correctly for either input or output. (NEL, 0x85, gets mapped to linefeed on input and vice versa on output.) For output, Serializer transparently uses a custom subclass of OutputStreamWriter that does handle EBCDIC correctly. EBCDIC input is still broken, at least for parsers that rely on Java to do EBCDIC-Unicode conversions.
If the serializer's line separator is set, then all line separators are changed to that separator on output. If the line separator is not explicitly set, then all line breaks in source text are preserved as is.
Improved unit testing for the Serializer and Builder
The public and system IDs of DocType can now be the empty string, in conformance with the XML spec.
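
The line separator rule described above can be sketched in a few lines. This is only an illustration of the stated behavior, not Serializer's implementation; the LineSeparators class and its apply method are hypothetical, and a null separator stands in for "not explicitly set".

```java
import java.util.regex.Matcher;

public class LineSeparators {

    // If a separator has been set, every line break (CRLF, CR, or LF) becomes
    // that separator. If none is set (null here), the text passes through as is.
    static String apply(String text, String lineSeparator) {
        if (lineSeparator == null) return text;
        // CRLF is listed first so it is treated as one break rather than two.
        return text.replaceAll("\r\n|\r|\n", Matcher.quoteReplacement(lineSeparator));
    }

    public static void main(String[] args) {
        String mixed = "one\r\ntwo\rthree\nfour";
        String unix = apply(mixed, "\n");
        System.out.println(unix.equals("one\ntwo\nthree\nfour"));
        System.out.println(apply(mixed, null).equals(mixed));
    }
}
```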

Finally, there were a number of small bug fixes, and lots of code cleanups throughout. The most significant bug fix involved setting or changing the namespace URI of XHTML elements (and other elements that use the default namespace).

1.0d12

This release removes the insertBefore and insertAfter methods from ParentNode because:

They're redundant with other methods
They don't really fit into XOM's index-based access style
Experience has shown they're not commonly used
I'd rather hit 1.0 with too few methods than too many. It's easier to add a method in the future than to take it away.

However, if anyone howls too loudly about this, I can probably be convinced to put them back in.

This release also fixes a bug that arose when removing the namespace from an element that had attributes, such as might occur when converting XHTML to plain vanilla HTML.

1.0d11

The new feature in this release is an ANT build file. This should make it much easier to compile XOM from source. ANT is not included though. You'll have to download and install it separately.

There are no API-level changes in this release. All code that ran before should still run. This release does fix three assorted bugs reported by users:

Worked around a bug in later versions of Xerces, which don't like null entity resolvers
Allow base URIs to contain % escapes
Fixed a bug that throws NullPointerException when serializing documents without a base URI with preserveBaseURI set

Not surprisingly these all appeared in the Builder and Serializer classes, which out of all the classes in XOM are the least well-covered by unit tests. I've expanded the unit tests to catch these and related bugs. The unit tests all pass, assuming you use a non-buggy SAX2 parser. However, if you run the JUnit GUI from the ANT build file, some confusing class loader issues cause the more-buggy Crimson to be loaded instead of the less-buggy Xerces. This breaks four unit tests. Everything should pass if you run the tests directly instead of from ANT. (That is, type "java -Xmx96m junit.swingui.TestRunner nu.xom.tests.XOMTests" instead of "ant testui".) If anyone can explain to me how I might fix this, I'd appreciate it.

1.0d10

This release fixes various bugs in namespaces, and makes one API change. The declareNamespace method is once again addNamespaceDeclaration.

Under the hood, however, there are much more significant changes in namespace handling, and these are likely to break some existing applications. In particular,

The namespace prefix of an element in the default namespace (including no namespace at all) is now the empty string, not null.
getNamespaceDeclarationCount now counts all the local namespaces of the element; not just additional namespace declarations. It has at least one entry for the namespace of the element (even if the element is in no namespace), one namespace for each attribute in a namespace, and one namespace for each additional namespace declaration. However, namespaces used multiple times are only counted once. Namespaces in-scope from an ancestor but not directly used on the element are not included. getNamespacePrefix(int i) iterates across this list of local namespaces. Chances are all code that calls either of these two methods will need to be rewritten.
getNamespacePrefix("") should now always return the default namespace in scope. If no default namespace is in scope it returns the empty string, not null.

1.0d9

Removed vestigial getNextSibling() and getPreviousSibling() methods from Document. These should have been removed earlier.

In Comment:

Renamed check to checkValue
Renamed setData to setValue

In ProcessingInstruction class:

Renamed checkData to checkValue
Renamed setData to setValue

In Text:

Renamed check to checkValue
Renamed setData to setValue

In ParentNode:

Renamed checkRemove to checkRemoveChild for symmetry with checkInsertChild
Moved these two methods down into Element:

public final void appendChild(String text)
public final void insertChild(String text, int position)

Fixed Builder bug that prevented parsing File objects whose filenames contained spaces and other non-URL legal characters

Fixed equals() method in Attribute.Type to work in multi-classloader environments

Corrected usage instructions in samples programs to include the package name

Added checks on values of xml:base attributes that they are legal IRIs. Mainly this involves checking the hex escaping.
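
As a rough idea of what checking the hex escaping involves, the sketch below verifies that every % in a string is followed by exactly two hexadecimal digits. It is an assumption about the shape of such a check, not XOM's actual code, and PercentEscapes and hasValidPercentEscapes are hypothetical names. A full IRI check would have to look at much more than this.

```java
public class PercentEscapes {

    // True if c is one of 0-9, a-f, or A-F.
    static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    // Every % must be followed by two hex digits; anything else is malformed.
    static boolean hasValidPercentEscapes(String uri) {
        for (int i = 0; i < uri.length(); i++) {
            if (uri.charAt(i) == '%') {
                if (i + 2 >= uri.length()) return false;          // escape runs off the end
                if (!isHexDigit(uri.charAt(i + 1))) return false;
                if (!isHexDigit(uri.charAt(i + 2))) return false;
                i += 2;                                           // skip the two digits
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasValidPercentEscapes("http://www.example.com/my%20file.xml"));
        System.out.println(hasValidPercentEscapes("http://www.example.com/bad%2Gescape"));
        System.out.println(hasValidPercentEscapes("http://www.example.com/truncated%2"));
    }
}
```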

1.0d8

XSLT works (modulo some obscure bugs in handling the undeclaration of the default namespace. I need to get some clarification on the proper behavior of SAX processors to fix this.) The TrAX XOMSource and XOMResult classes are not yet public because I'm still thinking about the proper API for these, but you can use the XSLTransform class for most use-cases. You'll need a TrAX compliant XSLT engine such as Saxon or Xalan-J 2.4 somewhere in your classpath to use this.

It is now possible to undeclare the default namespace on a prefixed element by passing the empty string as the prefix and URI to declareNamespace().

1.0d7

Added constraint that an element cannot have two attributes with the same local name and same namespace URI, but different prefixes.

Changed automatic attribute replacement to depend on local name and namespace URI and never on qualified name alone.

Removed the getFirstChild(), getPreviousSibling(), and getNextSibling() methods from Node. These really didn't fit the XOM model of indexed access, and were slower than the indexed equivalents.

Added indexOf() method to ParentNode that returns the position of a given node within its parent, or -1 if the node is not a child of this ParentNode. This is helpful for those few cases where you do need to identify a node's sibling.

public int indexOf(Node child)

Spell checked the API documentation

Moved XOMResult into the nu.xom.transform package. XSLT still doesn't work, but it's a little closer to working.

1.0d6

This release makes very limited backwards incompatible changes to the API. (A few formerly public methods in Serializer are now protected.) Almost all code that previously compiled and ran with 1.0d4 and 1.0d5 should still compile and run. New features in the API in this release include:

Namespace URIs must now be absolute URI references
Element.toXML now generates empty-element tags for empty elements
Added a nu.xom.xincluder package to provide XInclude support. The samples package includes a driver program that uses this to resolve XIncludes in existing documents.
Added a nu.xom.canonical package to provide Canonical XML serialization. The samples package includes a driver program that can canonicalize documents.

Serializer has four new protected methods to provide subclasses with more access to the underlying OutputStream:

protected final void writePCDATA(java.lang.String text) throws IOException
protected final void writeAttributeValue(java.lang.String value) throws IOException
protected final void writeMarkup(java.lang.String text) throws IOException
protected final void breakLine() throws IOException

In addition, several bugs were fixed:

Fixed TextWriter bug that prevented the line separator from being changed
Fixed a bug that allowed the namespace URI of a prefixed element to be changed to the empty string.
Fixed a bug that allowed the prefix of an element to be changed to something that conflicts with one of its attributes or additional namespace declarations
Fixed a bug that prevented the detach() method from working on leaf nodes
Fixed a bug pointed out by Laurent Bihanic in getNamespaceURI(String prefix) that failed to return namespace URIs from more than one level up in the hierarchy
Fixed a cosmetic bug in the handling of nbsp in ISO-8859-11 Thai
Relative URLs in system identifiers for DTDs are now resolved against the base URI of the document specified in the builder instead of the current working directory.

1.0d5

This release makes no backwards incompatible changes to the API. All code that previously compiled and ran with 1.0d4 should still compile and run. New features in the API in this release include:

I've added getName(), equals(), hashCode(), and toString() methods to the Attribute.Type inner class. Environments with multiple class loaders should use the equals() method instead of direct equality comparison.
I added a new build method to Builder that builds a XOM Document from a java.io.File.
I added two more build methods to Builder that allow the base URI to be specified when building from a Reader or an InputStream.
I added an experimental build method to Builder that builds a XOM Document directly from a String containing well-formed XML.
I cleaned up the internal code in Builder substantially by refactoring duplicate code into private methods.
I fixed a bug that was preventing the default XMLReader from being loaded in some circumstances
Serializer now supports all defined ISO-8859 character sets, including:
- ISO-8859-1
- ISO-8859-2
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- ISO-8859-10
- ISO-8859-11
- ISO-8859-13
- ISO-8859-14
- ISO-8859-15
- ISO-8859-16
Note that although XOM supports them, not all Java virtual machines do.
Serializer now matches character set names case-insensitively as suggested by the XML specification.
Fixed a bug in UnicodeWriter that was preventing reserved characters such as & and < from being escaped when the encoding was some variant of Unicode. (This is more evidence that premature optimization is the root of all evil. I just couldn't resist an obvious optimization in the UnicodeWriter class, and it came back to bite me in the ass.)
Fixed a cosmetic bug that added unnecessary xmlns="" declarations on root elements by Serializer and toXML in Element.
Fixed incorrect hexadecimal escape sequences generated by TextWriter
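
To find out which of these encodings a particular virtual machine actually provides, and to see case-insensitive name matching at work, something like the following sketch can help. It uses the JDK's java.nio.charset.Charset class rather than anything in XOM; the CharsetNames class and canonicalName method are hypothetical names for illustration.

```java
import java.nio.charset.Charset;

public class CharsetNames {

    // Returns the VM's canonical name for an encoding label, matched without
    // regard to case, or null if this VM does not provide the encoding.
    static String canonicalName(String label) {
        return Charset.isSupported(label) ? Charset.forName(label).name() : null;
    }

    public static void main(String[] args) {
        String[] labels = {"iso-8859-1", "ISO-8859-1", "ISO-8859-11", "ISO-8859-16"};
        for (String label : labels) {
            String name = canonicalName(label);
            System.out.println(label + ": " + (name == null ? "not supported by this VM" : name));
        }
    }
}
```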

1.0d4

The major addition in 1.0d4 are methods to get and set the base URI of a node. You can invoke getBaseURI from any Node object to retrieve the URL against which relative URLs in that Node should be resolved. This is calculated in keeping with XML Base. That is, if an xml:base attribute is in scope its value is used. Otherwise, the URI of the entity in which the Node appears is loaded. You can change the underlying URI of the entity using the setBaseURI method in ParentNode. When a document is built, the parser fills in the base URI for each node. This is stored separately from xml:base attributes, which are not treated differently than any other attribute. When a document is serialized, you may request that the serializer fill in extra xml:base attributes not present in the infoset to preserve the underlying base URIs. However, since this is a structural change to the document, this feature is turned off by default.
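
The resolution rule itself can be illustrated with java.net.URI from the JDK. The sketch below is a simplification under stated assumptions: it handles a single xml:base value and the entity's URI, and does not walk up through ancestors the way a real implementation must. BaseUriSketch, effectiveBase, and the example.com URLs are hypothetical and are not part of XOM.

```java
import java.net.URI;

public class BaseUriSketch {

    // If an xml:base value is in scope, it is resolved against the URI of the
    // entity and the result becomes the base. Otherwise the entity URI is the base.
    static String effectiveBase(String entityUri, String xmlBase) {
        URI entity = URI.create(entityUri);
        return xmlBase == null ? entity.toString() : entity.resolve(xmlBase).toString();
    }

    // Resolves a relative reference found in a node against that node's base URI.
    static String resolve(String baseUri, String relative) {
        return URI.create(baseUri).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String entity = "http://www.example.com/docs/index.xml";

        // No xml:base in scope: relative URLs resolve against the entity's URI.
        System.out.println(resolve(effectiveBase(entity, null), "logo.png"));

        // xml:base="chapters/" in scope: it shifts the base for everything inside.
        System.out.println(resolve(effectiveBase(entity, "chapters/"), "ch1.xml"));
    }
}
```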

Other API level changes include:

The Attributes and Namespaces classes are no longer part of the public API. Instead the Element class has these four public methods:

public Attribute getAttribute(int index)
public int getAttributeCount()
public int getNamespaceDeclarationCount()
public String getNamespacePrefix(int i)
getStringForm has been renamed toXML
readAttribute has been renamed getAttributeValue
addAdditionalNamespace has been renamed declareNamespace
The removeChildren method has been moved from ParentNode into Element because it's impossible to remove all the children of a Document.
The following protected methods allow subclasses to monitor insertions and deletions from subclasses of Element and Document:

protected void checkInsertChild()
protected void checkRemoveChild()
The following protected methods allow subclasses of Element to monitor namespace declarations:

protected void checkAddNamespaceDeclaration()
protected void checkRemoveNamespaceDeclaration()
The following protected methods allow subclasses of Element to monitor changes of local name, namespace prefix, and namespace URI:

protected void checkLocalName()
protected void checkNamespacePrefix()
protected void checkNamespaceURI()
The missing write(DocType) method has been added to Serializer. This fixes a nasty infinite recursion when serializing documents with document type declarations.

In addition several bugs were fixed, the JavaDoc was further cleaned up and improved, and more than a dozen new unit tests were added.

1.0d3

The major change in 1.0d3 is that the TreeNode class has been replaced by the ParentNode class. The only immediate subclasses of ParentNode are Element and Document. Attribute is the only immediate subclass of Node. The other four node types are subclasses of LeafNode which is a subclass of Node. All navigation methods (getChild, getNextSibling, getParent, etc.) are now in Node. All insertion and deletion methods (appendChild, insertChild, removeChild, etc.) are only available in ParentNode, that is, Document and Element. Other API-level changes since 1.0d2 include:

add is now addAttribute
LeafNode is public

I also spent a lot of time improving the JavaDoc.

1.0d2

I've posted 1.0d2 to fix the first bugs discovered, clean up the source code, and make a few changes to method names that seemed wise. API-level changes since Tuesday night include:

readAttribute is now getAttributeValue
howManyChildren is now getChildCount
