gsnedders


Making Habari Use XHTML

March 13, 2008

Disclaimer: this post does not in any shape or form recommend using XHTML served as text/html. When I use "XHTML", I am referring to content that identifies itself as XHTML through its MIME type (i.e., application/xhtml+xml); likewise, when I use "HTML", I am referring to content that identifies itself as HTML through its MIME type (i.e., text/html). What DOCTYPE you use is entirely irrelevant.

Currently, Habari uses HTML 4.01 Strict, as ensuring content is well-formed XML is hard, especially once you accept any user input. User input poses all kinds of issues for any product that wants to guarantee its output is well-formed XML, especially if you don't want to flatly reject any input that isn't well-formed (by throwing all input through an XML parser on the server): very, very, very few comments are well-formed, and posts often aren't either. Any product that does output XML should do so without ever showing an error to the user. The issues with making Habari output XML are as follows:

  • Everything is dealt with internally as (binary) UTF-8 strings, as PHP lacks any real Unicode support, so it becomes hard to ensure you don't use characters that cause well-formedness errors in XML (e.g., the surrogate blocks, U+FFFE, U+FFFF, and most C0 control characters). Fixing this will require implementing Unicode in PHP. Here's a start.
  • The next major issue is ensuring tag names only contain valid characters, that all tags are opened and closed in order, that all attribute names and values only contain valid characters, and that no attributes occur more than once on a single tag. Due to the craziness of HTML, you really need a full HTML parser to achieve this, and then serialise a DOM you get from that into XML. The only way to achieve this without spending hundreds of hours yourself reverse-engineering IE is to follow HTML 5. This will still result in some content that cannot be serialised as XML (e.g., <span fo"o=f""a>).
  • We also have to make sure all character references are legal, all entities are defined, and there are no lone ampersands that don't start a reference. The first of these is the easiest: we can just read the character reference (e.g., &#x26;) and check if it refers to a legal character. The second is harder: a non-validating XML processor doesn't have to read the external subset (e.g., the DTD that the DOCTYPE points to isn't fetched, so XHTML's entities aren't known), which can lead to fatal errors due to unknown entities. As a result, we have to parse the document and convert such entities either to the actual character they represent or to a character reference, because only five entities exist in all XML documents: &lt;, &gt;, &amp;, &apos;, and &quot;. Lone ampersands are dealt with within the HTML parser as an unknown entity. This again requires an HTML parser.
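To make the third point concrete, here is a minimal sketch of the reference clean-up, written in Python rather than PHP purely for brevity. It assumes the text has already been separated from markup by an HTML parser, so running a regular expression over it is a simplification, not a replacement for that parser. The names normalise_references and is_xml_char are made up for this illustration; the legality check follows the Char production of XML 1.0, and the entity table is Python's standard HTML 4 list.

```python
import re
from html.entities import name2codepoint

# The five entities that are predefined in every XML document.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# A hex reference, a decimal reference, a named reference, or a bare "&".
REFERENCE = re.compile(
    r"&(?:#[xX]([0-9A-Fa-f]+);|#([0-9]+);|([A-Za-z][A-Za-z0-9]*);)?"
)


def is_xml_char(cp):
    """True if the code point is allowed by the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def normalise_references(text):
    """Rewrite references so that a non-validating XML parser accepts them."""

    def fix(match):
        hex_digits, dec_digits, name = match.groups()
        if hex_digits or dec_digits:
            cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
            # An illegal reference becomes U+FFFD instead of a fatal error.
            return "&#x%X;" % (cp if is_xml_char(cp) else 0xFFFD)
        if name in XML_PREDEFINED:
            return match.group(0)
        if name in name2codepoint:
            # Known HTML entity: swap it for a numeric character reference.
            return "&#x%X;" % name2codepoint[name]
        # Lone ampersand or unknown entity: escape the ampersand itself.
        return "&amp;" + match.group(0)[1:]

    return REFERENCE.sub(fix, text)
```

With this sketch, normalise_references("Fish &amp; chips&nbsp;& peas") gives "Fish &amp; chips&#xA0;&amp; peas". Replacing illegal references with U+FFFD is one possible policy; dropping them would do just as well.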

Now, what makes any of this difficult? A lot of things don't specify their character set, so we have to guess it, and we must make sure our output has a specified encoding and only contains characters in that encoding. Oh, and then there's the sheer amount of code needed for a Unicode and HTML 5 implementation in PHP.
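As a rough illustration of the guessing problem, the sketch below (Python, not Habari's PHP) tries a declared charset if there is one, then strict UTF-8, then falls back to windows-1252, and finally emits UTF-8 with anything XML cannot contain replaced by U+FFFD. The fallback order and the function names decode_guessing and to_xml_bytes are assumptions made for this example, not anything Habari does.

```python
import re

# Anything outside the Char production of XML 1.0.
NOT_XML_CHAR = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_guessing(raw, declared=None):
    """Decode bytes using the declared charset, then UTF-8, then windows-1252."""
    for encoding in (declared, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: try the next candidate.
            continue
    # Last resort; the few bytes windows-1252 leaves undefined become U+FFFD.
    return raw.decode("cp1252", errors="replace")


def to_xml_bytes(text):
    """Encode as UTF-8, replacing characters that XML 1.0 forbids."""
    return NOT_XML_CHAR.sub("\uFFFD", text).encode("utf-8")
```

For example, to_xml_bytes(decode_guessing(b"caf\xe9")) yields the UTF-8 bytes for "café", because the lone 0xE9 byte is not valid UTF-8 and is therefore read as windows-1252. The output would still need to be served with a matching charset declaration.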

However, Habari would gain a lot more than just being able to output the content of the blog as XHTML: an HTML parser would allow Pingback 1.0 to be properly implemented; Unicode support would also allow strings to be properly shortened (currently we just cut the string off after n bytes and hope the Unicode decoder isn't one that throws a fatal error on an invalid sequence, but instead replaces it with something like � (U+FFFD)).
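Shortening a UTF-8 string safely doesn't need full Unicode support, only knowledge of how UTF-8 marks continuation bytes. A minimal sketch in Python (again standing in for PHP), assuming the input is already valid UTF-8 and max_bytes is not negative; truncate_utf8 is a name invented for this example:

```python
def truncate_utf8(data, max_bytes):
    """Cut valid UTF-8 bytes to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first byte we would
    # drop is one of them, the cut falls inside a sequence, so back up until
    # the first dropped byte starts a code point.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]
```

For instance, truncate_utf8("héllo".encode("utf-8"), 2) returns b"h" rather than leaving half of the two-byte "é" behind. This only protects code points: a base letter can still be cut off from a combining mark that follows it, and handling that properly does need real Unicode data.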

I would quite like to see Habari store everything as XML internally, as the amount of stuff it would make impossible to store is minimal (and what it couldn't store is rare non-conforming HTML). It would also allow us to make use of this eternal cached_content column in the posts table at long last! It would also allow Habari to internally store namespaced content, and make it possible if so desired to publish a blog as XHTML (allowing that namespaced content to be visible).

Neither WordPress nor Movable Type makes it possible to output XHTML by default. Both need extensive modification to be able to safely output XHTML, and even such modified installs are often still easy to break. I'd love to see Habari being the first major blogging platform capable of outputting XHTML out of the box, and I fully intend to help get it there. I hope to take on the challenge as my Advanced Higher Computing project: with so little done (i.e., just the basis of encoding/decoding Unicode strings), there is still more than enough left to fill a forty-hour project.
