Making Habari Use XHTML

March 12, 2008 (2 comments)

Disclaimer: this post does not in any shape or form recommend using XHTML served as text/html. When I use "XHTML", I am referring to content that identifies itself as XHTML through its MIME type (i.e., application/xhtml+xml); likewise, when I use "HTML", I am referring to content that identifies itself as HTML through its MIME type (i.e., text/html). What DOCTYPE you use is entirely irrelevant.

Currently, Habari uses HTML 4.01 Strict, as ensuring content is well-formed XML is hard, especially once you accept any user input. This poses all kinds of issues for any product that wants to ensure its output is well-formed XML, especially if you don't want to outright reject any input that isn't well-formed XML (by passing all input through an XML parser on the server): very, very, very few comments are well-formed, and posts often aren't either. Any product that does output XML should do so without ever showing an error to the user. The issues with making Habari output XML are as follows:

  • Everything is dealt with as (binary) UTF-8 strings internally, because PHP lacks any real Unicode support, so it becomes hard to ensure you don't use characters that cause well-formedness errors in XML (e.g., the surrogate blocks, U+FFFE, and U+FFFF). Fixing this will require implementing Unicode in PHP. Here's a start.
  • The next major issue is ensuring tag names only contain valid characters, that all tags are opened and closed in order, that all attribute names and values only contain valid characters, and that no attributes occur more than once on a single tag. Due to the craziness of HTML, you really need a full HTML parser to achieve this, and then serialise a DOM you get from that into XML. The only way to achieve this without spending hundreds of hours yourself reverse-engineering IE is to follow HTML 5. This will still result in some content that cannot be serialised as XML (e.g., <span fo"o=f""a>).
  • We also have to make sure all character references are legal, all entities are defined, and there are no lone ampersands that do not begin an entity or character reference. The first of these is the easiest: we can just read the character reference (e.g., &#x26;) and check if it refers to a legal character. The second is harder: in a non-validating XML processor, any external subsets aren't read (e.g., any DOCTYPE isn't read, so XHTML's entities aren't known), which can lead to fatal errors due to unknown entities. As a result, we have to parse the document and convert such entities either to the actual character they represent or to a character reference: there are only five entities that exist in all XML documents: &lt;, &gt;, &amp;, &apos;, and &quot;. Lone ampersands are dealt with within the HTML parser as an unknown entity. This again requires an HTML parser.
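
As a rough illustration of the character and entity checks in the list above, here is a minimal sketch in Python (chosen only for brevity; Habari itself is PHP). The function names are hypothetical, and the regular expression is a deliberate simplification: as noted above, doing this properly needs a real HTML parser. The legal-character ranges follow the Char production of XML 1.0, and the entity table is Python's standard html.entities.name2codepoint.

```python
import re
from html.entities import name2codepoint

# The five entities that exist in every XML document.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Either a complete reference (&name; &#123; &#x7B;) or a lone ampersand.
_REFERENCE = re.compile(r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);|&")


def make_references_xml_safe(text: str) -> str:
    """Rewrite references so a non-validating XML parser accepts them (hypothetical helper)."""

    def fix(match):
        body = match.group(1)
        if body is None:
            return "&amp;"  # lone ampersand
        if body[0] == "#":
            cp = int(body[2:], 16) if body[1] in "xX" else int(body[1:])
            # Illegal character references become U+FFFD.
            return match.group(0) if is_xml_char(cp) else "\uFFFD"
        if body in XML_PREDEFINED:
            return match.group(0)
        if body in name2codepoint:
            # Known HTML entity: emit a numeric character reference instead.
            return "&#x%X;" % name2codepoint[body]
        return "&amp;" + body + ";"  # unknown entity: escape the ampersand

    return _REFERENCE.sub(fix, text)
```

With this sketch, make_references_xml_safe("&nbsp;") gives "&#xA0;" and a lone "&" becomes "&amp;". Replacing illegal references with U+FFFD is one possible policy among several.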

Now, what makes any of this difficult? A lot of things don't specify their character set, so we have to guess it; we must also make sure our output has a specified encoding and only contains characters in that encoding. Oh, and then there's the sheer amount of code needed for a Unicode and HTML 5 implementation in PHP.

However, Habari would gain a lot more than just being able to output the content of the blog as XHTML: an HTML parser would allow Pingback 1.0 to be properly implemented; Unicode support would also allow strings to be properly shortened (currently we just cut the string off after n bytes and hope the Unicode decoder isn't one that throws a fatal error on an invalid sequence, but instead replaces it with something like � (U+FFFD)).
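
To make the shortening problem concrete, here is a minimal Python sketch of cutting a UTF-8 byte string at a character boundary instead of at an arbitrary byte. The function name is hypothetical, and it assumes the input is already valid UTF-8 and that max_bytes is not negative; it does nothing about combining sequences, which a full Unicode implementation would also have to consider.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a multi-byte sequence."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Bytes of the form 10xxxxxx are continuation bytes. If the first byte
    # we would drop is one of them, the cut falls inside a character, so
    # back up until the cut lands just before a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]
```

For example, "héllo" encoded as UTF-8 is six bytes; cutting it to two bytes this way keeps only "h" and does not leave half of "é" behind.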

I would quite like to see Habari store everything as XML internally, as the amount of stuff it would make impossible to store is minimal (and what it couldn't store is rare non-conforming HTML). It would also allow us to make use of this eternal cached_content column in the posts table at long last! It would also allow Habari to internally store namespaced content, and make it possible if so desired to publish a blog as XHTML (allowing that namespaced content to be visible).

Neither WordPress nor Movable Type makes it possible to output XHTML by default. Both need extensive modification to be able to safely output XHTML. It is often still easy to break such modified installs. I'd love to see Habari being the first major blogging platform capable of outputting XHTML out of the box, and I fully intend to help get it there. I hope to take on the challenge as my Advanced Higher Computing project — with so little done (i.e., just the basis of encoding/decoding Unicode strings), there is still by far enough to do to fill up a forty-hour project.

Leaving WordPress

October 11, 2007 (1 comment)

WordPress

While I've had various things to do with the WordPress community back as far as 2003, it is finally time to completely part ways: I have become less and less happy with the direction that WordPress has been taking over the past year, though the issues are really far deeper and go back, in some cases, to the very origins of the project.

Ever since the project started, Matt Mullenweg has progressively become more and more protective of the source base, especially since WordPress.com launched, making it progressively harder to get any code into the repository. For example, having discussed a bug at great length with a number of people including Matt, and written a patch for it, taking a great deal of my time, Matt then changed his mind and on his own decided that it wouldn't get committed. There have been plenty of occurrences of Matt single-handedly making decisions. If there is a secret cabal where decisions are really made, can you at least stop claiming that the development community has any say in anything?

The entire project now seems to be run so that Automattic has software with as many features as possible to profit from, with little regard for any bugs or any features that are mostly invisible to the end user.

It took over two years for WordPress to have Atom 1.0 support added. Why? Matt wasn't happy with the patches, despite there being plenty around for a long time which were bug-free. When Atom 1.0 support was added, for some reason or another, the comment feeds use a different pipeline (actually, they use a different set of string concatenations) and were outputting absolute rubbish: |link|@content instead of @href; |link|@type contained the blog name, not a MIME type; |updated| and |published| contained RFC 822 dates, instead of RFC 3339 dates (which are totally and utterly different). Also, it is possible to get invalid bytes into the feeds, which under XML is a well-formedness error, and must therefore cause a fatal error (luckily for WordPress, out of the major browsers, only IE/Win actually obeys this). So, when a patch finally gets committed, it is too much effort to visit the Feed Validator before release? Why be so overly protective of the source base if you let such rubbish in anyway? Also, as one final note on the subject of Atom 1.0 support, Matt said that this is an enhancement, not a bug, despite Atom 0.3 being an obsolete I-D (Internet-Draft): one of a series of documents made publicly available for comments before publication, and liable to change totally. Any change to a draft is a bug in any older implementation.
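
For anyone unsure how different the two date formats are, here is a small Python sketch that renders the same instant both ways. The timestamp is an arbitrary, hypothetical one, and email.utils.format_datetime produces the RFC 822 style as later updated (four-digit year), which is the form feeds normally use.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# An arbitrary instant, chosen only for illustration.
moment = datetime(2007, 10, 11, 12, 0, 0, tzinfo=timezone.utc)

# RFC 822 style, as used by RSS: "Thu, 11 Oct 2007 12:00:00 +0000"
rfc822_style = format_datetime(moment)

# RFC 3339 style, as required by Atom 1.0: "2007-10-11T12:00:00+00:00"
rfc3339_style = moment.isoformat()

print(rfc822_style)
print(rfc3339_style)
```

An Atom 1.0 consumer expecting the second form has no reason to make sense of the first.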

While some may argue that the above is merely an enhancement, surely nobody would argue that a high priority critical bug that caused IRIs to be stripped should be fixed in a plugin? This really seems to be the case.

Furthermore, I have serious issues with WordPress's focus on aesthetics, web standards, and usability. If there is a focus on aesthetics, why is Kubrick the default theme, even though there are far better templates available? If there is a focus on web standards, why did I even write the above paragraph? Why does WordPress use a transitional DOCTYPE by default? Is WordPress still transitioning to standards? Why is WordPress.org served as it SHOULD NOT be (i.e., XHTML 1.1 served as text/html; even XHTML 1.0 would be better!)?

Lastly, the thing that finally made me think that I should totally get off WordPress (my blog had for a while been running the security-fixes-only 2.0 branch, to avoid the chaos and the insane bugginess of later releases) was that after having decided to have 120-day release cycles, including one month after a feature freeze, Matt went and committed what is one of the largest changes in several years to WordPress within the feature freeze, causing the entire release to be pushed back. This, IMO, is the ultimate example of Matt focusing on maximising WordPress.com features and profits without caring about the effects it might have on the open-source project.

Habari

So, where does the future lie? Habari. Habari is developed under the Apache meritocracy model, so it should be far harder for some benevolent person in a position of power to turn against the wishes of the community as a whole. Also, Habari is developed for modern web hosting environments, makes use of open standards, and is not afraid to go against the de-facto standards of the day, such as XHTML superseding HTML, provided reasoning is given. Due to Habari's design, using PDO prepared statements and an XML serialiser for whatever XML is output, it is far less fragile than something, such as WordPress, that has been patched together over the years every time something breaks.

Also, due to Habari's organisation, it is possible for someone completely new to the community to step in and say something that will totally change the direction of the project (after, as with many things, it has been voted upon), something that cannot be done by people who have been around WordPress for a long time, let alone someone who is completely new.

Many of the largest contributors to Habari were WordPress contributors previously (some of them well known within the community) who left WordPress, often for reasons similar to the above. As far as I can see, since a number of them walked away from WordPress, WordPress has gotten progressively worse and buggier. These are people who have experience working with blogs. They know what past mistakes have been made. Above all, they are willing to change their opinions if you give them good reasoning.

Redesigned

For the first time since 2005 (when I got through three designs in ten months), my site has been redesigned. Unlike the earlier designs that were thrown together relatively quickly, this one has taken over a year: minimalism, when done properly, is no easier than something more visually complex. The colour scheme has changed around five times throughout the course of the design, and the number of images has varied from zero to two (yes, that's the most complex the design got).

More will eventually be posted about the design, and the inspiration behind it. Alas, the future will hold different challenges to those in the past, and any future designs will therefore be different to what this is.
