Why XHTML As text/html Is Evil

April 7, 2007

Time to drag this blog away from my personal life, which the last nine posts have been about (my writing got its own site, which has since been merged back into one site), and back to web development.

A large number of web developers prefer XHTML served as text/html to HTML, but is this really an advisable thing to do?

Under the advice from the old HTML WG, UAs should parse XHTML, when served as text/html, as HTML. This causes several problems, as HTML is an SGML application, and SGML's behaviour differs in places from XML's. For example:

<img src="test.gif" />test<img src="test.gif" />

When parsed as SGML, this produces this DOM tree:

img
@src "test.gif"

">test"

img
@src "test.gif"

">"

When the above DOM tree is rendered in a browser the first <img> and the final ">" are displayed. However, when parsed as XML, this produces the following, very different, DOM tree:

img
@src "test.gif"

"test"

img
@src "test.gif"

In complete contrast, both <img> elements are shown, with the word "test" between.
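The XML side of this comparison can be checked with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the wrapping `<div>` and the `describe` helper are assumptions added for illustration (an XML document needs a single root element), not part of the original example.

```python
import xml.etree.ElementTree as ET

# The fragment from the example, wrapped in a hypothetical <div>
# because an XML document must have exactly one root element.
fragment = '<div><img src="test.gif" />test<img src="test.gif" /></div>'
root = ET.fromstring(fragment)


def describe(parent):
    """Flatten the children of `parent` into a list of element/text nodes."""
    nodes = []
    if parent.text:
        nodes.append(("text", parent.text))
    for child in parent:
        nodes.append(("element", child.tag, dict(child.attrib)))
        # ElementTree stores text that follows an element in its `tail`.
        if child.tail:
            nodes.append(("text", child.tail))
    return nodes


nodes = describe(root)
for node in nodes:
    print(node)
```

The printed nodes should match the XML tree above: an `img`, the text "test", then a second `img`, with no stray ">" characters.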

How, then, do real-world browsers create the same DOM tree for both HTML and XHTML in the above example (thereby making the difference irrelevant in practice)? HTML error handling: real-world HTML implementations haven't used SGML parsers for many years, as billions of web pages would break. This error handling, however, is completely and utterly undefined, and so if you follow the specifications you end up with the SGML rendering above. UAs implement it by reverse engineering one another, mainly IE, as the predominant UA.
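To see what a non-SGML, tag-soup style parser does with the same markup, here is a small sketch using Python's standard `html.parser`. This is only an illustration: Python's parser is neither an SGML parser nor a browser's parser, and the `EventLogger` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags and text in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <img ... /> via the default handle_startendtag.
        self.events.append(("element", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("text", data))


logger = EventLogger()
logger.feed('<img src="test.gif" />test<img src="test.gif" />')
logger.close()

for event in logger.events:
    print(event)
```

Like the browsers described above, this parser treats the trailing "/>" as part of the tag rather than as a null end tag followed by a literal ">", so the ">" never shows up as text.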

As this behaviour is completely undefined, can it truly be relied upon? For the most part it can. There are a few UAs that still use SGML parsers, but they are mostly very minor, and even they make exceptions for things like the example above. That behaviour is, however, inconsistent where it does exist.

Part of the work of the new HTML WG is to define a version of HTML that is compatible with legacy UAs, is a complete specification rather than a delta, and is not based upon SGML, which should help to bring interoperability.

If you still think XHTML as text/html is better because the basics of the rendering are consistent between UAs, ask yourself what advantages it brings over HTML. You cannot embed other namespaces in XHTML as text/html (you can't actually have other namespaces within a strictly conforming XHTML document, even when served as application/xhtml+xml, as the DTD is a normative part of the specification, and therefore other namespaces' elements and prefixes would have to be defined in the DTD). Does it make it any easier to move over to serving it as application/xhtml+xml? No, because you have no guarantee that the document is well-formed: it has been parsed by the same parser that would be used if it were HTML. As it is being parsed by the same parser, there is no advantage for XHTML.
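The well-formedness point can be checked without a browser. The sketch below, again using Python's standard `xml.etree.ElementTree`, asks an XML parser whether a piece of markup is well-formed; `is_well_formed` and the two sample strings are hypothetical names chosen for illustration. A text/html parser would accept both samples without complaint, which is why serving XHTML as text/html tells you nothing about whether it would survive as application/xhtml+xml.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts `markup` as well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Fine for an HTML parser, but the unclosed <br> is fatal in XML.
tag_soup = "<p>one<br>two</p>"
# The same content written with XML's empty-element syntax.
xhtml_style = "<p>one<br />two</p>"

print(is_well_formed(tag_soup))
print(is_well_formed(xhtml_style))
```

This only tests well-formedness, not validity against the XHTML DTD, which is a separate and stricter check.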

HTML, however, has a very major advantage: the parser is written to parse HTML. It can easily be parsed without any parse errors, while using all the elements in the specification. XHTML cannot be well-formed and valid while using all the elements in the specification when served as text/html.

It's easier to create a document that can be parsed without parse errors when it's actually possible! Even for that reason alone, HTML should be used instead of XHTML.
