Today marks my sixteenth birthday: the age of majority in Scotland. I can now have sex and get married, which, seeing as I've never posted anything tagged with lust, will be no issue and will happen in no time.
With the start of the 2008 F1 season just hours away, time to post some predictions:
Disclaimer: this post does not in any shape or form recommend using XHTML served as text/html. When I use "XHTML", I am referring to content that identifies itself as XHTML through its MIME type (i.e., application/xhtml+xml); likewise, when I use "HTML", I am referring to content that identifies itself as HTML through its MIME type (i.e., text/html). What DOCTYPE you use is entirely irrelevant.
Currently, Habari uses HTML 4.01 Strict, as ensuring content is well-formed XML is hard, especially once you take any user input. This poses all kinds of issues to any product that wants to ensure its output is well-formed XML, especially if you don't want to flatly reject any input that isn't well-formed XML (by throwing all input through an XML parser on the server): very, very, very few comments are well-formed, and posts often aren't either. Any product that does output XML should do so without ever showing an error to the user. The issues with making Habari output XML are as follows:
<span fo"o=f""a>).&) and check if it refers to a legal character. The second is harder: in a non-validating XML processor, any external subsets aren't read (e.g., any DOCTYPE isn't read, so XHTML's entities aren't known), which can lead to fatal errors due to unknown entities. As a result, we have to parse the document and convert such entities either to the actual character they represent or to a character reference: there are only five entities that exist in all XML documents: &lt;, &gt;, &amp;, &apos;, and &quot;. Lone ampersands are dealt with within the HTML parser as an unknown entity. This again requires an HTML parser.
Now, what makes any of this difficult? A lot of things don't specify their character set, so we have to guess it; we must make sure our output has a specified encoding and only contains characters in that encoding. Oh, and then there's the sheer amount of code needed for a Unicode and HTML 5 implementation in PHP.
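To make the entity conversion described above concrete, here is a minimal sketch in Python (not Habari's PHP, and not a replacement for a real HTML parser). It assumes the input is a plain text fragment, uses the standard library's HTML 4 entity table, and the function name `entities_to_char_refs` is purely illustrative. Named entities other than the five predefined XML ones become numeric character references; unknown names have their ampersand escaped so the result stays well-formed. Lone ampersands are deliberately left alone here, since the post leaves those to the HTML parser.

```python
import re
from html.entities import name2codepoint  # HTML 4 named entities

# The five entities every XML processor knows without reading a DTD.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Matches a named entity reference such as &eacute; (not numeric ones).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_char_refs(text):
    """Rewrite HTML named entities as numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            # Safe in any XML document; keep as written.
            return match.group(0)
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown entity: escape the ampersand so it is treated as text.
            return "&amp;" + name + ";"
        return "&#%d;" % codepoint

    return ENTITY_RE.sub(replace, text)


print(entities_to_char_refs("caf&eacute; &amp; cr&egrave;me&nbsp;&bogus;"))
```

A regular expression like this knows nothing about comments, CDATA sections, or attribute context, which is part of why the post argues that a full HTML parser is needed.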
However, Habari would gain a lot more than just being able to output the content of the blog as XHTML: an HTML parser would allow Pingback 1.0 to be properly implemented; Unicode support would also allow strings to be properly shortened (currently we just cut the string off after n bytes and hope the Unicode decoder isn't one that throws a fatal error on an invalid sequence, but instead replaces it with something like � (U+FFFD)).
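As a rough illustration of what shortening a string properly could look like, the sketch below (in Python, with the hypothetical name `truncate_utf8`) cuts a UTF-8 string to at most a given number of bytes, backing up so that the cut never lands inside a multi-byte sequence. It assumes a non-negative byte limit and only protects code point boundaries; it does not try to keep combining sequences together.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text that fits in max_bytes of UTF-8."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is mid-character, so back up
    # until the cut sits in front of a lead byte or an ASCII byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode("utf-8")


print(truncate_utf8("héllo", 2))  # the two-byte "é" does not fit after "h"
```

Because the cut always falls on a code point boundary, the final decode cannot fail on valid input, so no replacement character is ever needed.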
I would quite like to see Habari store everything as XML internally, as the amount of stuff it would make impossible to store is minimal (and what it couldn't store is rare non-conforming HTML). It would also allow us to make use of this eternal cached_content column in the posts table at long last! It would also allow Habari to internally store namespaced content, and make it possible if so desired to publish a blog as XHTML (allowing that namespaced content to be visible).
Neither WordPress nor Movable Type makes it possible to output XHTML by default. Both need extensive modification to be able to safely output XHTML. It is often still easy to break such modified installs. I'd love to see Habari being the first major blogging platform capable of outputting XHTML out of the box, and I fully intend to help get it there. I hope to take on the challenge as my Advanced Higher Computing project: with so little done (i.e., just the basis of encoding/decoding Unicode strings), there is still more than enough to do to fill a forty-hour project.
In the month since the IE team announced its intention to have an opt-in into IE8 Standards Mode, both the team and Microsoft as a whole have had a huge amount of anger directed at them.
Taking a quick look through the comments on the IE blog again, I believe a lot of this anger is quite irrational: a lot of what is being said would make sense if standards-compliant sites were the majority of sites, but they aren't: they are a tiny minority (i.e., a single-figure percentage at most). Breaking almost every non-compliant website would cause huge issues: someone without a technical background will likely conclude that IE8 breaks sites that work fine in IE7, therefore IE8 is broken. The web is based around the fact that anyone can create a website, and increasing the barrier of entry by breaking existing sites would be a good way to make sure your product isn't used. The end user couldn't give a damn about whether IE supports HTML 4.01 or CSS 2.0 (the current recommendations); they care that their favourite site works. If it doesn't, they complain to Microsoft. Quite frankly, standards be damned. Conformance requirements are wholly irrelevant when the real world relies on behaviour that is at odds with what is in the standard: if reality is so reliant on UAs doing something different to what is specified, the specification needs to be fixed, not the other way around.
Switches aren't something new, neither is opting in (the DOCTYPE switch is exactly that, an opt-in into standards mode), nor is a switch first used by Microsoft (IE5 for Mac was the first browser to ship with DOCTYPE switching, in 2000; it was hinted at on the Mozilla mailing lists back in 1998, but nothing happened about it in Mozilla until after IE5 for Mac shipped). Just using a W3C DOCTYPE isn't enough to get standards mode: transitional and frameset DOCTYPEs can and do cause quirks mode to be used. It requires a subset of what is conformant to trigger standards mode. Likewise, requiring a meta element is requiring a subset of HTML (albeit, this time requiring a specific element/attribute combination instead of requiring a specific DOCTYPE public/system identifier combination). DOCTYPE switching is accepted almost universally within standards communities, yet a proposed new switch is not.
Now what actually causes the issue that necessitates a new switch? With development of IE having been stopped after the release of IE6, a large number of sites have grown up that are dependent on IE being broken forever, often at the same time as supporting other browsers fine. What these sites depend upon varies from relying on the DOM not being a tree to using conditional comments that don't specify a version. Netscape won the scripting wars (with JavaScript) and Microsoft won the browser wars (with IE). With the wars being won by different products, scripting is incompatible. There were scripting engines far less compatible than JScript; it's just that they died off.
Why is this an issue? IE can't overnight move from JScript to ECMAScript (a standard which is based upon JavaScript) without breaking a large number of sites which either won't be updated at all or won't be updated in a timely manner. And it's not just small non-compliant sites that have issues: major scripting products like Prototype have browser-specific hacks all over the place (and not just for IE, but other browsers too, including the standards community's beloved Firefox), which will break horribly with any major scripting changes. Just take a look at some of the users of Prototype to get an idea of how many major sites could easily break with a new version of IE. It's not that these major sites won't be updated ever, but most corporations aren't using the latest versions of everything on their website, even after a new browser release. If you were in the IE team's position, would you really want to break so many major sites? It's commercial suicide. The pages that won't be changed are split into two groups: those that physically cannot be changed (e.g., HTML/CHM help files on CDs), and those that are unmaintained. Content that causes issues isn't confined to intranets.
There's no point in blaming the IE team for getting the web into the situation that it's in: a large number of issues predate Trident, back when IE used Mosaic, and other issues are a result of multiple things, not least things being invented during the browser wars (DOM and CSS most notably) and browser vendors trying to extend them to outdo the opposition (Netscape is just as guilty of this), but also because of the vagueness of how they are specified (or, in some cases, the total lack of specification).
As the issue is with pages that specifically hack for IE (and those hacks now cause issues), this doesn't mean other browsers will need to spend their time reverse-engineering even more modes — the issue is mostly with sites that rely on IE's own quirky behaviour while at the same time relying on other browsers working fine. It is a non-issue for everyone apart from Microsoft.
That all said, the proposed switch, X-UA-Compatible, is, in my opinion, a horrible solution to the problem at hand. We need another switch that behaves like the DOCTYPE switch (i.e., it doesn't limit you to a single browser and a single version of that browser). I admit, I'm not sure how to implement this (though I think a version attribute on the MIME type, like Gecko's JavaScript 1.7 solution, is a good basis for ideas). It will need to be opt-in, like the DOCTYPE switch, without question, and also not conflict with Gecko's JavaScript version switch. Let me wave a banner saying I use CSS 2.1 and ECMA-262 3rd edition (the latest version of ECMAScript), and I'll be happy.
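To give a feel for what a version attribute on the MIME type means mechanically, here is a small Python sketch of reading such a parameter from a media type string. It is only an illustration of the idea, not a proposal for how any browser should implement a switch; the helper name `media_type_version` is hypothetical, and the parsing leans on the standard library's `email.message.Message` for the parameter syntax.

```python
from email.message import Message


def media_type_version(content_type):
    """Split a media type string into (type, version parameter or None)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_type() lowercases the type and drops the parameters;
    # get_param() returns None when the parameter is absent.
    return msg.get_content_type(), msg.get_param("version")


print(media_type_version("application/javascript; version=1.7"))
print(media_type_version("text/html"))
```

Content that does not declare a version gets `None`, which maps naturally onto the opt-in requirement: no parameter means legacy behaviour.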
Only three other alternatives for switches have come up repeatedly:
application/xhtml+xml and not XHTML expected to be parsed as HTML), which falls down as soon as you look at its backwards compatibility, and the difficulty of using XML in the real world. That said, there's no question in my mind that XHTML needs to trigger "edge" (as it does everywhere else). One thing that I think IE8 needs to avoid having is actually shipping with multiple engines: everything should be doable with switches within the engine (this is already done everywhere else). The canvas element works just as well in quirks mode as it does in standards mode in Gecko and WebKit. There's no reason why this can't be the case in IE8 too: the IE team reverse engineered Netscape better than Netscape could themselves, so there's no reason why they can't do this to themselves, and create a compatible engine.
As I see it, there is only one way for IE to move forwards with its standards support, which is with an opt-in switch. I feel that IE is in an identical position to what it was in when the DOCTYPE switch was introduced: somehow, backwards compatibility must be kept while allowing standards compliant sites to work as intended.
What would be nice for many in the web development community would be for multiple releases of the IE8 betas to be made: one which switches into IE8 Standards Mode on encountering a standards-mode DOCTYPE (i.e., replacing the current standards mode), and one that only switches into it on a new switch. Let people see how widespread site breakage is.
Just over a month ago, there were all kinds of accusations going around about Apple, Microsoft, and Nokia paying to get Ogg/Theora/Vorbis removed from the HTML 5 draft. However, as someone within both the W3C HTML WG and the WHATWG, I find these accusations wholly unbelievable.
Unquestionably the first issue I'll have to clear up is why they would be removed otherwise: without the support of Apple, Microsoft, and Nokia (the former two of which make up a large percentage of the web browser market share) any mandated format for the media elements (audio and video) is pointless. The entire point of mandating a format is to have a common format that can be relied upon (and, as such, it really needs to be a MUST, not a SHOULD): if major companies consider Theora a large financial risk they simply will not implement the specification, and what is a specification worth if it does not create interoperability? Without interoperability, there is no difference from the specification not existing at all.
But why is Theora a large financial risk, and why only for major companies?
"But I read that Theora was patent-free! Surely this is all irrelevant‽" has been asked all too many times — it is known that Theora isn't patent-free (though the known ones held by On2 have got an irrevocable royalty-free license), and it is not known if there are any submarine patents (by their very definition (i.e., patents that aren't published for a long time after they are applied for) they cannot be known about).
If video codecs are patent risks, how come companies continue to add new support for codecs such as H.264? The answer is a simple one: they will add support for new codecs when the new codec is technically superior to what is already supported or there is a significant body of already deployed content that requires support for the new codec. Many people have tried to spin this issue by omitting "is technically superior", arguing that by the companies' own reasoning they could never add any new codecs (the last clause is mainly for legacy codecs, and, as this post later explains, it is highly unlikely to be used to add any new codecs nowadays).
Many have argued that Wikimedia's use of Theora is a significant body of content; however, compared with the number of videos on the web (undoubtedly predominantly pornography) this is wholly insignificant (it is extremely unlikely to be as large as even 0.001% of video on the web).
So, to get support for a codec such as Theora we need people to distribute data in Theora. There are major issues with this: most people will only use codecs that have encoders that are bundled with software, and that are widely supported. We don't get encoders for codecs unless they are technically superior to what is currently available or are widely supported by decoders; and we don't get decoders for codecs unless they are technically superior to what is currently available or there is a significant body of already deployed content that requires support for the new codecs (which doesn't exist until we get support in the encoders which we don't get until we get support in the decoders, ad infinitum). Realistically, this means the only way to get support for a new codec (and not one that is already widely deployed) is by it being technically superior to what is already used.
What can we use as a base video codec, though? This is a very hard question, and not something that can easily be answered. It is, however, clear that Theora is not an option (nor is any MPEG standard, as these all have known non-royalty-free patents). Arguing that Theora should be put back in is pointless — it's not what we want in a specification. We want interoperability. There are a large number of people both within the W3C and outside of it looking for a suitable codec. To quote what the specification currently lists as requirements:
- We need a codec that is known to not require per-unit or per-distributor licensing,
- That is compatible with the open source development model,
- That is of sufficient quality as to be usable,
- And that is not an additional submarine patent risk for large companies.
Any suggestions that meet all four of these requirements would be warmly welcomed by the HTML WG.
However, I have kept Ogg/Vorbis out of the entire body of this post. Neither of these is an issue. Both already have major companies shipping them in bulk (Microsoft, Epic, and Rockstar all have used Ogg/Vorbis in major products). That said, it cannot be said for certain that these can be used with the video codec (some video codecs require certain containers or audio codecs). Why remove these, though? It makes little sense to define what MUST be used for some of the media elements but not for others.