gsnedders

Conscious stream of thought at 4am. The quality of the English reflects that.

Sixteen

Tags: April 20, 2008 (6 comments)

Today marks my sixteenth birthday: the age of majority in Scotland. I can now have sex and get married, which, seeing as I've never posted anything tagged with lust, will be no issue, and will happen in no time.

2008 F1 Predictions

Tags: March 14, 2008 (0 comments)

With the start of the 2008 F1 season just hours away, time to post some predictions:

  • Kimi Räikkönen will win a second world drivers' championship, with Felipe Massa second, Heikki Kovalainen third, and Lewis Hamilton fourth.
  • Sébastien Bourdais will be the most successful rookie.
  • Ferrari will win the constructors' championship, with McLaren second, and BMW third.
  • Ferrari and McLaren will continue to absolutely dominate, and will be massively clear of third place in the constructors' championship.
  • Despite Ferrari and McLaren being comparatively close, Ferrari will still manage to build a fair lead over McLaren and win the constructors' championship by, at the very latest, the third-last race (at Fuji).
  • BMW, despite finishing third in the constructors' championship, will fail to win a race.
  • The naïveté of Jenson Button buying himself out of his Williams contract will continue to show, with Williams outdoing Honda again.

Making Habari Use XHTML

Tags: March 13, 2008 (2 comments)

Disclaimer: this post does not in any shape or form recommend using XHTML served as text/html. When I use "XHTML", I am referring to content that identifies itself as XHTML through its MIME type (i.e., application/xhtml+xml); likewise, when I use "HTML", I am referring to content that identifies itself as HTML through its MIME type (i.e., text/html). What DOCTYPE you use is entirely irrelevant.

Currently, Habari uses HTML 4.01 Strict, as ensuring content is well-formed XML is hard, especially once you take any user input. This poses all kinds of issues to any product that wants to ensure its output is well-formed XML, especially if you don't want to outright reject any input that isn't well-formed XML (by throwing all input through an XML parser on the server): very, very, very few comments are well-formed, and posts often aren't either. Any product that does output XML should do so without ever showing an error to the user. The issues with making Habari output XML are as follows:

  • Everything is dealt with as (binary) UTF-8 strings internally, as PHP lacks any real Unicode support, and as such it becomes hard to ensure you don't use characters that cause well-formedness errors in XML (e.g., the surrogate blocks, U+FFFE, and U+FFFF). Fixing this will require implementing Unicode in PHP. Here's a start.
  • The next major issue is ensuring tag names only contain valid characters, that all tags are opened and closed in order, that all attribute names and values only contain valid characters, and that no attributes occur more than once on a single tag. Due to the craziness of HTML, you really need a full HTML parser to achieve this, and then serialise a DOM you get from that into XML. The only way to achieve this without spending hundreds of hours yourself reverse-engineering IE is to follow HTML 5. This will still result in some content that cannot be serialised as XML (e.g., <span fo"o=f""a>).
  • We also have to make sure all character references are legal, all entities are defined, and there are no lone ampersands that aren't part of a reference. The first of these is the easiest: we can just read the character reference (e.g., &#x26;) and check if it refers to a legal character. The second is harder: a non-validating XML processor need not read the external subset (e.g., the DTD a DOCTYPE points to isn't read, so XHTML's entities aren't known), which can lead to fatal errors due to unknown entities. As a result, we have to parse the document and convert such entities either to the actual character they represent or to a character reference: there are only five entities that exist in all XML documents: &lt;, &gt;, &amp;, &apos;, and &quot;. Lone ampersands are dealt with within the HTML parser as an unknown entity. This again requires an HTML parser.
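As a rough illustration of the character and reference checks described above, here is a minimal sketch in Python (not PHP, and not Habari code). It assumes the input is a plain run of text rather than full markup, so it is no substitute for the HTML parser the list calls for. The names is_xml_char and fix_references are hypothetical, and replacing an illegal character reference with U+FFFD is my own choice rather than anything mandated.

```python
import re
from html.entities import name2codepoint  # HTML 4 named entities

# The five entities that are predefined in every XML document.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF       # excludes the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD     # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


# An ampersand, optionally followed by a complete reference body and ";".
_AMP = re.compile(r"&(?:(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);)?")


def fix_references(text: str) -> str:
    """Rewrite references in text so a non-validating XML parser accepts them."""

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body is None:
            return "&amp;"  # lone ampersand
        if body.startswith("#"):
            # Numeric character reference, hexadecimal or decimal.
            cp = int(body[2:], 16) if body[1] in "xX" else int(body[1:])
            # Assumption: illegal references become U+FFFD rather than an error.
            return f"&#x{cp:X};" if is_xml_char(cp) else "\uFFFD"
        if body in XML_PREDEFINED:
            return f"&{body};"
        cp = name2codepoint.get(body)
        if cp is None:
            return f"&amp;{body};"  # unknown entity: escape the ampersand
        return f"&#x{cp:X};"  # known HTML entity: use a character reference

    return _AMP.sub(replace, text)
```

For example, fix_references("Fish &nbsp;& chips") turns the HTML-only entity into &#xA0; and the lone ampersand into &amp;. Ampersands inside comments, CDATA sections, or script content would need the context only a real parser has.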

Now, what makes any of this difficult? A lot of things don't specify their character set, so we have to guess it; we must also make sure our output has a specified encoding and only contains characters in that encoding. Oh, and then there's the sheer amount of code needed for a Unicode and HTML 5 implementation in PHP.

However, Habari would gain a lot more than just being able to output the content of the blog as XHTML: an HTML parser would allow Pingback 1.0 to be properly implemented; Unicode support would also allow strings to be properly shortened (currently we just cut the string off after n bytes and hope the Unicode decoder isn't one that throws a fatal error on an invalid sequence, but instead replaces it with something like � (U+FFFD)).
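To show what shortening a string properly could look like at the byte level, here is a minimal Python sketch (again, not Habari's code). It assumes the input is already valid UTF-8, and truncate_utf8 is a hypothetical name. The idea is to back up from the cut point until it no longer lands on a continuation byte, so no multi-byte sequence is split.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a UTF-8 sequence.

    Assumes data is valid UTF-8 and max_bytes is not negative.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first byte we
    # would drop is one, the cut falls inside a character, so move back.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]
```

With "naïve" encoded as UTF-8 (six bytes), a limit of three bytes would fall in the middle of "ï", so the function returns just the two bytes of "na" instead of a broken sequence. This works on code points only; it makes no attempt to keep combining sequences together.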

I would quite like to see Habari store everything as XML internally, as the amount of stuff it would make impossible to store is minimal (and what it couldn't store is rare non-conforming HTML). It would also allow us to make use of this eternal cached_content column in the posts table at long last! It would also allow Habari to internally store namespaced content, and make it possible if so desired to publish a blog as XHTML (allowing that namespaced content to be visible).

Neither WordPress nor Movable Type makes it possible to output XHTML by default. Both need extensive modification to be able to safely output XHTML. It is often still easy to break such modified installs. I'd love to see Habari being the first major blogging platform capable of outputting XHTML out of the box, and I fully intend to help get it there. I hope to take on the challenge as my Advanced Higher Computing project: with so little done (i.e., just the basis of encoding/decoding Unicode strings), there is still more than enough to do to fill up a forty-hour project.

The Need For A Switch in IE8

Tags: February 22, 2008 (0 comments)

Over the past month, since the IE team announced its intention to have an opt-in into IE8 Standards Mode, both they and Microsoft as a whole have had a huge amount of anger directed at them.

Taking a quick look through the comments on the IE blog again, I believe a lot of this anger is quite irrational: a lot of what is being said would make sense if standards-compliant sites were the majority of sites, but they aren't: they are a tiny minority (i.e., a single-figure percentage at most). Breaking almost every non-compliant website would cause huge issues: someone without a technical background will likely conclude that IE8 breaks sites that work fine in IE7, therefore IE8 is broken. The web is based around the fact that anyone can create a website, and raising the barrier to entry by breaking existing sites would be a good way to make sure your product isn't used. The end user couldn't give a damn about whether IE supports HTML 4.01 or CSS 2.0 (the current recommendations); they care that their favourite site works. If it doesn't, they complain to Microsoft. Quite frankly, standards be damned. Conformance requirements are wholly irrelevant when the real world relies on behaviour that is at odds with what is in the standard: if reality is so reliant on UAs doing something different to what is specified, the specification needs to be fixed, not the other way around.

The Age Of Opt-In Switches

Switches aren't something new, neither is opting in (the DOCTYPE switch is exactly that, an opt-in into standards mode), nor is a switch first used by Microsoft (IE5 for Mac was the first browser to ship with DOCTYPE switching, in 2000; it was hinted at on the Mozilla mailing lists back in 1998, but nothing happened about it in Mozilla until after IE5 for Mac shipped). Just using a W3C DOCTYPE isn't enough to get standards mode: transitional and frameset DOCTYPEs can and do cause quirks mode to be used. It requires a subset of what is conformant to trigger standards mode. Likewise, requiring a meta element is requiring a subset of HTML (albeit this time requiring a specific element/attribute combination instead of requiring a specific DOCTYPE public/system identifier combination). Yet DOCTYPE switching is accepted almost universally within standards communities, while a proposed new switch is not.

Moving Forward From A Hiatus

Now what actually causes the issue that necessitates a new switch? With development of IE having been stopped after the release of IE6, a large number of sites have grown up that are dependent on IE being broken forever, often at the same time as supporting other browsers fine. What these sites depend upon varies from relying on the DOM not being a tree to using conditional comments that don't specify a version. Netscape won the scripting wars (with JavaScript) and Microsoft won the browser wars (with IE). With the wars being won by different products, scripting is incompatible. There were scripting engines far less compatible than JScript; it's just that they died off.

Why is this an issue? IE can't overnight move from JScript to ECMAScript (a standard which is based upon JavaScript) without breaking a large number of sites which either won't be updated at all or won't be updated in a timely manner. And it's not just small non-compliant sites that have issues: major scripting products like Prototype have browser-specific hacks all over the place (and not just for IE, but other browsers too, including the standards community's beloved Firefox), which will break horribly with any major scripting changes. Just take a look at some of the users of Prototype to get an idea of how many major sites could easily break with a new version of IE. It's not that these major sites won't be updated ever, but most corporations aren't using the latest versions of everything on their website, even after a new browser release. If you were in the IE team's position, would you really want to break so many major sites? It's commercial suicide. The pages that won't be changed are split into two groups: those that physically cannot be changed (e.g., HTML/CHM help files on CDs), and those that are unmaintained. Content that causes issues isn't found only on intranets.

There's no point in blaming the IE team for getting the web into the situation that it's in: a large number of issues predate Trident, back when IE was based on Mosaic, and other issues are a result of multiple things, not least things being invented during the browser wars (DOM and CSS most notably) and browser vendors trying to extend them to outdo the opposition (Netscape is just as guilty of this), but also the vagueness of how they are specified (or, in some cases, the total lack of specification).

As the issue is with pages that specifically hack for IE (and those hacks now cause issues), this doesn't mean other browsers will need to spend their time reverse-engineering even more modes — the issue is mostly with sites that rely on IE's own quirky behaviour while at the same time relying on other browsers working fine. It is a non-issue for everyone apart from Microsoft.

Alternatives For A Switch

That all said, the proposed switch, X-UA-Compatible, is, in my opinion, a horrible solution to the problem at hand. We need another switch that behaves like the DOCTYPE switch (i.e., it doesn't limit you to a single browser and a single version of that browser). I admit, I'm not sure how to implement this (though I think a version attribute on the MIME type, like Gecko's JavaScript 1.7 solution, is a good basis for ideas). It will need to be opt-in, like the DOCTYPE switch, without question, and also not conflict with Gecko's JavaScript version switch. Let me wave a banner saying I use CSS 2.1 and ECMA-262 3rd edition (the latest version of ECMAScript), and I'll be happy.
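To make the objection concrete, here is a small Python sketch of how a value for the proposed switch could be read. It assumes the semicolon-separated name=version form used in examples of the proposal (e.g., "IE=8;FF=3"); parse_ua_compatible is a hypothetical helper, not anything a browser exposes. The result shows the problem: each entry names one browser and one version of it.

```python
def parse_ua_compatible(value: str) -> dict[str, str]:
    """Split an X-UA-Compatible style value into {browser token: version}.

    Assumes entries are separated by ";" and written as name=version;
    anything that does not fit that shape is skipped.
    """
    targets = {}
    for part in value.split(";"):
        name, sep, version = part.partition("=")
        name = name.strip()
        if sep and name:
            # Browser tokens are upper-cased so "ie=8" and "IE=8" agree.
            targets[name.upper()] = version.strip()
    return targets
```

A page declaring "IE=8" says nothing useful to any other engine, which is exactly what a DOCTYPE-style switch avoids.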

Only three other alternatives for switches have come up repeatedly:

Rebranding IE
While a quick glance at this makes it seem like a good option, looking deeper reveals the huge flaws: people equate IE with the web, and a huge number of sites sniff the UA (try using a minor browser like Konqueror to find out quite how large an issue this is).
Date targeting
This is specifying the date at which a page was authored. I can assure you that there are plenty of pages authored in the eight years since HTML 4.01 was published as a recommendation that don't use standards mode. It also assumes that the author has access to the latest version of the browser (IE7 isn't available on Windows 2000, which is still widely used, for example).
XHTML
The final thing that has come up repeatedly is using XHTML as a switch (and by XHTML, I mean actual XHTML served as application/xhtml+xml and not XHTML expected to be parsed as HTML), which falls down as soon as you look at its backwards compatibility, and the difficulty of using XML in the real world. That said, there's no question in my mind that XHTML needs to trigger "edge" (as it does everywhere else).

Multiple Engines

One thing that I think IE8 needs to avoid is actually shipping with multiple engines: everything should be doable with switches within the engine (this is already done everywhere else). The canvas element works just as well in quirks mode as it does in standards mode in Gecko and WebKit. There's no reason why this can't be the case in IE8 too: the IE team reverse engineered Netscape better than Netscape could themselves, so there's no reason why they can't do this to themselves, and create a compatible engine.

Dealing With The Future

As I see it, there is only one way for IE to move forwards with its standards support, which is with an opt-in switch. I feel that IE is in an identical position to what it was in when the DOCTYPE switch was introduced: somehow, backwards compatibility must be kept while allowing standards compliant sites to work as intended.

What would be nice for many in the web development community would be for multiple releases of the IE8 betas to be made: one which switches into IE8 Standards Mode on encountering a standards-mode DOCTYPE (i.e., replacing the current standards mode), and one that only switches into it on a new switch. Let people see how widespread site breakage is.

On Interoperability, HTML 5, and Ogg

Tags: January 23, 2008 (0 comments)

Just over a month ago, there were all kinds of accusations going around about Apple, Microsoft, and Nokia paying to get Ogg/Theora/Vorbis removed from the HTML 5 draft. However, as someone within both the W3C HTML WG and the WHATWG, I find these accusations wholly unbelievable.

Unquestionably the first issue I'll have to clear up is why they would be removed otherwise: without the support of Apple, Microsoft, and Nokia (the first two of which account for a large percentage of web browser market share), any mandated format for the media elements (audio and video) is pointless. The entire point of mandating a format is to have a common format that can be relied upon (and, as such, it really needs to be a MUST, not a SHOULD): if major companies consider Theora a large financial risk, they simply will not implement the specification, and what is a specification worth if it does not create interoperability? Without interoperability, there is no difference from the specification not existing at all.

But why is Theora a large financial risk, and why only for major companies?

  • Video codecs are known to be a hotspot for submarine patents (and no matter how complete a patent search you do, you can never be certain there are not unpublished patents covering a format).
  • The financial risk is huge: when submarine patents do arise, the damages tend to be in the hundreds of millions (if not billions) of US dollars.
  • Anybody who holds a submarine patent will wait a while before pouncing, and even then only on a major company (for a small company won't have shipped enough products in breach of the patent, nor have enough money, to be worth suing).

"But I read that Theora was patent-free! Surely this is all irrelevant‽" has been asked all too many times — it is known that Theora isn't patent-free (though the known ones held by On2 have got an irrevocable royalty-free license), and it is not known if there are any submarine patents (by their very definition (i.e., patents that aren't published for a long time after they are applied for) they cannot be known about).

If video codecs are patent risks, how come companies continue to add new support for codecs such as H.264? The answer is a simple one: they will add support for new codecs when the new codec is technically superior to what is already supported or there is a significant body of already deployed content that requires support for the new codec. Many people have tried to spin this issue by omitting "is technically superior", so as to argue that these companies could, by their own reasoning, never add any new codecs (the last clause is mainly for legacy codecs and, as this post later explains, is highly unlikely to be used to add any new codecs nowadays).

Many have argued that Wikimedia's use of Theora is a significant body of content; however, compared with the number of videos on the web (undoubtedly predominantly pornography), this is wholly insignificant (it is extremely unlikely to be as large as even 0.001% of video on the web).

So, to get support for a codec such as Theora we need people to distribute data in Theora. There are major issues with this: most people will only use codecs that have encoders that are bundled with software, and that are widely supported. We don't get encoders for codecs unless they are technically superior to what is currently available or are widely supported by decoders; and we don't get decoders for codecs unless they are technically superior to what is currently available or there is a significant body of already deployed content that requires support for the new codecs (which doesn't exist until we get support in the encoders which we don't get until we get support in the decoders, ad infinitum). Realistically, this means the only way to get support for a new codec (and not one that is already widely deployed) is by it being technically superior to what is already used.

What can we use as a base video codec, though? This is a very hard question, and not something that can easily be answered. It is, however, clear that Theora is not an option (nor is any MPEG standard, as these all have known non-royalty-free patents). Arguing that Theora should be put back in is pointless — it's not what we want in a specification. We want interoperability. There are a large number of people both within the W3C and outside of it looking for a suitable codec. To quote what the specification currently lists as requirements:

  • We need a codec that is known to not require per-unit or per-distributor licensing,
  • That is compatible with the open source development model,
  • That is of sufficient quality as to be usable,
  • And that is not an additional submarine patent risk for large companies.

Any suggestions that meet all four of these requirements would be warmly welcomed by the HTML WG.

However, I have kept Ogg/Vorbis out of the entire body of this post. Neither of these is an issue. Both already have major companies shipping them in bulk (Microsoft, Epic, and Rockstar all have used Ogg/Vorbis in major products). That said, it cannot be said for certain that these can be used with the video codec (some video codecs require certain containers or audio codecs). Why remove these, though? It makes little sense to define what MUST be used for some of the media elements but not for others.
