In my last post I explained what
Web specification authors should do with respect to error handling, in my opinion.
What about Web browser implementors and XML?
There is a lot of debate in Web log comments, mailing lists, and forums, about what Web browsers should do. There really shouldn't be. The spec is very clear about this.
As a Web browser implementor, you have two options:
1. The moment you hit a well-formedness error, replace the entire page with an error message.
2. The moment you hit a well-formedness error, discard everything from that point on, and display what you got so far, with an error message.
There is no other option. If your browser does anything else, it is violating the specifications. If people want XML to do something else, then they should invent their own language with its own rules — but if you use XML, the above two scenarios are the only allowable scenarios.
(Actually there are a couple of other options, but really they are just variants of the above. First, you are allowed to report more than one error, so long as you don't do anything with any of the content after the first error except for reporting errors. Second, you are also allowed to complain about more than just well-formedness errors: if you are a validating parser you can check for validity as well. For performance and other reasons, these options are never really workable for Web browsers.)
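To make the two options concrete, here is a minimal sketch using Python's standard `xml.parsers.expat` module. The function names (`parse_until_error`, `render_option_one`, `render_option_two`) are made up for illustration, and "rendering" here just means collecting character data. A real browser does far more, but the error handling has the same shape.

```python
import xml.parsers.expat


def parse_until_error(source):
    """Parse `source`, returning (text seen before any error, error or None)."""
    collected = []
    parser = xml.parsers.expat.ParserCreate()
    # Collect character data as the parser reaches it.
    parser.CharacterDataHandler = collected.append
    error = None
    try:
        parser.Parse(source, True)  # True: this is the final chunk of input
    except xml.parsers.expat.ExpatError as err:
        # Parsing stops at the first well-formedness error; nothing after
        # that point is ever handed to the handlers.
        error = err
    return "".join(collected), error


def render_option_one(source):
    """Option 1: replace the entire page with an error message."""
    text, error = parse_until_error(source)
    if error is not None:
        return f"XML error: {error}"
    return text


def render_option_two(source):
    """Option 2: show what was parsed before the error, plus an error message."""
    text, error = parse_until_error(source)
    if error is not None:
        return f"{text}\n[XML error: {error}]"
    return text


broken = "<p>Hello <b>world</p>"  # <b> is never closed
print(render_option_one(broken))
print(render_option_two(broken))
```

In both functions, nothing after the error is processed. The only difference is whether the content seen before the error is kept.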
I've been following the recent
that Web browsers stop processing upon hitting an error (as it does)
or whether it should have let Web browsers recover from errors in
vendor-specific ways (like HTML does) with some amusement, because
asking the question in this yes/no form misses the point:
There is a third, better option.
Since a lot of people don't really understand the problem here, I'm
going to give some background.
What's the point of a specification? It is to
ensure interoperability, so that authors get the same results on every
product that supports the technology.
Why would we ever have to worry about document
errors? Murphy said it best:
If there are two or more ways to do something, and one of those
ways can result in a catastrophe, then someone will do it.
Authors will write invalid documents. This is something that most
Web developers, especially developers who understand the specs well
enough to understand what makes a document invalid, do not really
understand. Ask someone who does HTML/CSS quality assurance (QA) for a
Web browser, or who has written code for a browser's layout engine.
They'll go on at length about the insanities that they have seen, but
the short version is that pretty much any random stream of characters
has been written by someone somewhere and been labelled as HTML.
Why is this a problem? Because Tim Berners-Lee, and
later Dan Connolly, when they wrote the original specs for HTML and
HTTP, did not specify what should happen with invalid documents. This
wasn't a problem for the first five or so years of the Web.
At the start, there was no really dominant browser, so browsers
presumably just implemented the specs and left the error handling to
chance or convenience of the implementor. After a few years, though,
when the Web started taking off, Netscape's browser soared to a
dominant position. The result was that Web authors all pretty much
wrote their documents using Netscape. Still no problem really though:
Netscape's engineers didn't need to spend much time on error handling,
so long as they didn't change it much between releases.
Then, around the mid-nineties, Microsoft entered the scene. In
order to get users, they had to make sure that their browser rendered
all the Web pages in the World Wide Web. Unfortunately, at this point,
it became obvious that a large number of pages (almost all of them in
fact) relied in some way on the way Netscape handled errors.
Why did pages depend on Netscape's error handling?
Because Web developers changed their page until it looked right in
Netscape, with absolutely no concern for whether the page was
technically correct or not. I did this myself, back when I made my
first few sites. I remember reading about HTML4 shortly after that
became a W3C Recommendation and being shocked at my ignorance.
So, Microsoft reverse engineered Netscape's error handling.
They did a ridiculously good job of it. The sheer scale of
this feat is awe-inspiring. Internet Explorer reproduces
aspects of Netscape's error handling which nobody at Netscape ever
knew existed. Think about this for a minute.
Shortly after, Microsoft's browser became dominant and Netscape's
browser was reduced to a minority market share. Other browsers entered
the scene; Opera, Mozilla (the rewrite of the Netscape codebase), and
Konqueror (later to be used as the base for Safari) come to mind, as
they are still in active development. And in order to be usable, these
browsers have to make sure they render their pages just like Internet
Explorer, which means handling the errors in the same way.
Browser developers and layout engine QA engineers spend probably
more than half their total work hours debugging invalid markup trying
to work out what obscure aspect of the de facto error handling rules
is being used to obtain the desired rendering. More than half!
It's easy to see why Web browser developers tend to be of the
opinion that for future specifications, instead of having to reverse
engineer the error handling behaviour of whatever browser happens to
be the majority browser, errors should just cause the browser to abort.
Summary of the argument so far: Authors will write
invalid content regardless. If the specification doesn't say what
should happen, then once there is a dominant browser, its error
handling (whether intentionally designed or just a side-effect of the
implementation) will become the de facto standard. At this point,
there is no going back: any new product that wants to interoperate has
to support those rules.
So what is the better solution? Specifications
should explicitly state what the error recovery rules are. They should
state what the authors must not do, and then tell implementors what
they must do when an author does it anyway.
This is what CSS1 did, to a large extent (although it still leaves
much undefined, and I've been trying to make the rules for handling
those errors clearer in CSS2.1 and CSS3). This is what my Web
Forms 2.0 proposal does. Specifications should ensure that
compliant implementations interoperate, whether the content is
valid or not.
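As a rough illustration of what "explicitly state the error recovery rules" means for an implementor, here is a simplified sketch loosely modelled on the CSS approach of ignoring declarations the parser doesn't understand. This is not a real CSS parser: the `KNOWN_PROPERTIES` set and the `parse_declarations` function are hypothetical, and the actual CSS rules are considerably more detailed.

```python
# Hypothetical subset of supported properties, for illustration only.
KNOWN_PROPERTIES = {"color", "margin", "font-size"}


def parse_declarations(block):
    """Parse 'name: value; name: value' text, dropping invalid declarations.

    The recovery rule is fixed in advance: a declaration with no colon,
    an unknown property name, or an empty value is ignored, and parsing
    continues with the next declaration.
    """
    declarations = {}
    for chunk in block.split(";"):
        name, separator, value = chunk.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not separator or name not in KNOWN_PROPERTIES or not value:
            continue  # defined recovery: skip this declaration, keep going
        declarations[name] = value
    return declarations


print(parse_declarations("color: red; colour: blue; margin; font-size: 12px"))
```

Because the rule for bad input is written down, two independent implementations of it give the same result for the same invalid content, which is the interoperability the specification is after.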
Note that all this is moot if you use XML 1.x, because XML
specifies that well-formedness errors should be fatal. So if you don't
want to have this behaviour in your language, don't use XML.
About 11 months ago, I mentioned
that the W3C had so far failed to address a need in the Web community:
There is no language for Web applications. There is a language for
hypertext documents (HTML), there is a language for vector graphic
images (SVG), there is a vocabulary for embedding Math into both of
those (MathML), and there are lots of support technologies (DOM,
ECMAScript, CSS, SMIL)... But there is no language designed for
writing applications, like Voidwars (a game) or Bugzilla (an issue tracking
system) or for that matter the Mozillazine Forums or eBay auctions. What is needed is one
(or maybe more) markup languages specifically designed to allow the
semantics of sites like the above to be marked up, thus allowing for
improvements in the accessibility of such sites.
It's been nearly a year since I first mentioned this, and the only
group that seems to have done anything about this is Microsoft, with
their worryingly comprehensive set of proprietary technologies
(Avalon, XAML, WVG, etc) that appear designed to ensure vendor lock-in.
I intend to do something about this (hopefully within a W3C
context, although that will depend on the politics of the situation).
If you write Web-based applications, I would be interested in hearing
about what your needs are. Please let me know: email@example.com
Maybe I've missed something. I don't know. Or maybe this is a joke. I just got a spam with the subject line "This letter can only define Nigeria Scam, a.k.a 419", which starts off explaining what 419 spam is, saying that much of Nigeria's government is corrupt, and so forth. Fair enough, I thought (curious as to the goal of a spam that explained some of the story behind 419 fraud, even if this wasn't even close to an accurate explanation). Maybe this is ironic educational spam from some well-meaning, although confused, spam fighter.
Then I read paragraph 5:
The point I am making is nothing more than asking you to handle a pure deal of approximately USD$50,000,000.00, which will take approximately two weeks to conclude from
here. Then the funds clear in any account of yours after 72hrs upon the remittance.
What? I'm confused. I thought you just said this was a scam?
Maybe they are trying to raise the bar, so that only very gullible people fall for these scams?
On my way to the office (which is the staging point for my mission to today's primary objective, central Oslo) I passed an old lady who appeared to be muttering to herself, and it struck me: I can no longer tell the difference between insane people, and people on hands-free mobile phones. Literally. I have no idea if she was on the phone or not. And she definitely wasn't speaking to anyone physically near her.
Later today, Tim will be arriving for a few months. I haven't seen him since August. Hopefully he'll be encouraging me to get to work slightly, ah, earlier, than I have been.
Last night I finished reading a series of seven books by Robert Doherty which I started over the new year. I bought Area 51 around the 23rd of December, finished it that day or the next, spent a few days itching to buy Area 51: The Reply, which I finally did around the 26th, along with Area 51: The Mission. I then spent about 2 days reading and about 8 days itching to buy Area 51: The Sphinx, which I finally did on Monday (the 5th), along with Area 51: The Grail, Area 51: Excalibur, and Area 51: The Truth. There appear to be no real analysis sites on the Web for this series, which surprises me. (Is The Lurker's Guide an anomaly, or what? I made my entry into science fiction fandom with Babylon 5, which, at the aforementioned site, has incredibly detailed analysis of every scene of every show, cross-referenced across episodes with detailed plot descriptions, directors' comments, and so forth. Did other series not cause that kind of response? Even Stargate SG-1 doesn't really seem to have that kind of detailed analysis. Although, having tried writing one for some episodes myself, I can understand that, I guess. Good analysis is long, hard work.)
Turns out there is another book, Area 51: Nosferatu, now available, with yet another (Legend (Area 51)) coming in "March" (quotation marks because I've become rather familiar with projected publication dates what with my involvement with software development, specification editing, and book proof-reading). I also noticed, while buying those books, that one of my favourite authors, Peter F. Hamilton (author of the simply stunning Night's Dawn trilogy) has some more books on sale now.
However, no more books for me for at least a week. Reading does terrible things to my productivity. I have an addictive personality and very little self-control (which is why I don't drink) so when I start reading, I have to finish, even if it is past 5am. Not something I want to keep doing for extended periods of time, really.
I'd better be off now, my exfil window is closing.