Hixie's Natural Log

2006-01-20 07:03 UTC Tag Soup: Crazy parsing adventures

My current main project at work is writing the HTML5 Parser Specification, which is a specification of how browsers should parse HTML, including all the fancy error handling logic that up to now has been left officially undefined.

Of course, although it has been left undefined as far as the specifications go, any browser that wants to even attempt to render any useful portion of the Web quickly finds itself reverse-engineering all the other browsers in an attempt to render the existing content, because, to a rough approximation, all the content on the Web is erroneous, invalid, or non-conformant.

This reverse-engineering has been imperfect, though, since it's not exactly easy to determine how another browser works when you have an infinite range of possible inputs.

I've looked at how browsers handle tag soup before, but now that I'm writing a spec for this, I've run into a host of other issues, and I finally just wrote a tool to see how browsers parsed HTML. It shows your input, the DOM view, the rendered HTML view, and the results of the DOM Level 0 innerHTML attribute, as well as the compatMode state and what the browser thinks is the document title. It's a useful tool, because it lets you rapidly check theories, instead of having to go through the normal edit-save-reload-examine cycle.

I've found some interesting things I'd like to show you.

To explain the notation I'm going to be using, let's start with something simple:

<!DOCTYPE HTML><title>Hello World</title><p title="example">Some text.</p><!-- A comment. -->

In a compliant browser, the DOM tree would be as follows:

DOCTYPE: HTML
HTML
- HEAD
  - TITLE
    - #text: Hello World
- BODY
  - P title="example"
    - #text: Some text.
  - #comment: A comment.

Hopefully this should give you a good idea of the various kinds of nodes you'll see in a DOM tree — DOCTYPEs, elements, text nodes, attributes, and comments.
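As a rough sketch of how this notation is laid out (this is not the tool itself), here is a bit of Python that prints a hand-built tree in the same format. The Node class and the dump function are names made up for this illustration, and the tree is typed in by hand rather than produced by any parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Label as it appears in the dump, e.g. 'HTML', 'P title="example"',
    # '#text: Some text.' or '#comment: A comment.'
    label: str
    children: list = field(default_factory=list)


def dump(node, depth=0):
    """Return the lines for one node and its descendants.

    The root is printed bare; every deeper level gets a '- ' marker,
    indented by two further spaces per level.
    """
    prefix = "" if depth == 0 else "  " * (depth - 1) + "- "
    lines = [prefix + node.label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


# The compliant tree for the "Hello World" example, built by hand.
tree = Node("HTML", [
    Node("HEAD", [Node("TITLE", [Node("#text: Hello World")])]),
    Node("BODY", [
        Node('P title="example"', [Node("#text: Some text.")]),
        Node("#comment: A comment."),
    ]),
])

# The DOCTYPE is shown on its own line ahead of the root element.
output = "\n".join(["DOCTYPE: HTML"] + dump(tree))
print(output)
```

Attributes are simply folded into the element's label here, which is all the notation needs.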

Once you understand the general scheme, consider this: the markup above doesn't look like that in IE! Nor Safari! Nor Opera! In fact, the only browser that gets it right is Firefox!

IE does this:

#comment: CTYPE HT
HTML
- HEAD
  - TITLE
- BODY
  - P title="example"
    - #text: Some text.
  - #comment: A comment.

First, it mangles the DOCTYPE node, instead treating <!DOCTYPE HTML> as <!--CTYPE HT--> (that is, as a comment, cutting the text as if it were a real comment). Then, it adds an empty implied TITLE element, for no apparent reason (the specs require implied HTML, HEAD, and BODY elements, but not an implied TITLE element).

Opera does this:

HTML
- TITLE
  - #text: Hello World
- BODY
  - P TITLE="example"
    - #text: Some text.
  - #comment: A comment.

No implied HEAD element at all! It takes a perfectly compliant HTML document and makes a non-compliant DOM out of it. Oops. This can cause all kinds of problems with CSS and scripting, too.

Not to be outdone, Safari gives us this minimal approach:

HTML
- HEAD
  - TITLE
    - #text: Hello World
- BODY
  - P title="example"
    - #text: Some text.

That's right: no DOCTYPE, no comment. Still, on a positive note, at least what it does have is correct!

Firefox itself is far from perfect, of course. Let's explore some of Firefox's more amusing... features. First, a lone <TEXTAREA> element:

<!DOCTYPE html><textarea>

You would expect to get simply a blank <TEXTAREA> element, but in Firefox, this is parsed as:

DOCTYPE: html
HTML
- HEAD
- BODY
  - TEXTAREA
    - #text: </HTML>

Um. Where did that text node come from?! It turns out that in Firefox, document.close() will always just append </HTML> to whatever was passed to document.write(), whether that makes sense or not. Oops!

How about duplicate attributes? Does the first attribute win, or does the second attribute win?

<!DOCTYPE html><html id="a" id="b"><body id="a" id="b">

Will the first win, or the second? I'll give you a hint. In every other browser, the first one wins. So what do you think? The second, maybe?

DOCTYPE: html
HTML id="b"
- HEAD
- BODY id="a"

If only. It turns out that the first attribute wins, except on the <HTML> tag, where the second attribute wins!
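For comparison, here is a minimal Python sketch of the plain first-wins rule, which is what the other browsers do. It leans on the standard library's html.parser, which is only a tokenizer and hands over duplicate attributes as it finds them. The names first_wins and StartTagCollector are hypothetical, and this is an illustration of the policy, not a description of how any browser implements it.

```python
from html.parser import HTMLParser


def first_wins(attrs):
    """Collapse a list of (name, value) pairs, keeping the first value per name."""
    resolved = {}
    for name, value in attrs:
        # setdefault only stores the value if the name has not been seen yet.
        resolved.setdefault(name, value)
    return resolved


class StartTagCollector(HTMLParser):
    """Record each start tag with its duplicate attributes resolved."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, first_wins(attrs)))


collector = StartTagCollector()
collector.feed('<!DOCTYPE html><html id="a" id="b"><body id="a" id="b">')
collector.close()
print(collector.tags)
```

Applying the same rule to every tag, with no special case for the root element, is the consistent behaviour one would hope for.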

How about bogus elements?

<!DOCTYPE html><html><body><x foo bar>

Result:

DOCTYPE: html
HTML
- HEAD
- BODY
  - X _moz-userdefined="" bar="" foo=""

We'll gloss over the mysterious _moz-userdefined attribute, because it isn't harmful (it's marked with a vendor-specific prefix, and is just used internally so that the browser and editor Mozilla components can recognise real HTML elements from bogus elements). Other than that, it makes sense. But what happens if we remove the (optional) <BODY> tag?

<!DOCTYPE html><html><x foo bar>

It does the same thing, right?

DOCTYPE: html
HTML
- HEAD
- BODY
  - X _moz-userdefined="" bar=""

Yeah, nothing surpr— wait! Where did our foo attribute go?!

Finally let's look at what happens when you play with the <OPTGROUP> tag:

<!DOCTYPE html><div><optgroup>

The DOM is:

DOCTYPE: html
HTML
- HEAD
- BODY
  - DIV

Curious, you might think, the <OPTGROUP> tag didn't get an element, it was just thrown away. Well, that makes sense; the element is useless outside of a <SELECT> element anyway. Let's remove the completely unrelated <DIV> tag...:

<!DOCTYPE html><optgroup>

We should just get the same DOM with the <DIV> element removed, right? Er, nope:

DOCTYPE: html
HTML
- HEAD
- BODY
  - FORM
    - SELECT
      - OPTGROUP

Don't ask me where those elements came from.

Internet Explorer is not immune from these crazy parsing antics either, of course. Take this example:

<!DOCTYPE html><frameset><frameset rows="1">

It gets this DOM:

#comment: CTYPE HT
HTML
- HEAD
  - TITLE
- FRAMESET

The second <FRAMESET> tag gets dropped, but frankly that's not that surprising: around <FRAMESET> elements a lot of things end up dropped. No, what is surprising is that if you change it like this (just adding a comma on the end of the attribute's value):

<!DOCTYPE html><frameset><frameset rows="1,">

The DOM suddenly changes!:

#comment: CTYPE HT
HTML
- HEAD
  - TITLE
- FRAMESET
  - FRAMESET rows="1,"

There's our second <FRAMESET>! Yup, it seems that in IE, the attributes can affect what elements appear in the DOM. Go figure.

Talking about <FRAMESET>s, Safari has an interesting take on this example:

<!DOCTYPE html><body><frameset>

It does this:

HTML
- BODY style="display:none"
- FRAMESET

I don't recall putting a style attribute anywhere! Where did it come from?

Luckily, when the parsers don't agree, it's usually a sign that no pages depend on their behaviour; so, ironically, every time I find the browsers disagreeing on how to parse some HTML, it gives me more leeway to make the spec sane. Who knows, in a few years, we might have all the browsers parsing HTML the same way! That's certainly my aim. It would make Web development a heck of a lot easier and more predictable.

Note: In all these examples I'm using the HTML5 DOCTYPE (<!DOCTYPE HTML>), which triggers standards mode. The HTML5 spec says that if you use another DOCTYPE, UAs can switch to quirks mode, in which case all bets are off. I'm not even going to try to specify quirks mode parsing. Hopefully, by making the DOCTYPE short and memorable, it will encourage authors to use it more.


2006-01-17 23:30 UTC Memory Leaks

Yesterday was mlk day. I think it's great that this country values good programming practices to the extent of having days dedicated to solving programming problems, but it seems that most people took the day off, so I guess it wasn't that successful. Maybe everyone is using garbage collectors these days or something.

Kerz and I are still working hard on our layout. I'd love to have enough room to be able to include something like a typical mid-size yard, but instead we're having to make do with just a few small industry tracks. We recently made a joint purchase of a class 602 and its intermediate cars. I'm not a fan of model passenger trains normally but this baby is seriously awesome. I used to be very skeptical of sound effects in trains. That has all changed now, though. Märklin's sound effects modules are as high-quality as the rest of their workmanship. The 602 sounds like a real train right down to the individual brake squeals. In fact now I really regret missing the chance of buying the Ae 6/6 double set which came out last summer; it had full mfx everything.

Speaking of which, kerz recently informed me that the Central Station actually is a Linux embedded system that you can SSH into. This increases my interest in it substantially.

On another note, I learnt Ancient over the weekend. Now I understand how it is that even Maybourne could read it. I wish Norwegian had been as easy for me to learn...

2005-11-29 04:47 UTC Rome, magic, steps, tanks and circles

For some strange reason, we got the end of last week off. As best I can determine, there was a surplus of turkeys in the turkey farms, and so the entire country had to be called into an emergency session of turkey eating. I've never seen this kind of thing happen before, it was weird. (Actually, come to think of it, when I was interning for Netscape there was a week in November where it seemed I was the only one working — I had assumed that there had just been some mysterious illness, but maybe it was related to this turkey emergency... It could be an annual thing. I'll have to keep an eye out next year, see if it happens again.)

Still, when in Rome...

One thing that sucks about not living in one place all the time (I've lived more than one year in four places so far — Geneva in Switzerland, the South West of the UK, Oslo in Norway, and the bay area) is that my friends are all spread all over the place. While I keep in touch with all of them, mostly thanks to the magic of IRC (and bitlbee, which makes even proprietary IM networks look like IRC), I don't get to see any more than about a quarter of my friends at any one time. And IRC isn't exactly the same as seeing them.

Looks like I'll be in Europe for several weeks at a stretch next year, though, what with the CSS working group F2F, X-Tech, and WWW2006 being back-to-back. ~~March~~ May might be a travel-heavy month. Hopefully I'll be able to see a bunch of friends and family at that time.

On Sunday, Pav, kerz and I set up kerz's yard. It's pretty cool. Kerz and I aren't quite sure what the next step will be, though we're probably going to get some more straight, and maybe some elevation.

On Friday I went to see Pride & Prejudice. I'm afraid to say I've never read the book, and frankly had been avoiding any exposure to it (including the BBC TV series), but I figured it was time to broaden my horizons (and the Harry Potter film was sold out). The Red Dwarf scene with the tank and the lake makes slightly more sense to me now.

It was a good story, much better than I expected, and once I was used to the language it was fun to listen to the way they phrased things. I wish I spoke more like that. It's so eloquent. It also seems that the dating scene has changed quite a bit since that story's time period. (Not that I understood how it really worked in the movie any more than I understand how it's supposed to work today, but that's another story altogether.)

Back at work I'm currently having difficulties working out how to do menus in HTML5. I don't like what's in the spec now. Lachlan made some good suggestions on the list, though. I hate being stumped, because it makes me run around in circles and makes me feel very unproductive (although of course running around in circles thinking through the consequences of possible designs is just as important as speccing out a design in the first place).

2005-11-21 06:40 UTC Rules of Engagement

I got shot.

Several times. It was quite fun, though I have a number of nasty bruises to show for it. I have one on my right shoulder that's a real doozie. It's a pink disc about 20mm across with a dark red dot in the middle and a rim of red dots around the edge. It doesn't hurt quite as much as the one just a few centimetres lower, though, which isn't even visible. Go figure.

We (Elaine, Zoran, myself, and half a dozen other Googlers) must have gotten through about ten thousand paintballs today. I think we bought a total of five crates' worth.

It was my first time getting hit by CO₂-propelled Paint-Filled Balls of Death, but we played in the beginners' fields so the competition wasn't too tough, and I was on the winning team around half the time. I even survived a significant number of those times, and managed to hit The Enemy enough times to make me feel like I knew what I was doing. (My limited experience with Rogue Spear, laser tag, and just plain old magic-heavy LARP probably helped a bit with this.)

The rules we played were very specific that if the Ball of Brightly Coloured Doom did not splatter on contact, then we were to consider that a non-fatal wound and keep playing. However, if one of these Death Balls of Goo did not explode on contact, that meant it bounced instead. First, this is more painful, but second, and more importantly, it causes you to start checking yourself for paint marks to see if it was a fatal hit. (In cases where you're not sure, it can even go as far as getting you to call a ref out — "Paint Check!" — to see if you were hurt or not.) All of which just puts you in more danger of being shot again, of course.

The overall strategy conclusions that I drew were:

Don't bother firing your gun when you are outside its effective range.
Laying down cover fire is quite effective in keeping the enemy pinned down while your own troops move ahead.
Moving ahead in the leap-frog style, with the front team firing cover and the rear team advancing to become the new front team (so that the roles are reversed) works well.
Being more aggressive pays off; defensive tactics might hold the enemy in position for a while, but won't drive them back.
There's really no reason to use team mates as cannon fodder, however tempting it might be.
Being shot hurts.

At one point I was lying flat on my back behind some cover, rolling left and right to shoot at some Enemy Troops who were behind similar cover, when one of these Spheres of Gooey Destruction was shot at me at the exact angle required for it to fly straight up my sleeve, slow down, turn around, and roll straight back out with no damage done. That was somewhat scary.

Next weekend I'm going to do something without abusing my body in the process. Two weekends in a row is quite enough, thanks.

log_e.hixie.ch