Hixie's Natural Log

2006-01-25 06:12 UTC Tag Soup: Blocks-in-inlines

<!DOCTYPE html><em><p>XY</p></em>

What should the DOM look like? The general consensus is that the DOM should look like this:

DOCTYPE: html
HTML
- HEAD
- BODY
  - EM
    - P
      - #text: XY

That is, the p element should be completely inside (that is, a child of) the em element.

No problem so far.

Now consider this markup:

<!DOCTYPE html><em><p>X</em>Y</p>

What should the DOM look like?

This is where things start getting hairy. I've covered a similar case before, so I'll just summarise the results:

Windows Internet Explorer

The DOM is not a tree. The text node for the "Y" is a child of both the p element and the body element. Violates the DOM Core specifications.

Opera

The DOM is a simple tree, the same as for the first case, but the "Y" is not emphasised. Violates the CSS specifications.

Mozilla and Safari

The DOM looks like this:

DOCTYPE: html
HTML
- HEAD
- BODY
  - EM
  - P
    - EM
      - #text: X
    - #text: Y

...which basically means that malformed invalid markup gets handled differently than well-formed invalid markup.

In the past, I would have stopped here, made some wry comment about the insanity that is the Web, and called it a day.

But I'm trying to spec this. Stopping is not an option.

What IE does is insane. What Opera does is also insane. Neither of those options is something that I can put in a specification with a straight face.

This leaves the Mozilla/Safari method.

It's weird, though. If you look at the two examples above, you'll notice that their respective markups start the same — both of them start with this markup:

<!DOCTYPE html><em><p>X

Yet the end result is quite different, with one of the elements (the p) having different parents in the two cases. So when do the browsers decide what to do? They can't be buffering content up and deciding what to do later, since that would break incremental rendering. So what exactly is going on?

Well, let's check. What do Mozilla and Safari do for that truncated piece of markup?

Mozilla

DOCTYPE: html
HTML
- HEAD
- BODY
  - EM
  - P
    - EM
      - #text: X

Safari

HTML
- BODY
  - EM
    - P
      - #text: X

Hrm. They disagree. Mozilla is using the "malformed" version, and Safari is using the "well-formed" version. Why? How do they decide?

Let's look at Safari first, by running a script while the parser is running. First, the simple case:

<!DOCTYPE html>
<em>
 <p>
  XY
  <script>
   var p = document.getElementsByTagName('p')[0];
   p.title = p.parentNode.tagName;
  </script>
 </p>
</em>

Result:

HTML
- BODY
  - EM
    - #text:
    - P title="EM"
      - #text: XY
      - SCRIPT
        - #text: var p = document.getElementsByTagName('p')[0]; p.title = p.parentNode.tagName;
      - #text:
    - #text:

Exactly as we'd expect. The parentNode of the p element as shown in the DOM tree view is the same as shown in the title attribute value, namely, the em element.

Now let's try the bad markup case:

<!DOCTYPE html>
<em>
 <p>
  X
  <script>
   var p = document.getElementsByTagName('p')[0];
   p.title = p.parentNode.tagName;
  </script>
 </em>
 Y
</p>

Result:

HTML
- BODY
  - EM
    - #text:
  - P title="EM"
    - EM
      - #text: X
      - SCRIPT
        - #text: var p = document.getElementsByTagName('p')[0]; p.title = p.parentNode.tagName;
      - #text:
    - #text: Y

Wait, what?

When the embedded script ran, the parent of the p was the em, but when the parser had finished, the DOM had changed, and the parent was no longer the em node!

If we look a little closer:

<!DOCTYPE html>
<em>
 <p>
  X
  <script>
   var p = document.getElementsByTagName('p')[0];
   p.setAttribute('a', p.parentNode.tagName);
  </script>
 </em>
 Y
 <script>
  var p = document.getElementsByTagName('p')[0];
  p.setAttribute('b', p.parentNode.tagName);
 </script>
</p>

...we find:

HTML
- BODY
  - EM
    - #text:
  - P a="EM" b="BODY"
    - EM
      - #text: X
      - SCRIPT
        - #text: var p = document.getElementsByTagName('p')[0]; p.setAttribute('a', p.parentNode.tagName);
      - #text:
    - #text: Y
    - SCRIPT
      - #text: var p = document.getElementsByTagName('p')[0]; p.setAttribute('b', p.parentNode.tagName);
    - #text:

...which is to say, the parent changes halfway through! (Compare the a and b attributes.)

What actually happens is that Safari notices that something bad has happened, and moves the element around in the DOM. After the fact. (If you remove the p element from the DOM in that first script block, then Safari crashes.)
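To make the after-the-fact move concrete, here is a toy model of it in script. This is only a sketch under my own assumptions: the nodes are plain objects rather than real DOM nodes, and createNode, appendChild and reparentBlock are hypothetical helpers written for illustration, not anything taken from Safari.

```javascript
// Minimal stand-in for DOM nodes: a name, a parent pointer and a child list.
function createNode(name) {
  return { name, parentNode: null, childNodes: [] };
}

// Like DOM appendChild: detaches the child from any old parent first.
function appendChild(parent, child) {
  if (child.parentNode) {
    const siblings = child.parentNode.childNodes;
    siblings.splice(siblings.indexOf(child), 1);
  }
  child.parentNode = parent;
  parent.childNodes.push(child);
  return child;
}

// Hypothetical fix-up run when </em> is seen while a block is still open
// inside the em: wrap the block's existing children in a fresh em, then
// append the block to the em's parent and put the wrapper inside it.
function reparentBlock(em, block) {
  const wrapper = createNode(em.name);
  for (const child of [...block.childNodes]) appendChild(wrapper, child);
  appendChild(em.parentNode, block);
  appendChild(block, wrapper);
  return wrapper;
}

const body = createNode('BODY');
const em = appendChild(body, createNode('EM'));
const p = appendChild(em, createNode('P'));
appendChild(p, createNode('#text: X'));

const parentDuringFirstScript = p.parentNode.name;  // "EM"
reparentBlock(em, p);                               // the parser reaches </em>
appendChild(p, createNode('#text: Y'));
const parentDuringSecondScript = p.parentNode.name; // "BODY"
```

The two recorded parents correspond to the a and b attributes in the test above. The whitespace text nodes are left out to keep the model small.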

How about Mozilla? Let's try the same trick. The result:

DOCTYPE: html
HTML
- HEAD
- BODY
  - EM
    - #text:
  - P a="BODY" b="BODY"
    - #text:
    - EM
      - #text: X
      - SCRIPT
        - #text: var p = document.getElementsByTagName('p')[0]; p.setAttribute('a', p.parentNode.tagName);
      - #text:
    - #text: Y
    - SCRIPT
      - #text: var p = document.getElementsByTagName('p')[0]; p.setAttribute('b', p.parentNode.tagName);
    - #text:

It doesn't reparent the node. So what does Mozilla do?

It turns out that Mozilla does a pre-parse of the source, and if a part of it is well-formed, it creates a well-formed tree for it, but if the markup isn't well-formed, or if there are any script blocks, or, for that matter, if the TCP/IP packet boundary happens to fall in the wrong place, or if you write the document out in two document.write()s instead of one, then it'll make the more thorough nesting that handles ill-formed content.
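One way to hunt for this kind of chunk sensitivity is to feed a parser the same markup split at every possible point and compare the results. The sketch below assumes a hypothetical parseChunks function that takes an array of strings and returns a serialised DOM as a string. In a browser it might write each chunk with a separate document.write() call, but that part is left out here and the two parsers shown are toy stand-ins.

```javascript
// Every way of cutting the markup into two non-empty chunks.
function twoChunkSplits(markup) {
  const splits = [];
  for (let i = 1; i < markup.length; i++) {
    splits.push([markup.slice(0, i), markup.slice(i)]);
  }
  return splits;
}

// Returns the first split whose result differs from the single-chunk
// parse, or null if the parser gives the same answer for every split.
// parseChunks is a hypothetical callback: (array of strings) -> string.
function findChunkSensitivity(parseChunks, markup) {
  const reference = parseChunks([markup]);
  for (const chunks of twoChunkSplits(markup)) {
    if (parseChunks(chunks) !== reference) return chunks;
  }
  return null;
}

// Toy stand-ins: one ignores chunk boundaries, the other does not.
const stableParser = (chunks) => chunks.join('').toUpperCase();
const flakyParser = (chunks) => (chunks.length > 1 ? 'thorough' : 'simple');

const stableResult = findChunkSensitivity(stableParser, '<em><p>X');
const flakyResult = findChunkSensitivity(flakyParser, '<em><p>X');
```

Here stableResult is null and flakyResult is the first offending split. This only covers two-chunk splits, which is enough to catch the document.write() case described above.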

Who would have thought that you would find Heisenberg-like quantum effects in an HTML parser? I mean, I knew they were obscure, but this is just taking the biscuit.

The problem is I now have to determine which of these four options to make the other three browsers implement (that is, which do I put in the spec). What do you think is the most likely to be accepted by the others? As a reminder, the options are incestuous elements that can be their own uncles, elements who have secret lives in the rendering engine, elements that change their mind about who their parents are halfway through their childhood, and quantum elements whose parents change depending on whether you observe their birth or not.

The key requirements are probably:

Coherence: scripts that rely on DOM invariants (like the fact that the DOM is a tree) shouldn't go off into infinite loops.
Transparency: we shouldn't have to describe a whole extra section that explains how the CSS rendering engine applies to HTML DOMs; CSS should just work on the real DOM as you would see it from script.
Predictability: it shouldn't depend on, e.g., the protocol or network conditions — every browser should get the same DOM for the same original markup in all situations.
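The coherence requirement can be checked mechanically. The following is a minimal sketch, assuming nodes that expose an iterable childNodes list. It uses plain objects so that a non-tree like the IE case can be built at all, and the exact shape of ieStyleBody is my guess at what "a child of both the p element and the body element" looks like.

```javascript
// True when every node reachable from root is reached exactly once,
// i.e. no node has two parents and there are no cycles.
function isTree(root) {
  const seen = new Set();
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) return false; // shared child or cycle
    seen.add(node);
    for (const child of node.childNodes || []) stack.push(child);
  }
  return true;
}

const node = (name, childNodes = []) => ({ name, childNodes });

// Assumed shape of the IE result: the "Y" text node hangs off two parents.
const sharedY = node('#text: Y');
const ieStyleBody = node('BODY', [
  node('EM', [node('P', [node('#text: X'), sharedY])]),
  sharedY,
]);

// The Mozilla/Safari result for the same markup.
const geckoStyleBody = node('BODY', [
  node('EM'),
  node('P', [node('EM', [node('#text: X')]), node('#text: Y')]),
]);

const ieIsTree = isTree(ieStyleBody);       // false
const geckoIsTree = isTree(geckoStyleBody); // true
```

A script walking ieStyleBody and assuming each node has one parent could visit "Y" twice, which is exactly the sort of surprise the requirement is meant to rule out.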

The least bad option is probably the Safari-style on-the-fly reparenting, I think, but I'm not sure. It's the only one that fits those requirements. Is there a fifth option I'm missing?


2006-01-20 23:32 UTC People who don't realise that they're wrong

January 1999. I'm nineteen, in my first year studying Physics at Bath University. I read an SGML tutorial (maybe this one from 1995). I wrote a testcase. I filed a bug, in which I wrote:

Comment delimiters are "--" while inside tags.

Thus: <!-- in --  -- in --  -- in -->
where "in" shows what is commented.

On the test page quoted, all is explained.
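For readers who haven't met this rule, here is a small sketch of it: inside a `<!` ... `>` declaration, each "--" toggles in or out of a comment. The function name and return shape are my own, and the sketch is simplified (it tolerates whitespace anywhere between comments and gives up on any other declaration content).

```javascript
// Parse an SGML-style comment declaration starting at index `start`.
// Returns { comments, end } or null if the declaration is malformed
// or unterminated under the "--" toggling rule.
function parseSgmlComments(markup, start = 0) {
  if (!markup.startsWith('<!', start)) return null;
  const comments = [];
  let i = start + 2;
  while (i < markup.length) {
    if (markup.startsWith('--', i)) {
      const close = markup.indexOf('--', i + 2);
      if (close === -1) return null;     // comment never closes
      comments.push(markup.slice(i + 2, close));
      i = close + 2;
    } else if (markup[i] === '>') {
      return { comments, end: i + 1 };   // '>' outside a comment ends it
    } else if (/\s/.test(markup[i])) {
      i += 1;                            // whitespace between comments
    } else {
      return null;                       // not handled in this sketch
    }
  }
  return null;
}

const threeComments = parseSgmlComments('<!-- in --  -- in --  -- in -->');
const neverCloses = parseSgmlComments('<!-- -- -->');
```

threeComments.comments holds three " in " strings. neverCloses is null, because the second "--" ends the first comment, the third starts a new one, and the final ">" is then inside a comment. A parser that just scans from "<!--" to "-->" would instead see one complete comment there, which is the kind of disagreement that breaks real pages.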

February 1999. The bug is fixed.

October 1999. The code for the fix is turned on along with the standards-mode HTML parser. Mozilla is now the first "major" browser to support SGML-style comments.

September 2000. The UN Web site breaks because it triggers standards mode but uses incorrect comment syntax. Mozilla drops full SGML comment parsing.

March 2001. Mozilla re-enables its strict comment parsing; evangelism is used to convince the broken sites to fix their markup.

May 2003. Netscape devedge publishes a document on the matter to help the Mozilla evangelists explain this to authors.

July 2003. I open a bug in the Opera bug database to get Opera to implement SGML comment parsing.

January 2004. I file another bug in the Opera bug database, having forgotten about the earlier one, to get Opera to implement SGML comment parsing.

February 2005. Håkon and I write the first draft of the Acid2 test.

March 2005. While giving a workshop on how to create test cases at Opera, I find that http://www.wassada.com/ renders correctly in Mozilla and fails to render in Opera precisely because Mozilla renders comments according to the SGML way and Opera doesn't. Over Håkon's objections, I insist on including a test for the SGML comment syntax in Acid2, citing the Wassada site as proof that we need to get interoperability on the matter. Acid2 is announced.

April 2005. Safari fixes SGML comment parsing as part of their Acid2 work. Hyatt confesses bemusement regarding this feature, joining Håkon in thinking I was wrong to insist we include this part of the test.

June 2005. Konqueror fixes SGML comment parsing as part of their Acid2 work.

October 2005. Opera fixes SGML comment parsing as part of their Acid2 work, after many complaints internally telling me I was wrong to include this part of the test. I point to the Wassada site. They point to the dozens of sites that break because of this change. I point to the fact that they aren't broken in Mozilla. They realise their fix was not quite right, and make things work, but still grumble about it being stupid.

November 2005. Mark writes a long document explaining the SGML comment parsing mode. Håkon proposes removing this part of the test from Acid2. I point out that as long as the specs require this, we don't have a good reason to remove it from the test.

December 2005. Prince implement SGML comment parsing in their efforts to pass Acid2, but privately raise concerns about this parsing requirement.

January 2006. I realise I was wrong.

I've now fixed the spec and fixed the Acid2 test.

I'd like to apologise to everyone whose time I've wasted by insisting on following the specs on this matter for the past seven years. You probably number in the hundreds by now. Sometimes, the spec is wrong, and we just have to fix it. I'm sorry it took me so long to realise that this was the case here.


2006-01-20 07:03 UTC Tag Soup: Crazy parsing adventures

My current main project at work is writing the HTML5 Parser Specification, which is a specification of how browsers should parse HTML, including all the fancy error handling logic that up to now has been left officially undefined.

Of course, although it has been left undefined as far as the specifications go, any browser that wants to even attempt to render any useful portion of the Web quickly finds itself reverse-engineering all the other browsers in an attempt to render the existing content, because, to a rough approximation, all the content on the Web is erroneous, invalid, or non-conformant.

This reverse-engineering has been imperfect, though, since it's not exactly easy to determine how another browser works when you have an infinite range of possible inputs.

I've looked at how browsers handle tag soup before, but now that I'm writing a spec for this, I've run into a host of other issues, and I finally just wrote a tool to see how browsers parsed HTML. It shows your input, the DOM view, the rendered HTML view, and the results of the DOM Level 0 innerHTML attribute, as well as the compatMode state and what the browser thinks is the document title. It's a useful tool, because it lets you rapidly check theories, instead of having to go through the normal edit-save-reload-examine cycle.

I've found some interesting things I'd like to show you.

To explain the notation I'm going to be using, let's start with something simple:

<!DOCTYPE HTML><title>Hello World</title><p title="example">Some text.</p><!-- A comment. -->

In a compliant browser, the DOM tree would be as follows:

DOCTYPE: HTML
HTML
- HEAD
  - TITLE
    - #text: Hello World
- BODY
  - P title="example"
    - #text: Some text.
  - #comment: A comment.

Hopefully this should give you a good idea of the various kinds of nodes you'll see in a DOM tree — DOCTYPEs, elements, text nodes, attributes, and comments.
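If you want to produce dumps in this notation yourself, something along these lines would do it. This is a sketch rather than the code of my tool: dumpTree is a name made up for illustration, and it only relies on nodeType, nodeName, nodeValue, name, attributes and childNodes, so it should work on real DOM nodes as well as on the plain stand-in objects used here.

```javascript
// Node type constants as defined by DOM Core.
const ELEMENT_NODE = 1, TEXT_NODE = 3, COMMENT_NODE = 8;
const DOCUMENT_NODE = 9, DOCUMENT_TYPE_NODE = 10;

// Returns an array of lines, one per node, indented by depth.
function dumpTree(node, depth = 0, lines = []) {
  const prefix = depth === 0 ? '' : '  '.repeat(depth - 1) + '- ';
  let childDepth = depth + 1;
  switch (node.nodeType) {
    case DOCUMENT_NODE:
      childDepth = depth; // the document itself prints nothing
      break;
    case DOCUMENT_TYPE_NODE:
      lines.push(prefix + 'DOCTYPE: ' + node.name);
      break;
    case ELEMENT_NODE: {
      const attrs = Array.from(node.attributes || [],
        (a) => ` ${a.name}="${a.value}"`).join('');
      lines.push(prefix + node.nodeName + attrs);
      break;
    }
    case TEXT_NODE:
      lines.push(prefix + '#text: ' + node.nodeValue);
      break;
    case COMMENT_NODE:
      lines.push(prefix + '#comment: ' + node.nodeValue);
      break;
  }
  for (const child of Array.from(node.childNodes || [])) {
    dumpTree(child, childDepth, lines);
  }
  return lines;
}

// Stand-in objects for the example document above.
const el = (nodeName, attributes, childNodes) =>
  ({ nodeType: ELEMENT_NODE, nodeName, attributes, childNodes });
const text = (nodeValue) => ({ nodeType: TEXT_NODE, nodeValue });
const exampleDocument = { nodeType: DOCUMENT_NODE, childNodes: [
  { nodeType: DOCUMENT_TYPE_NODE, name: 'HTML' },
  el('HTML', [], [
    el('HEAD', [], [el('TITLE', [], [text('Hello World')])]),
    el('BODY', [], [
      el('P', [{ name: 'title', value: 'example' }], [text('Some text.')]),
      { nodeType: COMMENT_NODE, nodeValue: ' A comment. ' },
    ]),
  ]),
] };

console.log(dumpTree(exampleDocument).join('\n'));
```

Note that this version doesn't trim or collapse whitespace in text and comment nodes, so the comment's leading and trailing spaces show up in the output.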

Once you understand the general scheme, consider this: the markup above doesn't look like that in IE! Nor Safari! Nor Opera! In fact, the only browser that gets it right is Firefox!

IE does this:

#comment: CTYPE HT
HTML
- HEAD
  - TITLE
- BODY
  - P title="example"
    - #text: Some text.
  - #comment: A comment.

First, it mangles the DOCTYPE node, instead treating <!DOCTYPE HTML> as <!--CTYPE HT--> (that is, as a comment, cutting the text as if it were a real comment). Then, it adds an empty implied TITLE element, for no apparent reason (the specs require implied HTML, HEAD, and BODY elements, but not an implied TITLE element).
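The mangled comment text is consistent with chopping off as many characters as "<!--" and "-->" would occupy. A one-line sketch reproduces it. asBogusComment is a made-up name, and I'm only matching the observed output here, not claiming this is how IE is implemented.

```javascript
// Treat any "<!...>" construct as if it were "<!--...-->": drop as many
// characters from each end as the real comment delimiters would take.
function asBogusComment(declaration) {
  return declaration.slice('<!--'.length, declaration.length - '-->'.length);
}

const mangledDoctype = asBogusComment('<!DOCTYPE HTML>'); // "CTYPE HT"
```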

Opera does this:

HTML
- TITLE
  - #text: Hello World
- BODY
  - P TITLE="example"
    - #text: Some text.
  - #comment: A comment.

No implied HEAD element at all! It takes a perfectly compliant HTML document and makes a non-compliant DOM out of it. Oops. This can cause all kinds of problems with CSS and scripting, too.

Not to be outdone, Safari gives us this minimal approach:

HTML
- HEAD
  - TITLE
    - #text: Hello World
- BODY
  - P title="example"
    - #text: Some text.

That's right: no DOCTYPE, no comment. Still, on a positive note, at least what it does have is correct!

Firefox itself is far from perfect, of course. Let's explore some of Firefox's more amusing... features. First, a lone <TEXTAREA> element:

<!DOCTYPE html><textarea>

You would expect to get simply a blank <TEXTAREA> element, but in Firefox, this is parsed as:

DOCTYPE: html
HTML
- HEAD
- BODY
  - TEXTAREA
    - #text: </HTML>

Um. Where did that text node come from?! It turns out that in Firefox, document.close() will always just append </HTML> to whatever was passed to document.write(), whether that makes sense or not. Oops!

How about duplicate attributes? Does the first attribute win, or does the second attribute win?

<!DOCTYPE html><html id="a" id="b"><body id="a" id="b">

Will the first win, or the second? I'll give you a hint. In every other browser, the first one wins. So what do you think? The second, maybe?

DOCTYPE: html
HTML id="b"
- HEAD
- BODY id="a"

If only. It turns out that the first attribute wins, except on the <HTML> tag, where the second attribute wins!
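For comparison, here is what the two consistent policies look like as code. This is a sketch of the policies themselves, with hypothetical function names, operating on a list of [name, value] pairs in source order. It doesn't model any particular browser's tokeniser.

```javascript
// Keep the first occurrence of each attribute name (compared
// case-insensitively, as HTML attribute names are).
function firstWins(pairs) {
  const attributes = new Map();
  for (const [name, value] of pairs) {
    const key = name.toLowerCase();
    if (!attributes.has(key)) attributes.set(key, value);
  }
  return attributes;
}

// Keep the last occurrence instead.
function lastWins(pairs) {
  const attributes = new Map();
  for (const [name, value] of pairs) {
    attributes.set(name.toLowerCase(), value);
  }
  return attributes;
}

const duplicated = [['id', 'a'], ['id', 'b']];
const firstId = firstWins(duplicated).get('id'); // "a"
const lastId = lastWins(duplicated).get('id');   // "b"
```

Firefox's result above amounts to using lastWins for the HTML element and firstWins everywhere else.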

How about bogus elements?

<!DOCTYPE html><html><body><x foo bar>

Result:

DOCTYPE: html
HTML
- HEAD
- BODY
  - X _moz-userdefined="" bar="" foo=""

We'll gloss over the mysterious _moz-userdefined attribute, because it isn't harmful (it's marked with a vendor-specific prefix, and is just used internally so that the browser and editor Mozilla components can recognise real HTML elements from bogus elements). Other than that, it makes sense. But what happens if we remove the (optional) <BODY> tag?

<!DOCTYPE html><html><x foo bar>

It does the same thing, right?

DOCTYPE: html
HTML
- HEAD
- BODY
  - X _moz-userdefined="" bar=""

Yeah, nothing surpr— wait! Where did our foo attribute go?!

Finally let's look at what happens when you play with the <OPTGROUP> tag:

<!DOCTYPE html><div><optgroup>

The DOM is:

DOCTYPE: html
HTML
- HEAD
- BODY
  - DIV

Curious, you might think, the <OPTGROUP> tag didn't get an element, it was just thrown away. Well, that makes sense; the element is useless outside of a <SELECT> element anyway. Let's remove the completely unrelated <DIV> tag...:

<!DOCTYPE html><optgroup>

We should just get the same DOM with the <DIV> element removed, right? Er, nope:

DOCTYPE: html
HTML
- HEAD
- BODY
  - FORM
    - SELECT
      - OPTGROUP

Don't ask me where those elements came from.

Internet Explorer is not immune from these crazy parsing antics either, of course. Take this example:

<!DOCTYPE html><frameset><frameset rows="1">

It gets this DOM:

#comment: CTYPE HT
HTML
- HEAD
  - TITLE
- FRAMESET

The second <FRAMESET> tag gets dropped, but frankly that's not that surprising: around <FRAMESET> elements a lot of things end up dropped. No, what is surprising is that if you change it like this (just adding a comma on the end of the attribute's value):

<!DOCTYPE html><frameset><frameset rows="1,">

The DOM suddenly changes!:

#comment: CTYPE HT
HTML
- HEAD
  - TITLE
- FRAMESET
  - FRAMESET rows="1,"

There's our second <FRAMESET>! Yup, it seems that in IE, the attributes can affect what elements appear in the DOM. Go figure.

Talking about <FRAMESET>s, Safari has an interesting take on this example:

<!DOCTYPE html><body><frameset>

It does this:

HTML
- BODY style="display:none"
- FRAMESET

I don't recall putting a style attribute anywhere! Where did it come from?

Luckily, when the parsers don't agree, it's usually a sign that no pages depend on their behaviour; so, ironically, every time I find the browsers disagreeing on how to parse some HTML, it gives me more leeway to make the spec sane. Who knows, in a few years, we might have all the browsers parsing HTML the same way! That's certainly my aim. It would make Web development a heck of a lot easier and more predictable.

Note: In all these examples I'm using the HTML5 DOCTYPE (<!DOCTYPE HTML>), which triggers standards mode. The HTML5 spec says that if you use another DOCTYPE, UAs can switch to quirks mode, in which case all bets are off. I'm not even going to try to specify quirks mode parsing. Hopefully, by making the DOCTYPE short and memorable, it will encourage authors to use it more.
