2006-01-20 07:03 UTC Tag Soup: Crazy parsing adventures
My current main project at work is writing the HTML5 Parser Specification, which is a specification of how browsers should parse HTML, including all the fancy error handling logic that up to now has been left officially undefined.
Of course, although it has been left undefined as far as the specifications go, any browser that wants to even attempt to render any useful portion of the Web quickly finds itself reverse-engineering all the other browsers in an attempt to render the existing content, because, to a rough approximation, all the content on the Web is erroneous, invalid, or non-conformant.
This reverse-engineering has been imperfect, though, since it's not exactly easy to determine how another browser works when you have an infinite range of possible inputs.
I've looked at how browsers handle tag soup before, but now that I'm writing a spec for this, I've run into a host of other issues, and I finally just wrote a tool to see how browsers parse HTML. It shows your input, the DOM view, the rendered HTML view, and the results of the DOM Level 0 innerHTML attribute, as well as the compatMode state and what the browser thinks is the document title. It's a useful tool, because it lets you rapidly check theories, instead of having to go through the normal edit-save-reload-examine cycle.
I've found some interesting things I'd like to show you.
To explain the notation I'm going to be using, let's start with something simple:
<!DOCTYPE HTML><title>Hello World</title><p title="example">Some text.</p><!-- A comment. -->
In a compliant browser, the DOM tree would be as follows:
- DOCTYPE: HTML
- HTML
  - HEAD
    - TITLE
      - #text: Hello World
  - BODY
    - P title="example"
      - #text: Some text.
    - #comment: A comment.
Hopefully this should give you a good idea of the various kinds of nodes you'll see in a DOM tree — DOCTYPEs, elements, text nodes, attributes, and comments.
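To make the notation concrete, here is a minimal Python sketch that builds the compliant tree for the example above by hand and prints it as an indented outline. The Node class and the dump function are hypothetical helpers invented for this illustration; they are not part of my tool or of any browser API, and nothing here actually parses HTML.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a DOM node (illustration only)."""
    name: str                                   # "HTML", "#text", "#comment", "DOCTYPE", ...
    value: Optional[str] = None                 # text of a #text/#comment, or the DOCTYPE name
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def dump(node: Node, depth: int = 0) -> List[str]:
    """Return one outline line per node, indented two spaces per tree level."""
    label = node.name
    if node.value is not None:
        label += f": {node.value}"
    for key, val in node.attrs.items():
        label += f' {key}="{val}"'
    lines = ["  " * depth + "- " + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


# The tree a compliant browser is expected to build for the example markup.
doctype = Node("DOCTYPE", value="HTML")
html = Node("HTML", children=[
    Node("HEAD", children=[
        Node("TITLE", children=[Node("#text", value="Hello World")]),
    ]),
    Node("BODY", children=[
        Node("P", attrs={"title": "example"},
             children=[Node("#text", value="Some text.")]),
        Node("#comment", value="A comment."),
    ]),
])

for top_level in (doctype, html):
    print("\n".join(dump(top_level)))
```

Each level of nesting in the printed outline corresponds to a parent/child relationship in the DOM, which is exactly what differs between browsers in the examples that follow.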
Once you understand the general scheme, consider this: the markup above doesn't look like that in IE! Nor Safari! Nor Opera! In fact, the only browser that gets it right is Firefox!
IE does this:
- #comment: CTYPE HT
- HTML
  - HEAD
    - TITLE
  - BODY
    - P title="example"
      - #text: Some text.
    - #comment: A comment.
First, it mangles the DOCTYPE node, instead treating <!DOCTYPE HTML> as <!--CTYPE HT--> (that is, as a comment, cutting the text as if it was a real comment). Then, it adds an empty implied TITLE element, for no apparent reason (the specs require implied HTML, HEAD, and BODY elements, but not an implied TITLE element).
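The DOCTYPE mangling can be modelled in a couple of lines of Python. This is only a guess that reproduces the one observed result, not IE's actual algorithm: the hypothetical function as_bogus_comment assumes the tag is read as though "<!--" opened it and "-->" closed it, so that two characters at each end of the content are swallowed as if they were the comment's "--" delimiters.

```python
def as_bogus_comment(markup: str) -> str:
    """Model of the mangling described above; an assumption, not IE's real code.

    The two characters after "<!" and the two characters before ">" are
    dropped, as if they had been the "--" delimiters of a real comment.
    """
    if not (markup.startswith("<!") and markup.endswith(">")):
        raise ValueError("expected a markup declaration such as <!DOCTYPE HTML>")
    inner = markup[2:-1]   # strip the leading "<!" and the trailing ">"
    return inner[2:-2]     # swallow two characters at each end


print(as_bogus_comment("<!DOCTYPE HTML>"))
```

For <!DOCTYPE HTML> this yields the same comment text that IE shows in its DOM. Whether the model holds for other inputs is untested.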
Opera does this:
- HTML
  - TITLE
    - #text: Hello World
  - BODY
    - P TITLE="example"
      - #text: Some text.
    - #comment: A comment.
No implied HEAD element at all! It takes a perfectly compliant HTML document and makes a non-compliant DOM out of it. Oops. This can cause all kinds of problems with CSS and scripting, too.
Not to be outdone, Safari gives us this minimal approach:
- HTML
  - HEAD
    - TITLE
      - #text: Hello World
  - BODY
    - P title="example"
      - #text: Some text.
That's right: no DOCTYPE, no comment. Still, on a positive note, at least what it does have is correct!
Firefox itself is far from perfect, of course. Let's explore some of Firefox's more amusing... features. First, a lone <TEXTAREA> element:
<!DOCTYPE html><textarea>
You would expect to get simply a blank <TEXTAREA> element, but in Firefox, this is parsed as:
- DOCTYPE: html
- HTML
  - HEAD
  - BODY
    - TEXTAREA
      - #text: </HTML>
Um. Where did that text node come from?! It turns out that in Firefox, document.close() will always just append </HTML> to whatever was passed to document.write(), whether that makes sense or not. Since the <TEXTAREA> was never closed, the appended text ends up as its contents. Oops!
How about duplicate attributes? Does the first attribute win, or does the second attribute win?
<!DOCTYPE html><html id="a" id="b"><body id="a" id="b">
Will the first win, or the second? I'll give you a hint. In every other browser, the first one wins. So what do you think? The second, maybe?
- DOCTYPE: html
- HTML id="b"
  - HEAD
  - BODY id="a"
If only. It turns out that the first attribute wins, except on the <HTML> tag, where the second attribute wins!
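The two policies are easy to compare in code. The sketch below uses the tokenizer in Python's standard html.parser module, which reports every attribute on a start tag as a (name, value) pair, duplicates included, and leaves the choice between them to the caller. The names StartTagCollector, first_wins, last_wins and observed_firefox_policy are invented for this illustration, and the last one merely restates the result shown above; it is an assumption about the outcome, not Firefox's implementation.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records each start tag with its raw (name, value) attribute pairs."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs keeps duplicates in source order; tag names arrive lowercased.
        self.start_tags.append((tag, list(attrs)))


def first_wins(attrs):
    """Keep the first value seen for each attribute name."""
    merged = {}
    for name, value in attrs:
        merged.setdefault(name, value)
    return merged


def last_wins(attrs):
    """Keep the last value seen for each attribute name."""
    return dict(attrs)


def observed_firefox_policy(tag, attrs):
    """Restates the result described above: last wins on <html>, first elsewhere."""
    return last_wins(attrs) if tag == "html" else first_wins(attrs)


collector = StartTagCollector()
collector.feed('<!DOCTYPE html><html id="a" id="b"><body id="a" id="b">')
collector.close()

for tag, attrs in collector.start_tags:
    print(tag.upper(), first_wins(attrs), observed_firefox_policy(tag, attrs))
```

A single consistent rule such as first_wins for every element is the kind of behaviour a parsing spec can pin down. The per-element special case is what makes the result above surprising.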
How about bogus elements?
<!DOCTYPE html><html><body><x foo bar>
Result:
- DOCTYPE: html
- HTML
  - HEAD
  - BODY
    - X _moz-userdefined="" bar="" foo=""
We'll gloss over the mysterious _moz-userdefined attribute, because it isn't harmful (it's marked with a vendor-specific prefix, and is just used internally so that the browser and editor Mozilla components can tell real HTML elements from bogus elements). Other than that, it makes sense. But what happens if we remove the (optional) <BODY> tag?
<!DOCTYPE html><html><x foo bar>
It does the same thing, right?
- DOCTYPE: html
- HTML
  - HEAD
  - BODY
    - X _moz-userdefined="" bar=""
Yeah, nothing surpr... wait! Where did our foo attribute go?!
Finally, let's look at what happens when you play with the <OPTGROUP> tag:
<!DOCTYPE html><div><optgroup>
The DOM is:
- DOCTYPE: html
- HTML
  - HEAD
  - BODY
    - DIV
Curious, you might think, the <OPTGROUP> tag didn't get an element, it was just thrown away. Well, that makes sense; the element is useless outside of a <SELECT> element anyway. Let's remove the completely unrelated <DIV> tag:
<!DOCTYPE html><optgroup>
We should just get the same DOM with the <DIV> element removed, right? Er, nope:
- DOCTYPE: html
- HTML
  - HEAD
  - BODY
    - FORM
      - SELECT
        - OPTGROUP
Don't ask me where those elements came from.
Internet Explorer is not immune from these crazy parsing antics either, of course. Take this example:
<!DOCTYPE html><frameset><frameset rows="1">
It gets this DOM:
- #comment: CTYPE HT
- HTML
  - HEAD
    - TITLE
  - FRAMESET
The second <FRAMESET> tag gets dropped, but frankly that's not that surprising: around <FRAMESET> elements a lot of things end up dropped. No, what is surprising is that if you change it like this (just adding a comma on the end of the attribute's value):
<!DOCTYPE html><frameset><frameset rows="1,">
The DOM suddenly changes:
- #comment: CTYPE HT
- HTML
  - HEAD
    - TITLE
  - FRAMESET
    - FRAMESET rows="1,"
There's our second <FRAMESET>! Yup, it seems that in IE, the attributes can affect what elements appear in the DOM. Go figure.
Talking about <FRAMESET>s, Safari has an interesting take on this example:
<!DOCTYPE html><body><frameset>
It does this:
- HTML
  - BODY style="display:none"
  - FRAMESET
I don't recall putting a style attribute anywhere! Where did it come from?
Luckily, when the parsers don't agree, it's usually a sign that no pages depend on their behaviour; so, ironically, every time I find the browsers disagreeing on how to parse some HTML, it gives me more leeway to make the spec sane. Who knows, in a few years, we might have all the browsers parsing HTML the same way! That's certainly my aim. It would make Web development a heck of a lot easier and more predictable.
Note: In all these examples I'm using the HTML5 DOCTYPE (<!DOCTYPE HTML>), which triggers standards mode. The HTML5 spec says that if you use another DOCTYPE, UAs can switch to quirks mode, in which case all bets are off. I'm not even going to try to specify quirks mode parsing. Hopefully, by making the DOCTYPE short and memorable, it will encourage authors to use it more.