What IE does is insane. What Opera does is also insane. Neither of
those options is something that I can put in a specification with a
straight face.
This leaves the Mozilla/Safari method.
It's weird, though. If you look at the two examples above, you'll
notice that both of them start with the same markup:
<!DOCTYPE html><em><p>X
Yet the end result is quite different, with one of the elements
(the p) having different
parents in the two cases. So when do the browsers decide what
to do? They can't be buffering content up and deciding what to do
later, since that would break incremental rendering. So what exactly
is going on?
Hrm. They disagree. Mozilla is using the "malformed" version, and
Safari is using the "well-formed" version. Why? How do they decide?
Let's look at Safari first, by running a script while the parser is
running. First, the simple case:
<!DOCTYPE html>
<em>
<p>
XY
<script>
var p = document.getElementsByTagName('p')[0];
p.title = p.parentNode.tagName;
</script>
</p>
</em>
Result:
HTML
BODY
EM
#text:
P title="EM"
#text:
XY
SCRIPT
#text:
var p = document.getElementsByTagName('p')[0];
p.title = p.parentNode.tagName;
#text:
#text:
Exactly as we'd expect. The parentNode of the p element as shown in the DOM tree view is
the same as shown in the title
attribute value, namely, the em
element.
<!DOCTYPE html>
<em>
<p>
X
<script>
var p = document.getElementsByTagName('p')[0];
p.title = p.parentNode.tagName;
</script>
</em>
Y
</p>
Result:
HTML
BODY
EM
#text:
P title="EM"
EM
#text:
X
SCRIPT
#text:
var p = document.getElementsByTagName('p')[0];
p.title = p.parentNode.tagName;
#text:
#text:
Y
Wait, what?
When the embedded script ran, the parent of the p was the em, but when the parser had finished, the
DOM had changed, and the parent was no longer the em node!
<!DOCTYPE html>
<em>
<p>
X
<script>
var p = document.getElementsByTagName('p')[0];
p.setAttribute('a', p.parentNode.tagName);
</script>
</em>
Y
<script>
var p = document.getElementsByTagName('p')[0];
p.setAttribute('b', p.parentNode.tagName);
</script>
</p>
...we find:
HTML
BODY
EM
#text:
P a="EM" b="BODY"
EM
#text:
X
SCRIPT
#text:
var p = document.getElementsByTagName('p')[0];
p.setAttribute('a', p.parentNode.tagName);
#text:
#text:
Y
SCRIPT
#text:
var p = document.getElementsByTagName('p')[0];
p.setAttribute('b', p.parentNode.tagName);
#text:
...which is to say, the parent changes half way through! (Compare
the a and b attributes.)
What actually happens is that Safari notices that something bad has
happened, and moves the element around in the DOM. After the
fact. (If you remove the p element
from the DOM in that first script block, then Safari
crashes.)
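To make that sequence of events concrete, here is a toy model in plain JavaScript (no browser required). The `ToyNode` class and the `simulateAfterTheFactReparenting` function are names I made up for illustration. This is not how Safari is implemented; it only replays the order of events observed in the test above: the first script sees the em as the parent, the parser then moves the p, and the second script sees the body.

```javascript
// Hypothetical stand-in for a DOM element; just enough to track parents.
class ToyNode {
  constructor(tagName) {
    this.tagName = tagName;
    this.parentNode = null;
    this.childNodes = [];
    this.attributes = {};
  }
  appendChild(child) {
    // Detach from the old parent first, so the structure stays a tree.
    if (child.parentNode) {
      const siblings = child.parentNode.childNodes;
      siblings.splice(siblings.indexOf(child), 1);
    }
    child.parentNode = this;
    this.childNodes.push(child);
    return child;
  }
}

function simulateAfterTheFactReparenting() {
  const body = new ToyNode('BODY');
  const em = body.appendChild(new ToyNode('EM')); // <em>
  const p = em.appendChild(new ToyNode('P'));     // <p>, nested in the em

  // First <script>: runs while the p is still a child of the em.
  p.attributes.a = p.parentNode.tagName;

  // </em> arrives while the p is still open. The parser notices the
  // misnesting and moves the p out of the em, after the fact.
  body.appendChild(p);

  // Second <script>: same element, different parent.
  p.attributes.b = p.parentNode.tagName;

  return { body, em, p };
}

const result = simulateAfterTheFactReparenting();
console.log(result.p.attributes); // a is "EM", b is "BODY"
```

The sketch leaves out the extra em that ends up inside the p in the real DOM; the only point is that a script holding a reference to the p can see its parentNode change underneath it.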
How about Mozilla? Let's try the same
trick. The result:
DOCTYPE: html
HTML
HEAD
BODY
EM
#text:
P a="BODY" b="BODY"
#text:
EM
#text: X
SCRIPT
#text:
var p = document.getElementsByTagName('p')[0];
p.setAttribute('a', p.parentNode.tagName);
#text:
#text:
Y
SCRIPT
#text:
var p = document.getElementsByTagName('p')[0];
p.setAttribute('b', p.parentNode.tagName);
#text:
It doesn't reparent the node. So what does Mozilla do?
It turns out that Mozilla does a pre-parse of the source, and if a
part of it is well-formed, it creates a well-formed tree for it, but
if the markup isn't well-formed, or if there are any script blocks, or, for that matter, if
the TCP/IP packet boundary happens to fall in the wrong place, or if
you write the document out in two document.write()s
instead of one, then it'll make the more thorough nesting that handles
ill-formed content.
Who would have thought that you would find Heisenberg-like quantum
effects in an HTML parser? I mean, I knew they were obscure, but this
is just taking the biscuit.
The problem is I now have to determine which of these four options
to make the other three browsers implement (that is, which do I put in
the spec). What do you think is the most likely to be accepted by the
others? As a reminder, the options are incestuous elements that can be
their own uncles, elements who have secret lives in the rendering
engine, elements that change their mind about who their parents are
half-way through their childhood, and quantum elements whose parents
change depending on whether you observe their birth or not.
The key requirements are probably:
Coherence: scripts that rely on DOM invariants (like the fact that the DOM is a tree) shouldn't go off into infinite loops.
Transparency: we shouldn't have to describe a whole extra section that explains how the CSS rendering engine applies to HTML DOMs; CSS should just work on the real DOM as you would see it from script.
Predictability: it shouldn't depend on, e.g., the protocol or network conditions — every browser should get the same DOM for the same original markup in all situations.
The least bad option is probably the Safari-style on-the-fly reparenting, I think, but I'm not sure. It's the only one that fits those requirements. Is there a fifth option I'm missing?
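For the coherence requirement, this is roughly the invariant I have in mind, written as a small check. It is only a sketch: `isTree`, `makeNode`, and `link` are hypothetical helpers working on plain objects that mimic the `childNodes` and `parentNode` properties of DOM nodes.

```javascript
// Returns true when everything reachable from `root` forms a proper tree:
// no node is reached twice, and every child's parentNode points back at
// the node that lists it as a child.
function isTree(root) {
  const seen = new Set();
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) return false; // reached twice: shared child or cycle
    seen.add(node);
    for (const child of node.childNodes) {
      if (child.parentNode !== node) return false; // parent pointer disagrees
      stack.push(child);
    }
  }
  return true;
}

// Hypothetical plain-object nodes, standing in for real DOM nodes.
function makeNode(name) {
  return { nodeName: name, parentNode: null, childNodes: [] };
}
function link(parent, child) {
  child.parentNode = parent;
  parent.childNodes.push(child);
}

const body = makeNode('BODY');
const em = makeNode('EM');
const p = makeNode('P');
link(body, em);
link(em, p);
console.log(isTree(body)); // true

// An incoherent structure: the p is listed as a child of both em and body.
body.childNodes.push(p);
console.log(isTree(body)); // false
```

A script that walks `childNodes` recursively without a `seen` set has no such protection, which is why a structure that fails this check can send it round in circles.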
January 1999. I'm nineteen, in my first year
studying Physics at Bath
University. I read an SGML tutorial (maybe this
one from 1995). I wrote a testcase. I
filed a bug, in which I wrote:
Comment delimiters are "--" while inside tags.
Thus: <!-- in -- -- in -- -- in -->
where "in" shows what is commented.
On the test page quoted, all is explained.
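The rule in that bug report can be sketched in a few lines of JavaScript. This is a simplified illustration, not the code that went into any browser: `parseSgmlDeclaration` is a made-up name, and the sketch quietly ignores any stray text that appears between comments instead of treating it as an error.

```javascript
// Scans one "<!...>" declaration starting at `start`.
// Returns the commented segments plus the index just past the closing ">",
// or null if the declaration never ends.
function parseSgmlDeclaration(input, start = 0) {
  if (!input.startsWith('<!', start)) return null;
  const comments = [];
  let i = start + 2;
  let inComment = false;
  let current = '';
  while (i < input.length) {
    if (input.startsWith('--', i)) {
      // Each "--" toggles between inside and outside a comment.
      if (inComment) comments.push(current);
      current = '';
      inComment = !inComment;
      i += 2;
    } else if (!inComment && input[i] === '>') {
      // ">" only ends the declaration when we are outside a comment.
      return { comments, end: i + 1 };
    } else {
      if (inComment) current += input[i];
      i += 1;
    }
  }
  return null; // ran off the end: unterminated declaration
}

const good = parseSgmlDeclaration('<!-- in -- -- in -- -- in -->');
console.log(good.comments); // three segments, each " in "

const bad = parseSgmlDeclaration('<!-- a -- b -->');
console.log(bad); // null: the final ">" falls inside a comment
```

The second call hints at why strict parsing breaks real pages: a stray `--` in the middle of what the author thought was one comment leaves the parser inside a comment when the `>` arrives, so under these rules the declaration does not end there.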
October 1999. The code for the fix is turned
on along with the standards-mode HTML parser. Mozilla is now the
first "major" browser to support SGML-style comments.
September 2000. The UN Web site breaks because it
triggers standards mode but uses incorrect comment syntax. Mozilla
drops full SGML comment parsing.
March 2001. Mozilla re-enables its strict comment
parsing; evangelism is used to convince the broken sites to fix their
markup.
May 2003. Netscape devedge publishes a document on
the matter to help the Mozilla evangelists explain this to
authors.
July 2003. I open a bug in the Opera bug
database to get Opera to implement SGML comment parsing.
January 2004. I file another bug in the Opera bug
database, having forgotten about the earlier one, to get Opera to
implement SGML comment parsing.
February 2005. Håkon and I write the first
draft of the Acid2 test.
March 2005. While giving a workshop on how to
create test cases at Opera, I find that http://www.wassada.com/ renders
correctly in Mozilla and fails to render in Opera precisely because
Mozilla renders comments according to the SGML way and Opera
doesn't. Over Håkon's objections, I insist on including a test
for the SGML comment syntax in Acid2, citing the Wassada site as proof
that we need to get interoperability on the matter. Acid2
is announced.
October 2005. Opera
fixes SGML comment parsing as part of their Acid2 work, after many
complaints internally telling me I was wrong to include this part of
the test. I point to the Wassada site. They point to the dozens of
sites that break because of this change. I point to the fact that they
aren't broken in Mozilla. They realise their fix was not quite right,
and make things work, but still grumble about it being stupid.
November 2005. Mark writes a long
document explaining the SGML comment parsing mode. Håkon
proposes removing this part of the test from Acid2. I point out that
as long as the specs require this, we don't have a good reason to
remove it from the test.
I'd like to apologise to everyone whose time I've wasted by
insisting on following the specs on this matter for the past seven
years. You probably number in the hundreds by now. Sometimes, the spec
is wrong, and we just have to fix it. I'm sorry it took me so long to
realise that this was the case here.
My current main project at work is writing the HTML5
Parser Specification, which is a specification of how browsers
should parse HTML, including all the fancy error handling logic that
up to now has been left officially undefined.
Of course, although it has been left undefined as far as the
specifications go, any browser that wants to even attempt to
render any useful portion of the Web quickly finds itself
reverse-engineering all the other browsers in an attempt to render the
existing content, because, to a rough approximation, all the content
on the Web is erroneous, invalid, or non-conformant.
This reverse-engineering has been imperfect, though, since it's not
exactly easy to determine how another browser works when you have an
infinite range of possible inputs.
I've looked at how browsers
handle tag soup before, but now that I'm writing a spec
for this, I've run into a host of other issues, and I finally just wrote a
tool to see how browsers parsed HTML. It shows your input, the DOM
view, the rendered HTML view, and the results of the DOM Level 0
innerHTML attribute, as well as the
compatMode state and what the browser thinks is the
document title. It's a useful tool, because it lets you rapidly check
theories, instead of having to go through the normal
edit-save-reload-examine cycle.
I've found some interesting things I'd like to show you.
To explain the notation I'm going to be using, let's start with
something simple:
<!DOCTYPE HTML><title>Hello World</title><p title="example">Some text.</p><!-- A comment. -->
In a compliant browser, the DOM tree would be as follows:
DOCTYPE: HTML
HTML
HEAD
TITLE
#text: Hello World
BODY
P title="example"
#text: Some text.
#comment: A comment.
Hopefully this should give you a good idea of the various kinds of
nodes you'll see in a DOM tree — DOCTYPEs, elements, text nodes,
attributes, and comments.
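If you want to produce this kind of dump yourself, something along these lines should do. It is a minimal sketch, not the code of my tool: `dumpNode` is a hypothetical helper, the two-space indentation per level is my own choice, and the mock document (including the decision to hang the comment off BODY) is an assumption made so that the example runs outside a browser.

```javascript
// Turns a node and its descendants into lines in the notation used here.
// Relies only on nodeType, nodeName, nodeValue, name, attributes, and
// childNodes, so it accepts real DOM nodes or plain-object mocks.
function dumpNode(node, indent = '') {
  const lines = [];
  switch (node.nodeType) {
    case 1: { // element, followed by its attributes
      const attrs = Array.from(node.attributes || [])
        .map((a) => ` ${a.name}="${a.value}"`)
        .join('');
      lines.push(indent + node.nodeName + attrs);
      break;
    }
    case 3: // text node
      lines.push(`${indent}#text: ${node.nodeValue}`);
      break;
    case 8: // comment
      lines.push(`${indent}#comment: ${node.nodeValue}`);
      break;
    case 10: // DOCTYPE
      lines.push(`${indent}DOCTYPE: ${node.name}`);
      break;
    default: // e.g. the document node itself prints nothing
      break;
  }
  // Children of the document node stay at the top level.
  const childIndent = node.nodeType === 9 ? indent : indent + '  ';
  for (const child of Array.from(node.childNodes || [])) {
    lines.push(...dumpNode(child, childIndent));
  }
  return lines;
}

// Plain-object mock of the tree a compliant browser would build.
const text = (value) => ({ nodeType: 3, nodeValue: value, childNodes: [] });
const el = (name, attributes, childNodes) =>
  ({ nodeType: 1, nodeName: name, attributes, childNodes });
const mockDocument = {
  nodeType: 9,
  childNodes: [
    { nodeType: 10, name: 'HTML', childNodes: [] },
    el('HTML', [], [
      el('HEAD', [], [el('TITLE', [], [text('Hello World')])]),
      el('BODY', [], [
        el('P', [{ name: 'title', value: 'example' }], [text('Some text.')]),
        { nodeType: 8, nodeValue: ' A comment. ', childNodes: [] },
      ]),
    ]),
  ],
};
console.log(dumpNode(mockDocument).join('\n'));
```

In a browser, calling `dumpNode(document)` on the page under test gives the same kind of listing for whatever that browser's parser actually built.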
Once you understand the general scheme, consider this: the markup
above doesn't look like that in IE! Nor Safari! Nor Opera! In fact,
the only browser that gets it right is Firefox!
IE does this:
#comment: CTYPE HT
HTML
HEAD
TITLE
BODY
P title="example"
#text: Some text.
#comment: A comment.
First, it mangles the DOCTYPE node, instead treating <!DOCTYPE HTML> as <!--CTYPE HT--> (that is, as a comment,
cutting the text as if it was a real comment). Then, it adds an empty
implied TITLE element, for no
apparent reason (the specs require implied HTML, HEAD, and BODY elements, but not an implied TITLE element).
Opera does this:
HTML
TITLE
#text: Hello World
BODY
P TITLE="example"
#text: Some text.
#comment: A comment.
No implied HEAD element at all! It
takes a perfectly compliant HTML document and makes a non-compliant
DOM out of it. Oops. This can cause all kinds of problems with CSS and
scripting, too.
Not to be outdone, Safari gives us this minimal approach:
HTML
HEAD
TITLE
#text: Hello World
BODY
P title="example"
#text: Some text.
That's right: no DOCTYPE, no comment. Still, on a positive note, at
least what it does have is correct!
Firefox itself is far from perfect, of course. Let's explore some
of Firefox's more amusing... features. First, a lone
<TEXTAREA> element:
<!DOCTYPE html><textarea>
You would expect to get simply a blank <TEXTAREA> element, but in Firefox, this
is parsed as:
DOCTYPE: html
HTML
HEAD
BODY
TEXTAREA
#text: </HTML>
Um. Where did that text node come from?! It turns out that in
Firefox, document.close() will always just append
</HTML> to whatever was passed to
document.write(), whether that makes sense or
not. Oops!
How about duplicate
attributes? Does the first attribute win, or does the second
attribute win?
We'll gloss over the mysterious _moz-userdefined attribute, because it
isn't harmful (it's marked with a vendor-specific prefix, and is just
used internally so that the browser and editor Mozilla components can
recognise real HTML elements from bogus elements). Other than that, it
makes sense. But what happens if we remove
the (optional) <BODY> tag?
<!DOCTYPE html><html><x foo bar>
It does the same thing, right?
DOCTYPE: html
HTML
HEAD
BODY
X _moz-userdefined="" bar=""
Yeah, nothing surpr— wait! Where did our foo attribute go?!
Curious, you might think, the <OPTGROUP> tag didn't get an element, it
was just thrown away. Well, that makes sense; the element is useless
outside of a <SELECT> element
anyway. Let's remove
the completely unrelated <DIV>
tag...:
<!DOCTYPE html><optgroup>
We should just get the same DOM with the <DIV> element removed, right? Er,
nope:
DOCTYPE: html
HTML
HEAD
BODY
FORM
SELECT
OPTGROUP
Don't ask me where those elements came from.
Internet Explorer is not immune from these crazy parsing antics
either, of course. Take this
example:
<!DOCTYPE html><frameset><frameset rows="1">
It gets this DOM:
#comment: CTYPE HT
HTML
HEAD
TITLE
FRAMESET
The second <FRAMESET> tag gets
dropped, but frankly that's not that surprising: around <FRAMESET> elements a lot of things end
up dropped. No, what is surprising, is that if you change it like
this (just adding a comma on the end of the attribute's value):
<!DOCTYPE html><frameset><frameset rows="1,">
The DOM suddenly changes:
#comment: CTYPE HT
HTML
HEAD
TITLE
FRAMESET
FRAMESET rows="1,"
There's our second <FRAMESET>!
Yup, it seems that in IE, the attributes can affect what
elements appear in the DOM. Go figure.
Talking about <FRAMESET>s,
Safari has an interesting take on this
example:
<!DOCTYPE html><body><frameset>
It does this:
HTML
BODY style="display:none"
FRAMESET
I don't recall putting a style
attribute anywhere! Where did it come from?
Luckily, when the parsers don't agree, it's usually a sign that no
pages depend on their behaviour; so, ironically, every time I find the
browsers disagreeing on how to parse some HTML, it gives me more
leeway to make the spec sane. Who knows, in a few years, we might have
all the browsers parsing HTML the same way! That's certainly my
aim. It would make Web development a heck of a lot easier and more
predictable.
Note: In all these examples I'm using the HTML5 DOCTYPE
(<!DOCTYPE HTML>), which triggers standards mode. The
HTML5 spec says that if you use another DOCTYPE, UAs can switch to
quirks mode, in which case all bets are off. I'm not even going to
try to specify quirks mode parsing. Hopefully, by making the
DOCTYPE short and memorable, it will encourage authors to use it
more.
Yesterday was mlk
day. I think it's great that this country values good programming
practices to the extent of having days dedicated to solving
programming problems, but it seems that most people took the day off,
so I guess it wasn't that successful. Maybe everyone is using garbage
collectors these days or something.
Kerz and I are still working hard on our layout. I'd love to have
enough room to be able to include something like a typical
mid-size yard, but instead we're having to make do with just a few
small industry tracks. We recently made a joint purchase of a class
602 and its
intermediate cars. I'm not a fan of model passenger trains
normally but this baby is seriously awesome. I used to be very
skeptical of sound effects in trains. That has all changed now,
though. Märklin's sound effects modules are as high-quality as
the rest of their workmanship. The 602 sounds like a real train right
down to the individual brake squeals. In fact now I really regret
missing the chance of buying the
Ae 6/6 double set which came out last summer — it had full
mfx everything.
Speaking of which, kerz recently informed me that the Central
Station actually is a Linux embedded system that you can SSH
into. This increases my interest in it substantially.
On another note, I learnt Ancient over the weekend. Now I
understand how it is that even Maybourne could read it. I wish Norwegian had been as easy for me to learn...
For some strange reason, we got the end of last week off. As best I
can determine, there was a surplus of turkeys in the turkey farms, and
so the entire country had to be called into an emergency session of
turkey eating. I've never seen this kind of thing happen before, it
was weird. (Actually, come to think of it, when I was interning for
Netscape there was a week in November where it seemed I was the only
one working — I had assumed that there had just been some
mysterious illness, but maybe it was related to this turkey
emergency... It could be an annual thing. I'll have to keep an eye out
next year, see if it happens again.)
Still, when in Rome...
One thing that sucks about not living in one place all the time
(I've lived more than one year in four places so far — Geneva in
Switzerland, the South West of the UK, Oslo in Norway, and the bay
area) is that my friends are all spread all over the place. While I
keep in touch with all of them, mostly thanks to the magic of IRC (and
bitlbee, which makes even
proprietary IM networks look like IRC), I don't get to see any more
than about a quarter of my friends at any one time. And IRC isn't
exactly the same as seeing them.
Looks like I'll be in Europe for several weeks at a stretch next
year, though, what with the CSS working group F2F, X-Tech, and WWW2006
being back-to-back. May might be a travel-heavy month. Hopefully
I'll be able to see a bunch of friends and family at that time.
On Sunday, Pav, kerz and I set up kerz's yard. It's pretty
cool. Kerz and I aren't quite sure what the next step will be, though
we're probably going to get some more straight, and maybe some
elevation.
On Friday I went to see Pride & Prejudice. I'm afraid to say
I've never read the book, and frankly had been avoiding any exposure
to it (including the BBC TV series), but I figured it was time to
broaden my horizons (and the Harry Potter film was sold out). The Red
Dwarf scene with the tank and the lake makes slightly more sense to me
now.
It was a good story, much better than I expected, and once I was
used to the language it was fun to listen to the way they phrased
things. I wish I spoke more like that. It's so eloquent. It also seems
that the dating scene has changed quite a bit since that story's time
period. (Not that I understood how it really worked in the movie any
more than I understand how it's supposed to work today, but that's
another story altogether.)
Back at work I'm currently having difficulties
working out how to do menus in HTML5. I
don't like what's in the spec now. Lachlan made some
good suggestions on the list, though. I hate being stumped,
because it makes me run around in circles and makes me feel very
unproductive (although of course running around in circles thinking
through the consequences of possible designs is just as important as
speccing out a design in the first place).