Hixie's Natural Log

2002-11-25 22:32 UTC Why semantic markup is so important

On the one hand we have shaver finally breaking down and providing an RSS feed, while on the other we have Tantek describing the start of a rebellion against RSS. You see, people are finally catching on than XHTML has actual semantics (sound familiar?), and thus software could be doing a lot of the content aggregation work for us.

Some pointers for aggregator implementers then, if we're going to go with this XHTML-instead-of-RSS idea (which I think we should, despite my having two RSS feeds):

  1. Use rel="bookmark" like Tantek suggests. (I have that set up here now, by the way.)
  2. Treat the XHTML document as an outline, instead of a list of posts. I've seen many Web logs that are not simply post-post-post, but have much more hierarchical structures, like person-day-post-topic.
  3. Don't use class attributes, unless the XHTML working group define a set of normative classes (which wouldn't be a bad idea, actually). A key aspect of class attributes is that they are totally semantic-free: they mean nothing. A "rose main" class in a document could mean that the element is the most important aspect of the document as well as being one which has risen above the rest, or it could mean that it is the central part of Rose's speech, or it could mean that it is the part of the document representing the pink hand. All that you can assume is that it is a space separated list of author-defined tokens.
  4. Don't require content authors to constrain their markup habits — I am going to use relative links and other such features that require aggregator implementors to actually think before writing code. There is an XHTML specification, so XHTML UAs, and that includes these aggregators, are going to have to implement it.
  5. Do require valid content. Intelligence in UAs can be taken a long way, but as we have found with the tag soup mess that is text/html on the web today, the line should be drawn at coping with invalid markup.

It will be very interesting to see if this takes off.

Pingbacks: 1

2002-11-24 00:19 UTC Markup Challenge: aaronsw.com

Next up in the Markup Challenge is Aaron Swartz. Aaron is the HWG's representative on the RDF core working group! I could never understand RDF, and have a great respect for those who can.

I must once again remind you that while you read this site review, you should bear in mind that it is purposefully intensely pedantic. I'm coming from the view that complying to the spirit of the specifications ("theoretical accessibility") is more important than what the site is rendered in, in practice ("practical accessibility", if you will). This is not a completely "ivory tower" position; as Web browsers improve, standards-compliant content becomes more and more accessible and usable in different contexts, while non-compliant content becomes less and less useful.

Also, Aaron, like Mark, volunteered for this and asked me to be as pedantic as possible. So...


The first problem is a minor one given the state of our domain name space, but ".com" was supposed to be for companies, and this site is not a commercial site. Of course, I'm guilty of domain name polution myself, so maybe I shouldn't be so eager to raise this issue!


A great start: the document validates to a Strict DTD.

XHTML sent as text/html

All my readers are painfully aware of my position on XHTML being sent as text/html (or at least, that's the impression I get from all the e-mails I receive with the disclaimer "Yes, I know my site is sent as text/html, I'm so sorry, please don't shoot me, please"). So they would probably expect me to complain loudly here about how Aaron is doing the wrong thing, etc.

Well, no such luck. Aaron and I are currently in the midst of a discussion regarding the pros and cons of XHTML-as-text/html, and so it is possible that I may have my mind changed (as I did about application/xhtml+xml vs text/xml). Therefore, while I wait to see how that unfolds, my opinions on XHTML-as-text/html are on hold.

Correct MIME types

The RSS feed and the stylesheets are correctly sent with the right MIME types. Incidentally, as several people have pointed out to me, application/rss+xml is technically not a valid MIME type. Neither is the almost ubiquitous text/javascript. But if the standards communites are going to drag their feet defining obvious MIME types, the world will continue without them, sadly. There is an obsolete application/rss+xml MIME type registration ID, does anyone know why it didn't get moved to the standards track category (like text/xml) or to the informational categary (like text/html)?

No defined character set

The HTML files contain no encoding information, which means that a number of conflicting specifications (notably HTTP, MIME, RFC2854, HTML 4.01, XHTML 1.0 and XML) enter the game to try to determine which character set should be used to decode them, with the only clear conclusion being "guess". Luckily, Aaron only uses codepoints that are a common subset of US-ASCII, UTF-8, and ISO-8859-1, so the confusion doesn't cause any ambiguities.

The CSS and RSS files similarly have no explicit encoding information, but this is less of a problem. Only encodings that are a superset of US-ASCII may be used with text/css so if the file only contains US-ASCII characters the point is moot, and RSS files have well defined rules for determining the character set (the application/rss+xml specification points to section 3.2 of RFC 3023 which points to section 4.3.3 of the XML specification).

Setting colours without backgrounds

I had my default background colour set to a dark red (#BB0000, it makes a nice background) so I didn't realise that there was a header on the pages — they are coloured #b00, the same colour! My user stylesheet looked like this:

:root { background: #BB0000; color: white; }
:link { color: yellow; background: transparent; }
:visited { color: orange; background: transparent; }

This stylesheet also made it very hard read to read the lines giving the dates of when the articles were posted.

While I applaud the idea of leaving most of the colours to user, you have to be careful to always give colours and backgrounds together, so as to avoid this kind of clash. (When setting a colour to transparent, make sure to then set all colours, so as to prevent further clashes.)

<div id="banner">

As with Mark's <div id="logo">, this element appears to be there purely for stylistic reasons and doesn't seem to add anything to the structore of the document. As such, it should be removed.

The contents of the element are in the form "title" "subtitle", which isn't very well handled by XHTML1, unfortunately. A better solution would be either to use both <h1> and <h2> in series, or, preferably in my opinion, to use a single <h1> with the important part emphasised with an <em> element or some such. This problem was recently discussed in www-html, hopefully the HTML WG will notice and give us a subheading element, or at least explicitly state how to mark up subheadings. This is a problem I often hit.

<div class="content"><div id="main">

At least one of those <div>s is redundant, if not both.

<h2 class="title">

That's a tautology: the class isn't adding anything.

alt="H4X0r ECONOMIST: make economy"

That alternate text is more like a title than alternate text. A much better alternate text would be "H4X0R ECONOMIST. lol GPL wh4Tev3r make economy. Alan Greenspan scowls. The Supreme Court finds that, owing to GPL compliance issues raised in the case Free Software Federation v. Greenspan, our nation must henceforth be known as the United Nations of GNUmerica. William Rhenquist has a serious expression. One week later: Richard Stallman is happy: YES", which conveys roughly the same as the comic. A longdesc to a description of the comic's typography and layout would be useful too.

<table class="invisible">

While I think a table is actually a not unreasonable element to use for the semantics here (two paragraphs being compared), the class is. "invisible" is quite clearly a presentational semantic. A better class would have been "comparative alegory", for example.

<div class="img"><img src="http://www.aaronsw.com/2002/gelernterNov7-1" alt="A file cabinet being thrown out the window" /><br />Jon Keegan, New York Times</div>

Here we see several problems back to back. First, why a <div>? This is a paragraph, not a section. Secondly, the poor alternate text. The image is purely decorative as far as I can tell: if I was reading this story to someone, the image would not convey any additional information. Appropriate alternate text is therefore probably "". Next, the <br> element. There is nothing inherent about that paragraph that semantically requires a line break: that the image and the text appear on different lines is purely presentational. Finally, the caption, which I presume is a citation, should be marked up using a <cite> element.


That class is redundant, given that the calendar has an ID.

Lists that don't use list markup

There's a list of entries in the "Archives" section of the menu sidebar, but it is delimited by <br> elements instead of using one of HTML's list elements. This same problem is repeated in several places, e.g. "Feeds I Read".

More presentational <br>

In the "What I'm Doing" section there are some very blatent cases of <br>s that have absolutely no semantic purpose. Most should be removed, or turned into paragraphs or lists.

alt="spread the dot" ... spread the dot

The image doesn't seem to actually be conveying anything, certainly not "spread the dot" since that text is immediately repeated after the dot. Maybe alt="" would be better?

<div class="footer"> <address>

According to Dan Connolly, the <address> element is a general footer element (unfortunately I can't find a reference to the e-mail thread in which we discussed this; maybe I am misremebering what he said), so that would make the <div> element here redundant. Unfortunately the HTML4 and XHTML2 specifications don't really back that up, so maybe a footer <div> is best. In any case, the copyright notice in the footer should be in a paragraph.

The main errors I would be concerned about are the inappropriate <br> and poor alternate texts, as they will be affecting accessibility today, but the other errors are not that minor either. I liked the stylesheets in general, as they tried to give the users the final word on most issues. The lengths are given in relative units, which is always good.

Overall, not quite as good an effort as Mark's, but still respectable. I look forward to seeing whether Aaron fixes all the errors as Mark did!

The next site I'll be examining is Mike Shaver's Web log. I don't know how many more of these I'm going to do, it depends largely on how long I can keep doing them without either getting bored or before I run out of new errors (not much point reporting the same errors over and over again). Thanks, by the way, to all the kind people who have been picking my own site to pieces by e-mail... I'm glad to say that don't think there are any outstanding issues.

2002-11-21 20:27 UTC Tag Soup: How UAs handle <x> <y> </x> </y>

HTML user agents have to be able to cope with invalid markup, such as unclosed tags, tags closing in the wrong order, and tags where they aren't allowed, if they are to render the existing Web. Rendering the existing Web is rather critical, because if you fail to do so, no user will adopt you. (A Web browser that can only load Dive Into Mark and the W3C site isn't much good to anyone.)

Unfortunately, the HTML specification does not define how to handle invalid markup. (XHTML does, because it uses XML, which goes to great lengths to define how to handle invalid markup. This is one of the best features of XHTML as far as most Web weenies are concerned — it forces pages to be syntactically correct!) Because it is undefined, Web browsers have each had to invent their own way of handling invalid content, while all trying to get the effects that are similar enough that users will think all is fine.

Let's take an example of invalid markup:

  <p>This is a sample test document.</p>
  a <em> b <address> c </em> d </address> e

How would you represent this in the DOM? This is not a trivial question. The DOM was designed to cope with well-formed documents, it has no facility for coping with elements that are half in another and half out of it. (And nor should it — after all, such documents are invalid.)

WinIE 6 tries to faithfully represent what the author wrote, to the point of making the DOM itself ill-formed, as described below. (Note that whitespace nodes and the text node child of the P element have been ignored for simplicity.)

The BODY element has five children: P, a, EM, d, and 

e. EM has two children, b and ADDRESS. ADDRESS has two children: c and d. P, a, EM, and d are siblings, b, 

ADDRESS, and e are siblings. c has no siblings. d is the child of two nodes (BODY and ADDRESS), but considers 

ADDRESS to be its parent.

This DOM quite close to what the author wrote — e is indeed a sibling of the ADDRESS element while being a child of the BODY element, and d is indeed a sibling of the EM element while being a child of the ADDRESS element. That d is a child of the BODY is, I think, an artifact of IE trying to get the second half of the ADDRESS element to be under the BODY while the first half is under the EM.

This DOM is probably showing us a lot more about the internals of Trident (WinIE's layout engine) than was intended. An implementation that internally uses a tree (which is basically what you need to correctly do CSS2) would be hard pressed to come up with a DOM like this.

Mozilla 1.2, on the other hand, tries to get the same effect, but without deviating from the rules of the DOM, namely that it has to be a tree:

The BODY element has five children: P, a, EM, 

ADDRESS, and e. This EM has one child, b. ADDRESS has two children: another EM, and d. This second EM has a 

child c. P, a, EM, ADDRESS, and e are siblings, b, EM, and d are siblings. c has no siblings.

The main feature of this treatement is that it has two EM nodes. Mozilla reaches the ADDRESS start tag and realises that EM elements cannot contain ADDRESS elements, so it closes the EM and reopens it inside the ADDRESS. Except in certain edge cases (like borders, explicit inheritance, and which selectors match which elements), the result of styling using CSS would be the same as in IE.

Opera 7 Beta has yet another interpretation. It attempts a mixture of the Mozilla and IE attempts: it tries to keep the DOM valid while not letting any of the elements in the document map to more than one node in the DOM:

The BODY element has four children: P, a, EM, and e. 

EM two children, b, and ADDRESS. ADDRESS also has two children: c, and d. P, a, EM, and e are siblings, b and 

ADDRESS are siblings, and c and d are siblings.

The basic principle at work here, it appears, is that the markup is fixed up by delaying any closing tags until after all other open elements have been closed, and no attempt is made to make the DOM follow the HTML DTD. So in this case, the ADDRESS element keeps the EM element open until its end tag. Opera's DOM is not the full story, though, as even though Opera puts the two text nodes (c and d) under the same element in the DOM, it still styles only the first text node (c) as if it was in the EM. (To some extent. It appears that some styles are propagated, and not others.)

The advantage of the techniques used by IE and Opera is that it makes it easier to cope with styling and scripting invalid markup. If you use the DOM to dynamically alter the EM element in IE's case, for instance, it'll happily affect the element throughout, around both b and c. In Mozilla's case, an attempt to change the EM element would only affect one of the parts at a time, so for example adding a border around the first EM would not put a border around text node c. Opera achieves the one-to-one mapping of markup to element as well, but doesn't restrict the EM to the text nodes that it contains in the markup.

The approaches used by Mozilla and Opera, though, get you a much more stable DOM. This is important for scripting: if you try to walk IE's DOM, you are likely to hit an infinite loop, because walking up the chain of parents for d (namely d → ADDRESS → EM) and then going to the next sibling will bring you straight back to d.

Mozilla's candid nature (what you see in the DOM is exactly what it's going to style) makes interpreting its results a lot easier. Opera's approach (providing a DOM but styling a slightly different model) is a lot more confusing.

The net result is that each model has its advantages and disadvantages, and they are about equally matched. And since HTML leaves this undefined, all of them are correct.

If you are interested in examining this further, I based this article on the results I obtained using a client side DOM browser I wrote and my legacy HTML parsing test 004 (which is not really a test, since there's no "correct behaviour"). That test also throws styling into the mix (I touched on this above). Amusingly, if you compare Mozilla's behaviour on tests 004 and 005 you find an obscure bug that has nothing to do with the markup being invalid (the colour changes even though the only difference is that 'font-variant' has been changed to 'font-weight').

Pingbacks: 1 2 3 4 5 6

2002-11-19 10:07 UTC data:text/plain;base64,SSBsb3ZlIGRhdGE6IFVSSXM=

Recently I've been using data: URIs quite a bit, so I went ahead and created myself a little data: URI kitchen.

This thing is great. However, I think I may have gotten a little carried away...

Pingbacks: 1 2

2002-11-19 08:27 UTC Markup Challenge: diveintomark.org (follow-up)

Woah. Mark fixed almost every single issue I raised! He's even serving his site up as application/xhtml+xml now! Way to go!