Evan Goer is a senior technical writer for Chordiant Software. He realized that web standards were important the day that Netscape 6 became available internally while working at Sun Microsystems, and he’s been recovering from the shock ever since. Evan’s main regret these days is that he has no sense of typography or graphic design whatsoever. Sooner or later he’s going to take some classes, damnit.
The vast majority of today’s “XHTML” websites are invalid. No, don’t take my word for it — you can verify this for yourself by running the following experiment:
- Collect a random group of websites that declare an XHTML doctype.
- Run the home page of each site through the W3C validator.
- If the home page validates, validate at least three random secondary pages.
- Observe monstrously high failure rate. Lie down with cold compress on forehead. [Optional]
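The first step of the experiment is easy to script. Here's a minimal sketch in Python; the regex and function name are my own invention, and the pattern matches only the public identifiers of the W3C's XHTML DTDs rather than every conceivable doctype:

```python
import re

# Matches the public identifier of an XHTML 1.0/1.1 doctype declaration.
# This pattern is an illustration, not an exhaustive check.
XHTML_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD XHTML',
    re.IGNORECASE,
)

def declares_xhtml(markup: str) -> bool:
    """Return True if the page claims an XHTML doctype."""
    return bool(XHTML_DOCTYPE.search(markup))

print(declares_xhtml(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
))  # True
print(declares_xhtml('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">'))  # False
```

Feed the result into the W3C validator and the monstrously high failure rate takes care of itself.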
A naive observer might find these results surprising. After all, by definition all XML must be well-formed: this is what allows you to parse it with efficient off-the-shelf XML parsers, transform it with a standard transformation language, describe and manipulate its tree-like structure with a standard API, and much more. Malformed XML forfeits these benefits entirely. So why would anyone bother churning out pages upon pages of the stuff?
Why, indeed. Consider the humble browser parser (the browser component that is responsible for reading your markup). Modern browsers contain at least two parsers: one for XML and one or more for HTML. The lax HTML parser does its best to display pages no matter how mangled they are, while the strict XML parser chokes on the smallest error. Unlike its cousin, the XML parser is hard to trigger — you have to do something “special”, such as serving up an unusual MIME-type. Since invalid sites rarely bother to trigger the XML parser, their pages are parsed as HTML rather than XML. Thus we are protected from the vast wasteland of invalid XHTML out there, and the proprietors of these invalid sites are none the wiser.
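The difference between the two parsing models is easy to demonstrate outside the browser. Here's a toy illustration in Python, using the standard library's parsers as stand-ins; this is not any browser's actual code:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A snippet with a classic tag-soup mistake: <b> is closed by </i>.
soup = "<p>Hello <b>world</i></p>"

# The lax HTML parser does its best and raises no error.
HTMLParser().feed(soup)

# The strict XML parser chokes on exactly the same input.
try:
    ET.fromstring(soup)
    well_formed = True
except ET.ParseError:
    well_formed = False

print(well_formed)  # False
```

Same markup, two completely different outcomes. Which parser your visitors get depends entirely on how you serve the page.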
But what’s so wrong with this picture, really? Let’s take a look at a real world example of what happens when XHTML is treated as XML rather than tag soup.
Canary in the Coal Mine
Jacques Distler had a problem: he wanted to share equations on string theory and quantum field theory with his fellow physicists on the web. Although Jacques had never needed to pay much attention to markup languages before, he did know how to write equations in a language called TeX. TeX wasn’t particularly web-friendly, but Jacques happened to have a tool that would convert TeX to a newfangled XML standard called MathML. MathML looked ideal for displaying equations on the web. How hard could it be to post a few equations?
Jacques discovered that the Mozilla browser could display MathML… if the MathML equations were embedded in a well-formed XML document. Fortunately, the W3C had thoughtfully provided an XML formulation of HTML called “XHTML 1.1” that allowed the embedding of inline MathML equations directly in web pages. Mozilla could display the equations if the page was valid “XHTML 1.1 plus MathML 2.0” and if the page triggered Mozilla’s internal XML parser. So Jacques dutifully constructed a valid template, configured his server to serve up his pages to Mozilla with the recommended MIME-type for XHTML, and voilà — the equations displayed beautifully!
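Serving the recommended MIME-type (application/xhtml+xml) only to browsers that can handle it is typically done by inspecting the request's Accept header. Here's a simplified sketch of that server-side decision in Python; the function name is invented, and a production version would also honor the header's q-values:

```python
def choose_content_type(accept_header: str) -> str:
    """Pick a MIME-type for an XHTML page based on the Accept header."""
    # Mozilla advertises application/xhtml+xml explicitly; send real
    # XHTML to those browsers and plain text/html to everyone else.
    if "application/xhtml+xml" in accept_header:
        return "application/xhtml+xml"
    return "text/html"

# A Mozilla-style Accept header:
print(choose_content_type(
    "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9"
))  # application/xhtml+xml

# A browser that never mentions application/xhtml+xml:
print(choose_content_type("text/html,image/gif,image/jpeg,*/*"))  # text/html
```

Only the first kind of request triggers the strict XML parser; everyone else falls back to tag-soup rendering.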
However, Jacques soon discovered that he was living on a knife’s edge. Because he was using Mozilla’s unforgiving XML parser, one little mistake — a mismatched tag, an unescaped entity — would choke his visitors’ browsers. And to his consternation, Jacques found that even if he wrote perfectly well-formed XHTML, other people were conspiring to mess up his web pages. By allowing comments, opening up trackbacks, and displaying snippets from alien RSS feeds, Jacques had opened up his site for any random visitor to break a page with garbage markup. In order to produce 100% valid XHTML, Jacques realized that he had to “bulletproof” his site. Strip control characters. Validate comments. Batten down the hatches. If he was going to take advantage of the power of XHTML, he would have to protect his site from his own mistakes and everyone else’s.
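The control-character step, at least, is mechanical. Here's a minimal sketch, assuming the goal is to drop every character that XML 1.0 forbids outright (everything below 0x20 except tab, line feed, and carriage return); the regex and function name are mine, not any particular plugin's:

```python
import re

# XML 1.0 forbids control characters below 0x20, except tab (\x09),
# line feed (\x0a), and carriage return (\x0d).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_control_chars(text: str) -> str:
    """Remove characters that would make an XML document ill-formed."""
    return CONTROL_CHARS.sub("", text)

print(strip_control_chars("tag\x00soup\x1b"))  # tagsoup
print(strip_control_chars("tabs\tand\nnewlines survive"))  # unchanged
```

One stray control character pasted into a comment is enough to take down a page served as XML, so a filter like this runs on every piece of untrusted input.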
The X in XHTML
At first glance, the conversion from HTML to XHTML seems straightforward enough. Just change a couple of lines at the top of the page, set everything to lowercase, quote your attributes, close your tags, fix a few nits here and there… and presto! Forward compatibility achieved. Right?
Well, not really. It’s easy to do a one-off conversion to XHTML. The hard part is constructing a site so that it stays valid XML no matter what you or anyone else throws at it. Jacques had to go through this process, and if you’re going to actually use XHTML for anything, you will too.
Whether XHTML is indeed critical for forward compatibility is still an open question, but let’s take the claim at face value for a moment. Consider the following statement:
The future of the Internet is XHTML.
If that statement is true, then an unpleasant truth follows: if an XHTML site isn’t bulletproof, then the site isn’t forward compatible. Again, the reason XHTML is “more advanced” and hence “more forward compatible” than HTML is that all-important “X”. XML’s inherent strictness is the key that enables new functionality in XHTML. Thus, if you violate this strictness and serve up tag-soup XHTML, you’ve accomplished nothing — you haven’t enabled your site to use any XHTML-specific features, either now or in the future. (Conversely, if the statement is false, then serving up tag-soup XHTML isn’t a disaster. It’s merely embarrassing.)
Note that when we speak of “forward compatibility”, we must be careful not to conflate “clean semantic markup”, “CSS layouts”, and “good accessibility” with XHTML itself. You can meet (or fail) all these goals whether you use XHTML or plain old HTML 4.01. The key technical question is: will the web of the future require functionality that is not present in HTML?
And then there’s XHTML. Here we’re stuck back in the Dark Ages. There are very few uniquely-XHTML applications available today, aside from a few edge cases like Jacques Distler’s. And let’s face it, most of us aren’t physicists. It’s a rare website that really needs MathML.
This lack of interesting XHTML applications makes it frustratingly hard to understand what XHTML is all about — what it can do, what it requires of us. In fact, the whole thing is reminiscent of the state of CSS in 1998. You could read the CSS2 spec, but it was hard to imagine something like Fahrner Image Replacement when the tools of the day didn’t support basic CSS layouts in the first place. And even after the first tools became available (starting with IE5/Mac), it still took years for the community to come around to the idea that standards matter. Maintaining a 100% valid XHTML website requires a similar philosophical shift — no more cowboy hand-coding without validating every change, no more trusting alien content. Real XHTML is a whole new ballgame.
The lack of XHTML applications has a more insidious effect in that it raises the cost/benefit ratio for converting to XHTML. We can convert, but most of us won’t be doing anything with it — the benefit is low. As for the cost, that can be surprisingly high. For example, what’s the best way to deal with comments? Even if you manage to programmatically strip out all control characters and unescaped entities, you’re still faced with a tough decision:
- Disable comments entirely?
- Disallow markup in comments?
- Allow markup, but force all users to submit valid XHTML comments?
The first two options solve the problem easily, but they restrict your site’s functionality. As for the third option, it simultaneously restricts and enhances your site’s functionality. Your comment usability suffers, because you can’t just dash off a comment and submit it in one easy step. However, your users can now respond in other flavors of XML (if you let them), such as MathML or SVG. If you have a highly specialized audience, this functionality could be critical.
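For the third option, the core check is well-formedness: wrap the submitted fragment in a dummy root element and see whether an XML parser accepts it. Here's a sketch of that idea in Python; the function name is hypothetical, and a real site would also validate against the XHTML DTD, not merely check well-formedness:

```python
import xml.etree.ElementTree as ET

def comment_is_well_formed(fragment: str) -> bool:
    """Accept a comment only if it parses as a well-formed XML fragment."""
    try:
        ET.fromstring("<div>" + fragment + "</div>")
        return True
    except ET.ParseError:
        return False

print(comment_is_well_formed("<em>Nice post!</em>"))  # True
print(comment_is_well_formed("Read this & weep"))     # False: unescaped ampersand
print(comment_is_well_formed("<b>oops</i>"))          # False: mismatched tags
```

Notice how easy it is to fail: a perfectly ordinary ampersand, typed in good faith, gets a comment rejected. That's the usability cost in a nutshell.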
In short, switching your site to XHTML is not a no-brainer, and the trade-offs and decisions only multiply when you get down into the details. Each designer must weigh these costs and benefits individually. There is no “right” answer.
So what about you, Mr. Author-of-this-article? A fair question. For me, the benefits are just too low. I don’t have a strong technical reason to switch to XHTML, and so I’m sticking with the technology that meets my needs today: HTML 4.01 Strict. Don’t get me wrong: I respect those brave souls out there, the trailblazers. I’m just not one of them. If I wait patiently, the “must-have” XHTML applications will arrive eventually (along with the toolsets to deploy them). This game’s only just started.
- W3C Favelets Page. In particular, the “Validate This Page” favelet (or “bookmarklet”) will speed up your validation and testing dramatically. Mozilla users might want to check out one of the many helpful toolbars out there, such as the PNH Developer Toolbar.
- Sending XHTML as text/html Considered Harmful. Essential reading for anyone making the leap from HTML soup to standards-compliant XHTML.
- The Road to XHTML 2.0: MIME Types. Mark Pilgrim’s XML.com article on XHTML 2.0 migration also discusses XHTML 1.1 MIME-type issues and includes generic PHP code, Python code, and mod_rewrite rules for achieving MIME-type negotiation.
- Jacques Distler: Musings. If you’re using Movable Type, Jacques’ site has all sorts of tidbits. In particular, see his posts on MIME-type negotiation, comment validation, and the StripControlChars plugin.
- MTValidate. A Movable Type plugin that validates (X)HTML. Those of you who don’t use Movable Type might find Agresticism.org’s VBScript validation approach to be useful.