Weblog Entry

Markup: Bulletproof XHTML

September 03, 2003

Evan Goer is a senior technical writer for Chordiant Software. He realized that web standards were important the day that Netscape 6 became available internally while he was working at Sun Microsystems, and he’s been recovering from the shock ever since. Evan’s main regret these days is that he has no sense of typography or graphic design whatsoever. Sooner or later he’s going to take some classes, damnit.

The vast majority of today’s “XHTML” websites are invalid. No, don’t take my word for it — you can verify this for yourself by running the following experiment:

  1. Collect a random group of websites that declare an XHTML doctype.
  2. Run the home page of each site through the W3C validator.
  3. If the home page validates, validate at least three random secondary pages.
  4. Observe monstrously high failure rate. Lie down with cold compress on forehead. [Optional]

A naive observer might find these results surprising. After all, by definition all XML must be well-formed: this is what allows you to parse it with efficient off-the-shelf XML parsers, transform it with a standard transformation language, describe and manipulate its tree-like structure with a standard API, and much more. Malformed XML forfeits all of these benefits. So why would anyone bother churning out pages upon pages of the stuff?

Why, indeed. Consider the humble browser parser (the browser component that is responsible for reading your markup). Modern browsers contain at least two parsers: one for XML and one or more for HTML. The lax HTML parser does its best to display pages no matter how mangled they are, while the strict XML parser chokes on the smallest error. Unlike its cousin, the XML parser is hard to trigger — you have to do something “special”, such as serving up an unusual MIME-type. Since invalid sites rarely bother to trigger the XML parser, their pages are parsed as HTML rather than XML. Thus we are protected from the vast wasteland of invalid XHTML out there, and the proprietors of these invalid sites are none the wiser.
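The difference between the two parsers can be seen outside the browser as well. The following is a minimal sketch using Python's standard library as a stand-in for a browser's strict and lax parsers; the helper names and the sample fragment are invented for illustration and are not how any browser is actually implemented.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient parser: records every start tag it sees, errors or not."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def tags_seen_by_html_parser(markup):
    """Return the start tags that a forgiving HTML parser recovers."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# Hypothetical fragment: <b> and <i> are closed in the wrong order.
broken = "<p>Hello <b><i>world</b></i></p>"

print(parses_as_xml(broken))             # the strict parser refuses the fragment
print(tags_seen_by_html_parser(broken))  # the lax parser carries on regardless
```

The strict parser rejects the whole fragment because of a single misnested pair, while the lenient one still reports every tag it found.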

But what’s so wrong with this picture, really? Let’s take a look at a real world example of what happens when XHTML is treated as XML rather than tag soup.

Canary in the Coal Mine

Jacques Distler had a problem: he wanted to share equations on string theory and quantum field theory with his fellow physicists on the web. Although Jacques had never needed to pay much attention to markup languages before, he did know how to write equations in a language called TeX. TeX wasn’t particularly web-friendly, but Jacques happened to have a tool that would convert TeX to a newfangled XML standard called MathML. MathML looked ideal for displaying equations on the web. How hard could it be to post a few equations?

Jacques discovered that the Mozilla browser could display MathML… if the MathML equations were embedded in a well-formed XML document. Fortunately, the W3C had thoughtfully provided an XML formulation of HTML called “XHTML 1.1” that allowed the embedding of inline MathML equations directly in web pages. Mozilla could display the equations if the page was valid “XHTML 1.1 plus MathML 2.0” and if the page triggered Mozilla’s internal XML parser. So Jacques dutifully constructed a valid template, configured his server to serve up his pages to Mozilla with the recommended MIME-type for XHTML, and voilà — the equations displayed beautifully!

However, Jacques soon discovered that he was living on a knife’s edge. Because he was using Mozilla’s unforgiving XML parser, one little mistake — a mismatched tag, an unescaped entity — would choke his visitor’s browser. And to his consternation, Jacques found that even if he wrote perfectly well-formed XHTML, other people were conspiring to mess up his web pages. By allowing comments, opening up trackbacks, and displaying snippets from alien RSS feeds, Jacques had opened up his site for any random visitor to crash that page with garbage markup. In order to produce 100% valid XHTML, Jacques realized that he had to “bulletproof” his site. Strip control characters. Validate comments. Batten the hatches. If he was going to take advantage of the power of XHTML, he would have to protect his site from his own mistakes and everyone else’s.
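As a rough illustration of what "strip control characters, validate comments" might look like in practice, here is a minimal Python sketch. It is not Jacques's actual setup; the function names are hypothetical, and it checks only well-formedness, not validity against the XHTML DTD.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Remove characters that can never appear in an XML 1.0 document."""
    return _CONTROL_CHARS.sub("", text)


def is_well_formed_fragment(fragment):
    """Check a fragment by wrapping it in a dummy root element.

    The wrapper lets a comment contain several sibling elements. A fragment
    that tries to close the wrapper early ends up with two roots and fails.
    """
    try:
        ET.fromstring("<div>" + fragment + "</div>")
        return True
    except ET.ParseError:
        return False


def accept_comment(raw):
    """Return the cleaned comment, or None if it would break the page."""
    cleaned = strip_control_chars(raw)
    return cleaned if is_well_formed_fragment(cleaned) else None


print(accept_comment("Nice <em>post</em>\x07!"))  # control character removed
print(accept_comment("AT&T <b>rocks"))            # rejected: bare & and unclosed <b>
```

A real site would also need to handle named entities such as &nbsp;, which this parser rejects because no DTD is loaded, and would still have to validate the assembled page.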

The X in XHTML

At first glance, the conversion from HTML to XHTML seems straightforward enough. Just change a couple of lines at the top of the page, set everything to lowercase, quote your attributes, close your tags, fix a few nits here and there… and presto! Forward compatibility achieved. Right?

Well, not really. It’s easy to do a one-off conversion to XHTML. The hard part is constructing a site so that it stays valid XML no matter what you or anyone else throws at it. Jacques had to go through this process, and if you’re going to actually use XHTML for anything, you will too.

As for whether XHTML is indeed critical for forward compatibility, that’s still an open question. For the sake of argument, let’s take the claim at face value. Consider the following statement:

The future of the Internet is XHTML.

If that statement is true, then an unpleasant truth follows: if an XHTML site isn’t bulletproof, then the site isn’t forward compatible. Again, the reason XHTML is “more advanced” and hence “more forward compatible” than HTML is that all-important “X”. XML’s inherent strictness is the key that enables new functionality in XHTML. Thus, if you violate this strictness and serve up tag-soup XHTML, you’ve accomplished nothing — you haven’t enabled your site to use any XHTML-specific features, either now or in the future. (Conversely, if the statement is false, then serving up tag-soup XHTML isn’t a disaster. It’s merely embarrassing.)

Note that when we speak of “forward compatibility”, we must be careful not to conflate “clean semantic markup”, “CSS layouts”, and “good accessibility” with XHTML itself. You can meet (or fail) all these goals whether you use XHTML or plain old HTML 4.01. The key technical question is: will the web of the future require functionality that is not present in HTML?

Strange Days

These are strange days for markup geeks. The good news is that we’ve emerged from the Dark Ages of web development into a Renaissance of sorts. A profusion of browsers have bloomed with reasonably good standards support, and even the current “baseline” browser can handle mid-level CSS layouts and DOM-based JavaScript.

And then there’s XHTML. Here we’re stuck back in the Dark Ages. There are very few uniquely-XHTML applications available today, aside from a few edge cases like Jacques Distler. And let’s face it, most of us aren’t physicists. It’s a rare website that really needs MathML.

This lack of interesting XHTML applications makes it frustratingly hard to understand what XHTML is all about — what it can do, what it requires of us. In fact, the whole thing is reminiscent of the state of CSS in 1998. You could read the CSS2 spec, but it was hard to imagine something like Fahrner Image Replacement when the tools of the day didn’t support basic CSS layouts in the first place. And even after the first tools became available (starting with IE5/Mac), it still took years for the community to come around to the idea that standards matter. Maintaining a 100% valid XHTML website requires a similar philosophical shift — no more cowboy hand-coding without validating every change, no more trusting alien content. Real XHTML is a whole new ballgame.

The lack of XHTML applications has a more insidious effect in that it raises the cost/benefit ratio for converting to XHTML. We can convert, but most of us won’t be doing anything with it — the benefit is low. As for the cost, that can be surprisingly high. For example, what’s the best way to deal with comments? Even if you manage to programmatically strip out all control characters and unescaped entities, you’re still faced with a tough decision:

  • Disable comments entirely?
  • Disallow markup in comments?
  • Allow markup, but force all users to submit valid XHTML comments?

The first two options solve the problem easily, but they restrict your site’s functionality. As for the third option, it simultaneously restricts and enhances your site’s functionality. Your comment usability suffers, because you can’t just dash off a comment and submit it in one easy step. However, your users can now respond in other flavors of XML (if you let them), such as MathML or SVG. If you have a highly specialized audience, this functionality could be critical.

In short, switching your site to XHTML is not a no-brainer, and the trade-offs and decisions only multiply when you get down into the details. Each designer must weigh these costs and benefits individually. There is no “right” answer.

So what about you, Mr. Author-of-this-article? A fair question. For me, the benefits are just too low. I don’t have a strong technical reason to switch to XHTML, and so I’m sticking with the technology that meets my needs today: HTML 4.01 Strict. Don’t get me wrong: I respect those brave souls out there, the trailblazers. I’m just not one of them. If I wait patiently, the “must-have” XHTML applications will arrive eventually (along with the toolsets to deploy them). This game’s only just started.


Reader Comments

Dave S. says:
September 03, 07h

One more note: Evan is the originator and maintainer of the X-Philes, a growing list of sites that have successfully passed a stringent list of criteria. With a few 1.0 Strict exceptions, everyone on the list is serving up valid XHTML 1.1.

MikeyC says:
September 03, 08h

“For me, the benefits are just too low. I don’t have a strong technical reason to switch to XHTML, and so I’m sticking with the technology that meets my needs today: HTML 4.01 Strict.”

Same here. I’ve gotten funny looks from some designers who are surprised to find that my site isn’t marked-up in XHTML, just because I don’t have a use for it in 2003. “But it’s so easy…you just have to quote your attributes, close your tags, etc..!!!” Like duh…I *know* that and, in fact, I do those things now.

So many designers are converting their sites to XHTML with absolutely no reason for doing so. I don’t have a problem with this, except if you are going to take the plunge at least do it right: make sure your pages validate and it wouldn’t hurt if you at least served the pages with the proper MIME type to those UserAgents that can understand it.

“what’s the best way to deal with comments?”

Just a thought: perhaps embed your comments page into your site through use of the object tag so that if someone “breaks” the comments page it doesn’t break the whole page. This probably isn’t the best solution, mind you…

rick says:
September 03, 09h

I guess I’ll need to work a little more on my site before I re-release it, then! This was a very well written article and I would like to thank you for writing it. :)

September 03, 11h

For some new sites I’m developing, I’ve chosen the "why not?" approach. (See )

The masochist in me enjoys having PHP set the mime header, then using Mozilla to test and build the site. Anything wrong, and it fails right there, without having to send it to for validation.

Then I get the comfort of knowing that JUST IN CASE there is some benefit of, or need for XHTML a few years down the road, my site is already ready to go.

I think it’s a fun challenge.

Keith says:
September 03, 11h

Great piece Evan. You make some excellent and very down to Earth points. I had started with a lengthy comment, but that turned into a post of my own. If interested read on here:

Can’t wait to see whose “voice” you have for us next, Dave.

Gary F says:
September 04, 01h

“As for the third option, it simultaneously restricts and enhances your site’s functionality. Your comment usability suffers, because you can’t just dash off a comment and submit it in one easy step.”

I have to disagree with this. It’s very possible (and easy) to have a system that you can “just dash off a comment and submit it in one easy step” and have it appear as XHTML. All the tools to protect a site are already written.

Bob says:
September 04, 02h

I don’t have the option of goofing around with the Apache mod_rewrite, and the only PHP option I’ve seen[1] still displays the Content-type as “text/html” rather than “application/xhtml+xml” …

Everything else I’ve read blithely assumes that I know exactly how to display the appropriate Content-type to the appropriate browsers. Sadly, I do not. Can someone point me in the right direction here? I seem to recall seeing a more thorough tutorial on content-type sniffing somewhere, but can’t find it now.
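For readers with the same question, the usual idea is to inspect the request's Accept header and send "application/xhtml+xml" only to browsers that explicitly list it. The following Python sketch shows that decision in isolation; the function names are hypothetical, the Accept parsing is deliberately simplified, and wiring it into a particular server or PHP setup is left out.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def parse_accept(header):
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q-value as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_content_type(accept_header):
    """Pick the XHTML type only when the client names it explicitly."""
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get(XHTML_TYPE, 0.0)
    html_q = prefs.get(HTML_TYPE, 0.0)
    # A bare "*/*" is not taken as evidence of XHTML support.
    if xhtml_q > 0.0 and xhtml_q >= html_q:
        return XHTML_TYPE
    return HTML_TYPE


print(choose_content_type("application/xhtml+xml,text/html;q=0.9,*/*;q=0.5"))
print(choose_content_type("image/gif, image/jpeg, */*"))
```

Because the response now depends on the Accept header, it is generally wise to send a "Vary: Accept" header along with it so that caches keep the two variants apart.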



Chris E. says:
September 04, 02h

For a comments section, you generally are going to want to allow users only a small subset of the html markup. If you allow only a small subset that means you’re going to have to parse the markup so that only that small subset works. Supposing that we are doing this, we can then just tighten up our filters on what we want to allow a user to enter into the comments section.

Honestly, I don’t think this is much more of a burden than anything else you would do for a web page with comments. If you’re not rolling your own comments interface then there is probably something out there that is easily hackable, and if you are rolling your own then you know what you need to do. I think that this particular problem is slightly overstated.
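A whitelist filter of the kind described here could be sketched as follows. This is an illustrative Python example rather than any particular weblog tool's filter: the allowed-tag set and the class name are assumptions, all attributes are dropped, and disallowed tags are discarded while their text is kept.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"em", "strong", "code", "blockquote"}  # hypothetical whitelist


class CommentFilter(HTMLParser):
    """Re-emit only whitelisted tags, always properly nested and closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append("<%s>" % tag)  # attributes are dropped entirely
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close any inner tags first so the nesting stays correct.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:  # close whatever the commenter left open
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def filter_comment(raw):
    comment_filter = CommentFilter()
    comment_filter.feed(raw)
    return comment_filter.result()


print(filter_comment('<em>hi <strong>there</em> & <a href="x">bye</a>'))
```

The lenient parser reads whatever the visitor typed, and the filter writes out a fresh, well-nested fragment instead of trying to patch the original string.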

Dave S. says:
September 04, 02h

Bob, and anyone else interested in proper MIME-type dishing can take a look at this thread from last month:

ASP and PHP code examples abound. Note that a) Smarty Pants royally screws the quotes, and b) this was before I enabled auto-linking of URIs. Let me know if any of those examples are too badly mangled, and I’ll go into the comment and try to rescue the code.

September 04, 02h

Great work!

“We can convert, but most of us won’t be doing anything with it — the benefit is low.”

Exactly! If it works, don’t fix it.

One thought though, as XHTML is no harder than HTML, and DOM based scripting is in fact quite easy, there’s no reason for NOT doing __NEW__ sites in XHTML.

Being a bit of a control-freak about output, I prefer using XML/XSLT to completely control ALL output sent to the browsers. This way I control the output, not the users. (I think that the developers of Apache httpd use the phrase “defensive programming”).

I start out by assuming that everyone uses Moz/Opera/Safari, and then afterwards cater to the majority of users that, sadly, still use IE. Just create another transformation for whatever buggy browser comes along. Eventually the “fix-errors” stylesheets/scripts/transformations becomes obsolete (not being served), without me having to do any work.

Whenever someone requires access to the site/application with a specific browser (IE5.0, NN4.x, PDA, whatever), creating a new transformation is easy.

This approach works very well for me, even in projects where the “target browser” is in fact IE6. The rewards are numerous, especially when working on “public sites” that have serious accessibility and cross-browser requirements.

September 04, 03h

Talking about the user submitted stuff like comments, etc. - I prefer using wiki style markup in comments, e.g. **text** for bold text, [[text]] for links, etc.; the script that parses the input text replaces html entities and substitutes the wiki markup to the valid (x)html markup. Of course, functionality is lost but why would one ever need that every visitor to one’s page can leave a comment full of html garbage?

One can strip comments of html tags for security reasons (f. ex. someone can use actually any tag and, when specified, with a style attribute render the page monstrous) but then extended functionality is lost as well.
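The wiki-style approach can be sketched in a few lines of Python. This is an assumption-laden illustration, not the commenter's script: it supports only `**bold**` and `[[url]]`, and it treats the text inside double brackets as an http or https URL, which is one possible reading of the "[[text]] for links" convention.

```python
import re
from html import escape

_BOLD = re.compile(r"\*\*(.+?)\*\*")
# URLs may not contain whitespace, "]" or "*", so the bold rule applied
# afterwards can never split a generated tag in two.
_LINK = re.compile(r"\[\[(https?://[^\s\]*]+)\]\]")


def wiki_to_xhtml(text):
    """Escape everything first, then turn wiki markup into XHTML tags."""
    safe = escape(text, quote=True)  # no raw <, >, & or quotes survive this
    safe = _LINK.sub(r'<a href="\1">\1</a>', safe)
    safe = _BOLD.sub(r"<strong>\1</strong>", safe)
    return "<p>" + safe + "</p>"


print(wiki_to_xhtml("**Hi** <b> & [[http://example.com/?a=1&b=2]]"))
```

Because every character of the visitor's input is escaped before any tag is generated, the only markup in the output is markup the script itself wrote.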

Michael says:
September 04, 03h

We have a mixed environment at work where we use XML/XSLT to output HTML. The transform can be done either client side or server side depending on who’s accessing the site and what security is set for the specific page. Recently we wanted to switch to XHTML for all of our output. Unfortunately we ran into some significant problems with the parsers in the form of tag minimisation.

Running output method XML on the client side (MSXML3) seemed to work fine. On the server using XALAN/XERCES was another story. It seems that a large problem with the server side parsers we’ve tried is the inability for the parser to LEAVE THE EMPTY TAGS ALONE despite their validity. So if I output an empty div “<div></div>”, which is perfectly valid, it gets minimised to “<div/>”. Likewise with script tags, textareas, and a host of others. This makes for unusable content on the browser end. We attempted to hack together our own solution and label it with the XSLT 2.0 output XHTML instead of XML, only to watch the client side parser cack on that.

So the problems essentially are how to:
1) Get the parser to leave empty tags alone.
2) Get the parsers to use output method XHTML on both sides (client/server).

We seemed to have no problems client side; it was only recently, when we switched to doing server side transforms, that we had to revert to plain HTML output. And I’ve seen hundreds of posts in the newsgroups from others having identical problems.

I know it’s important for my business to have valid XHTML output, but how to get there with our current toolset?
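One possible workaround for the minimised-tag problem is to post-process the serializer's output and re-expand every self-closed tag that is not a genuinely empty element. The sketch below is written in Python for illustration rather than in the Java toolchain described above; the function name is hypothetical, and the regular expression assumes that attribute values never contain "/>", "<" or ">".

```python
import re

# Elements that XHTML 1.0 Strict declares EMPTY; these may stay as "<br />".
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}

# A tag name, optional attributes, then "/>".
_SELF_CLOSED = re.compile(r"<([A-Za-z][\w:.-]*)((?:\s+[^<>]*?)?)\s*/>")


def expand_empty_tags(markup):
    """Rewrite <div/> as <div></div>, leaving true empty elements alone."""

    def fix(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID:
            return match.group(0)
        return "<%s%s></%s>" % (name, attrs, name)

    return _SELF_CLOSED.sub(fix, markup)


print(expand_empty_tags('<div/><br /><script type="text/javascript" src="a.js"/>'))
```

A regular expression is a blunt tool for this job. It would also rewrite matches inside comments or CDATA sections, so it is best run only on output whose shape is already known.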

MikeyC says:
September 04, 05h

“One thought though, as XHTML is no harder than HTML, and DOM based scripting is in fact quite easy, there’s no reason for NOT doing __NEW__ sites in XHTML.”

I disagree. I assume you are talking about XHTML with a text/html MIME type (malformed HTML in actuality), as doing XHTML *properly* is, in fact, a lot more challenging than HTML. One little error won’t break an HTML page. What does the DOM have to do with any of this? You can do DOM based scripting in HTML.

September 04, 06h

For the record: DOM based scripting is harder (than stuff like document.write), especially in XML.

All the difficulties when transforming to XHTML1.1:

MikeyC says:
September 04, 06h

Anne: “For the record: DOM based scripting is harder (than stuff like document.write), especially in XML.”

Which really just illustrates the point. Those who claim “XHTML is no harder than HTML” probably aren’t using it in anything but the most superficial of ways, and therefore reaping no actual benefits, but instead simply sending out malformed HTML. Which is especially ironic when they prominently display “Valid XHTML” buttons.

Dave S. says:
September 04, 07h

Michael - now, I’m out of my league here, but I’ve gone through a process similar, but not identical, to yours. The problem in my case was that my weblog tool (Movable Type) is configured to use XML entities (&#8220;, &#8212;, etc.) that, when copied and pasted into a comment, would come out as the literal character (“, —, etc.). This broke my validation.

My solution was using a regular expression plugin (hopefully your server-side language is capable of them without much tinkering) to search for the invalid characters and replace them with the entities. Theoretically, you could do the same on your server as each page is dished up (therefore, after your server-side parser has done its damage). Search for any string starting with “<” and ending with ” />”, minus the kosher <br />s and <hr />s etc., and replace at will.
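The character-to-entity replacement described here does not necessarily need a hand-written regular expression. As a minimal sketch in Python (not the Movable Type plugin itself; the function name is hypothetical), the standard "xmlcharrefreplace" error handler turns every non-ASCII character into a numeric character reference:

```python
def to_numeric_entities(text):
    """Replace each non-ASCII character with a reference such as &#8220;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Curly quotes and a long dash, written as escapes to keep this source ASCII.
sample = "\u201cHi\u201d \u2014 ok"
print(to_numeric_entities(sample))  # &#8220;Hi&#8221; &#8212; ok
```

This only addresses literal characters. Ampersands and angle brackets in the comment still have to be escaped separately.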

Dave S. says:
September 04, 07h

And as you can see by the blasted “& amp;#8220;” I’m still working on it.

Dave S. says:
September 04, 08h

Odd how these things come in waves. I stumbled across a complementary piece to this one that WaSP published today.

Bob says:
September 04, 08h

That’s the one I was looking for, Dave. Thanks! I even dug around in some of the previous posts here in my search, but that wasn’t one of them. ‘preciate it! :-)

Seamus says:
September 04, 08h

Your article could not have come at a better time because it was just this week I started playing around with MT and I have questions about valid XHTML comments with it. Because even if I disabled HTML input, they can still throw in an & to mess up your code.

Evan says:
September 04, 08h

Great comments so far, guys! I’ll just echo one of MikeyC’s points. Creating perfectly valid HTML 4.01 Strict is about as difficult as creating perfectly valid XHTML – after all, the validator mostly squawks about the same issues for both languages, doesn’t it? But of course the key difference is that the consequences of failure in XHTML are far more severe. Bulletproof HTML is a nice-to-have, while Bulletproof XHTML is a fundamental requirement.

Michael says:
September 04, 08h

Thanks Dave, that might be a valid workaround. We use Java so it should be simple enough to try. I’d just have to see what the time cost is for doing it.

Dave S. says:
September 04, 08h

Yeah, time costs are mostly irrelevant to me because MT does caching. Only when the file is saved does it have to run the list of regex’s - each page served is completely static. Perhaps there’s a further suggestion for you - caching.

September 06, 05h

A good idea for creating valid XHTML (e.g. without Tag Soup) is using XML / XSLT (e.g. with Tomcat) as the Basis (and that is “beautiful” as well).

Dejan says:
September 06, 06h

I just can’t find a reason compelling enough to give up HTML 4.01 in favour of XHTML (in any version). The arguments pushed by W3C seem to me all but empty. The first one, about future modular extensibility of XHTML, will stay a daydream for a long while; the second one, about reusing the same content for alternate user agents (WAP or else), simply doesn’t survive a reality test. I’ve done my fair part of coding WML and SMIL when there was still hope something useful would come out of them, and I can wholeheartedly state that, with both WML and SMIL being XML-derived languages, the user-agent compatibility issue has not advanced one bit (in the specific case of WML, it’s by far worse than for web browsers). The whole idea of keeping content in a flat (XML) file, by the way, falls on its nose as soon as you’ve got a quantity of stuff stored. What I actually do is keep the content in a MySQL database and send it to the client using user-agent specific PHP templates (read: Smarty). It actually works without a hitch even for RSS and SOAP, it’s a lot faster than keeping tons of stuff in flat files only to start each time that XML-XSLT-XHTML and what else code soup, and you get the additional benefit of searching through it at database speed.
I’ve found time and again one more lame reason pushed for switching: to begin learning XML-based languages… I can barely believe somebody actually published such nonsense. When you start with a new language it won’t be a lack of familiarity with the syntax that slows you, it will be the lack of readable documentation…

Michael says:
September 06, 09h

I think that XHTML is progress towards a goal. While not every platform and every tool handles XHTML perfectly, more and more tools and better support are emerging. Along with that come better documentation and examples of use.

It’s easy for some people to look at XHTML and see it as a failure. This typically occurs because they want everything NOW! Because everyone’s not using it it must be a failure right? Since IE has poor support and MS isn’t going to upgrade their browser to support it properly it must be doomed right? Wrong. XHTML Basic will become incorporated into WAP 2 replacing WML. Which brings up another sticking point for many people.

Many developers don’t understand that there are different specifications involved in that blanket term XHTML and that even as we get more people using XHTML 1.0/1.1 there’s also XHTML Basic, and then XHTML 2.0 on the horizon. So while XHTML speaks to a certain set of guidelines and tag use within that technology there are specialty modules that speak to specific devices when wanted. So essentially I could use the XHTML 1.1 to author a page that both a desktop user and (with care) a mobile phone user could use, but if I wanted to specifically target and tailor the view to the mobile user I’d use XHTML Basic. XHTML Basic is essentially XHTML 1 stripped of tags that cause problems in WAP devices (like frames and some other problematic tags).

jacob says:
September 08, 09h

Just a thought: You or I may not reap any overwhelming benefits to offering an XHTML site, but somebody else may — say, somebody who wants to screen scrape your site, and would prefer an XML parser to regular expressions. I think about the possible benefits of XHTML and elegant, ‘semantic’ markup every time I look at… (yeah, I know that various machine-readable weather services already exist. that’s just an example.)

Kiffin says:
September 09, 07h

There is a fine line between using given standards correctly and impacting too much the world around us (which is not yet ready for full compliance). You have to weigh carefully the benefits along with the disadvantages. Unfortunately, nowadays the decisions are still based on some mythical profit margin and/or maximal cost effectiveness which so often defy the imagination altogether. Yes, but…

jacob says:
September 09, 08h

Evan - right, that was my point. I want bulletproof XHTML to be commonplace. It seems like most of the discussions concerning the benefits of XHTML have concluded that there are no tangible benefits unless you happen to be Jacques Distler. That seems to me to focus exclusively on XHTML authors and not XHTML consumers. A little conscientious XHTML markup could go a long way towards the potential usefulness of a web site — and if you make it easy for your site to be parsed, who knows what some clever scripter will do with it?

Gary F says:
September 09, 08h

“The whole idea of keeping content in a flat (XML) file, by the way, falls on its nose as soon as you’ve got a quantity of stuff stored.”

Funny, because my site uses nothing but XML for storing content (in the file system). You just have to be intelligent about how you use it. After doing some projections, I can happily hold thousands of files with no degradation in speed. I suspect (although I haven’t tested it yet) that because of the way I’m storing these files, it can increase considerably before hitting minor speed problems. The same can be said of any storage method.

Evan says:
September 09, 09h

Michael - very interesting that XHTML Basic is replacing WML in WAP 2. Current support for XHTML Basic in the real world is absolutely wretched… but perhaps now this will be changing in the next few years? Most of the wireless device companies seem blissfully unaware of web standards, but you never know. Hmmm.

Jacob - sure, that’s another interesting use of the technology. However, your point only reinforces the need for bulletproof XHTML. If an XHTML site isn’t 100% valid all-the-time, scraping it with an XML parser is a waste of time. If you don’t know for *certain* that the site you’re scraping is bulletproof, a regex scraper is really the only sensible choice.
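The trade-off between the two scraping strategies can be sketched as follows. This is a hedged Python illustration with an invented function name and invented sample pages: it tries an XML parser first and falls back to a regular expression when the page turns out not to be well-formed.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(markup):
    """Return (title, method), where method is 'xml' or 'regex'."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Not well-formed: fall back to pattern matching on the raw text.
        match = _TITLE_RE.search(markup)
        return (match.group(1).strip() if match else None, "regex")
    node = root.find(".//%stitle" % XHTML_NS)
    if node is None:
        node = root.find(".//title")  # page without the XHTML namespace
    title = "".join(node.itertext()).strip() if node is not None else None
    return (title, "xml")


good = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Weather</title>'
        '</head><body><p>Sunny</p></body></html>')
bad = '<html><head><title>Weather</title></head><body><p>Sunny<br></body></html>'

print(page_title(good))
print(page_title(bad))
```

The fallback branch is the point being made here: a scraper that cannot count on well-formed pages has to carry the regex path anyway, so the XML path buys little.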

September 10, 01h

Well, Jacob, “Who knows what a clever scripter will do?” is a pretty thin reed on which to hang your hopes. A more cogent rationale is simply to say, “I want to use XHTML to learn how to bullet-proof it. That way, when a compelling application comes along, I will be *ready*.”

I’ve already found *my* compelling application. For you and the rest of the “X-Philes”, I hope yours comes along soon. And when it does, I hope y’all are ready to serve up the most feature-rich XHTML website possible, with absolutely bulletproof content.

The two are not mutually-exclusive. And your job is to figure out how to reconcile them.

jacob says:
September 11, 10h

It may seem like a thin reed to you, but I see it as a reflection of a justified hope in bottom-up innovation (clever random individuals who bring us things like rather than top-down innovation (W3C who brings us things like SOAP). There’s power, Jacques, wonder-working power, in the goodness and idealism and faith of the random scripting people.

Raj says:
September 11, 10h

Do you all mean to say that Zeldman is wrong? Even in his latest book, he says to start using XHTML and that it’s very simple to make that transition. What do you guys have to say about it? The reason I’m asking this is because I have started implementing his thoughts… stop me if I’m wrong.

September 11, 11h

There is *absolutely nothing* in Zeldman’s book, or in any of his work that could not be done in HTML 4.01. What he is preaching – semantically-clean markup, with CSS for styling – is not in any way related to the XML nature of XHTML.

He is, to use Mark Pilgrim’s phrase, selling “XHTML, the brand” rather than “XHTML, the technology.” He’d like you to associate those good coding habits with using lowercase tags and trailing slashes. If that works for you, fine.

He doesn’t expect you to (indeed, he warns against) send out XHTML with the right MIME type, or, indeed, to use any of the XMLish features of XHTML. When you get down to it, it doesn’t even *really* matter if your site validates.

In the end of the day, if you are going to send out XHTML tag soup, the transition (from sending out HTML tag soup) is, indeed, an easy one.

September 12, 12h

Hi there,

Great article. I thought I should add some feedback from a XHTML user. Like some other commenters, it turned itself into a post of its own on my blog. See to understand how XHTML has saved my life once, and how comments can be managed not to break validation (careful Microsoft employees, my solution contains chunks of free software that may hurt you ;-)

Leif says:
October 24, 02h

“Evan’s main regret these days is that he has no sense of typography or graphic design whatsoever.”

If you only ever read one book about typography, make it this one: The Elements of Typographic Style by Robert Bringhurst.