Tag soup is history

HTML5's biggest and most important feature has gone unnoticed by many outside of browser vendor circles.

HTML5 has been hyped for so long now that it sometimes feels like it has looped through the hype cycle several times. We’ve all seen the hype around the media tags and their associated codec scandal. Then we had the hype around the semantic elements, some of which are practical, some of which aren’t (coughhgroupcough). Then there’s the canvas and SVG hoopla. And a bit of excitement around the form features. Offline seems to be diving into the trough. Web dev circles are full of talk about HTML5 and related tech.

But none of those things, fun as they might be, are HTML5’s most important feature.

Which is that HTML5 brings a new parsing model to the table. For the first time we have a standardised compromise between the overly strict behaviour of XML and the unordered mess that is tag soup.

The competitors

XML parsing

Almost everybody is familiar with how XML parsing behaves and almost nobody gets it right.

The model is simple: the parser croaks and gives up as soon as it encounters an error, no matter how tiny and stupid it is.

The benefits are obvious: implementation is simple and can be efficient.

The downside is equally obvious: it makes for an incredibly brittle technology that is uniquely fragile for something that’s supposed to serve as the core of how the web works. It’s like being an ‘intelligent designer’ and deciding to make a dinosaur’s backbone out of chalk and marshmallows.
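To make the give-up-at-the-first-error behaviour concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name parse_xml and the sample strings are made up for illustration. The only point is that a single unclosed tag is enough to lose the whole document.

```python
import xml.etree.ElementTree as ET


def parse_xml(text):
    """Return (root_tag, None) on success or (None, error_message) on failure.

    parse_xml is a hypothetical helper written for this illustration.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        # The XML parser stops at the first well-formedness error;
        # nothing of the document is returned.
        return None, str(err)
    return root.tag, None


well_formed = "<p>Hello <b>world</b></p>"
broken = "<p>Hello <b>world</p>"  # the <b> element is never closed

print(parse_xml(well_formed))  # parses, root element is 'p'
print(parse_xml(broken))       # no tree at all, only an error message
```

Note that the broken string differs from the good one by a single missing end tag, yet the caller gets no partial result to work with.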

(Of course, XML has other downsides in terms of its basic design, some of which are shared across most markup languages, but I’m limiting the scope here to the parsing model.)

Tag soup parsing

For most of the history of the web, HTML has been treated as tag soup: an unstructured mess of tags and strings. How it’s implemented varies, but the end result is that browsers have been able to take almost any old string, no matter how insane its structure is, and render it as a web page.

The downsides are numerous and obvious:

In short: tag soup is evil. It makes everybody’s life difficult, all to accommodate a few lazy idiots who for some reason couldn’t stick to writing valid HTML.

(I could have gone into the shudder-inducing nightmare that was doctype switching, but fortunately most people don’t have to worry about that anymore. Be glad.)

So, it’s simple, go for XML, right?

No. XML sucks. It just sucks less than tag soup. All markup languages suck. They are complicated, hard to use, and far removed from how authoring and writing processes actually work.

XML parsing is also completely incompatible with the entirety of existing web content. Choosing XHTML for your technology stack means that you are, inevitably and inarguably, abandoning compatibility with existing web content. You can’t pitch your technology as, say, an archive format or transfer format for web content (to use a completely hypothetical example that is in no way referring to any existing piece of technology) if it’s incompatible with the entirety of the web.

You can’t claim to be compatible when your technology, as specified, objectively isn’t. But, unfortunately, this is something people keep doing. Even those who should know better.

Like most things in the web tech stack, HTML5 is a mess and quite a few of the messy parts are actually new inventions that none of HTML5’s predecessors suffered from. HTML5’s security model is a mess, the media elements are useless unless you standardise the codecs as well, the semantics of many of the new elements go completely counter to standard practice, the outline model directly contradicts best practice for accessibility, and most of the features for offline web apps are simply not fit for purpose.

But what it did right was that it defined a new parsing model, with clearly defined error handling and fallbacks, and a more forgiving syntax than XML can hope to offer. It makes life simpler for authors and the creators of documents in exchange for a little bit more effort on the part of the browser vendors.

(Which has a downside, of course, as a forgiving syntax means that escaping strings to be safe for inclusion in HTML is more complicated, but that’s another story for another time. And most of those problems go away if you always make sure to put quotes around your attribute values.)
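The point about quoting attribute values can be sketched in a few lines of Python. This is an illustration under assumptions, not a complete escaping strategy: quoted_attr, attributes_of, and the sample input are hypothetical names invented here, and the standard library's html.parser is used only as a convenient lenient tokenizer to show how the markup gets split into attributes.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects the (name, value) attribute pairs of every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def attributes_of(markup):
    """Hypothetical helper: list the attributes a lenient parser sees."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs


def quoted_attr(name, value):
    """Hypothetical helper: render name="value" with the value escaped."""
    return f'{name}="{escape(value, quote=True)}"'


untrusted = "x onmouseover=alert(1)"

# Unquoted: the space ends the value, so the rest becomes a new attribute.
unquoted = f"<a title={untrusted}>link</a>"

# Quoted and escaped: the whole string stays inside the title attribute.
quoted = f"<a {quoted_attr('title', untrusted)}>link</a>"

print(attributes_of(unquoted))
print(attributes_of(quoted))
```

With quotes in place, the only characters that can break out of the value are the quote character itself and the ampersand, which is what escape(..., quote=True) takes care of.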

HTML5 isn’t tag soup. It’s structured. What’s more, you can use the HTML5 parsing model to parse every single HTML document in existence so far and get a predictable response in all compatible parser implementations. That, in and of itself, is a huge benefit.
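As a rough contrast with the XML model, the sketch below feeds malformed markup to Python's standard html.parser. A loud caveat: html.parser is only a lenient tokenizer, not an implementation of the HTML5 tree construction algorithm, so it shows the "never give up on the document" half of the story and not the spec's defined error recovery. The EventLog class and tokenize helper are names invented for this example.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Records tag and text events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(markup):
    """Hypothetical helper: return the event list for a markup string."""
    log = EventLog()
    log.feed(markup)
    log.close()  # flush any text still buffered
    return log.events


broken = "<p>Hello <b>world</p>"  # <b> is never closed
events = tokenize(broken)

# No exception is raised; every tag and all of the text is still reported.
for event in events:
    print(event)
```

A parser that follows the HTML5 specification goes further than this: it also defines what tree the unclosed element produces, which is what makes the result predictable across implementations.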

What it’s missing is the XML extensibility model, but that hasn’t prevented extension technologies like RDFa Lite or ARIA from being integrated into HTML5.

So, for a quick rundown of what we get with HTML5 parsing:

So, it’s the best thing since choco-mocha ice cream, right?

Well, yes. It doesn’t fix everything that’s wrong with HTML but it does comprehensively solve one thing that was wrong with markup as it has been used on the web.

If you are putting together a stack of technologies to address a use case or a market you should always choose HTML5 with its parsing model over XHTML5 with the XML parsing model. The only exceptions are if you want to deliberately make things difficult with a less forgiving parsing model, you don’t want compatibility with existing web content, or your extensibility requirements are so complicated that using RDFa or Microdata to extend HTML5 doesn’t address them.

In which case I’d argue you are probably making a hideous mistake and are introducing too much complexity and fragility into your platform.

If you don’t use the HTML5 parsing model, you don’t get to claim to be compatible with existing web content or pretend to be a portable archive format for the web (to use another completely hypothetical example that is not in any way based on existing formats or technologies).

And if you, to continue this completely made up fictional example that bears no resemblance to any real-world technology, combine XHTML with a large bundle of non-web extensions and then fork the CSS rendering model, then you end up with something that is almost entirely incompatible with the web in every single practical way. You won’t be able to share content, tools, or processes.

Not that anybody would be silly enough to do that in real life, since this is an utterly hypothetical example. Honest.

TL;DR: Use HTML5 and not XHTML5. Extend it with RDFa Lite or Microdata if you have to have extensions. Keep it simple. Don’t invent things. People who describe HTML5 parsing as ‘tag soup’ are either lying or incompetent.