HTML is too complex

(This is the ninth post in a series on the publishing industry’s new product categories.)

The syntax of HTML and XML—angle brackets and closing elements—isn’t complex. It’s tedious, but it isn’t complex. If the problem lay in the basic syntax we’d have an easy time fixing it. The problem with markup complexity lies in the underlying model. Or, in the lack of one. Simply put, HTML is a mess.

This is from an email Matthew Thomas sent to the WHATWG mailing list (which was at the time responsible for the development of HTML5) almost ten years ago. Everything it says is still true:

In response to the proposal that HTML5 add a host of semantic elements, each with no default rendering to distinguish it from other elements, Matthew predicted the following:

The A-list of Web developers will begin using all the elements correctly on their Weblogs, and they will feel good about it.

A greater number of Web developers will never use most of these elements, but they will replace all occurrences of <div> on their pages with <section> because it’s more “semantic” (just like they did with <em> for <i> and <strong> for <b>), and they will feel good about it.

The vast majority of article producers (Weblogs and online newspapers) will never use <article>, because there’s no visual or behavioral benefit from doing so. So <article> will never become a reliable way of dissecting or aggregating pages.

The number of knowledgable HTML authors, the proportion of HTML pages that are valid, and therefore the overall usefulness of the Web, will be less than it otherwise would have been because of HTML’s increased complexity.

I’d argue that his prediction, ten years ago, was pretty much spot on:

The A-list rewrote their own sites to use fancy HTML5 semantic elements, then wrote books, presented talks, and sold workshops to teach people how to do the same.
The hangers-on and wannabes try a bit but don’t use any of the elements except maybe header and footer, and possibly article, once that was blessed as a generic sort of standalone content container in place of section. Most of the elements are regularly used incorrectly.
The vast majority don’t use any of the semantic elements unless by accident, such as through a thoughtless copy-paste.
The only reason the proportion of valid HTML files has increased is that HTML5 retroactively blessed invalid files as valid, provided they wear the HTML5 doctype.

The web remains too unstructured for article to become a reliable way of ‘dissecting or aggregating pages’ as originally envisioned. The HTML5 outlining algorithm isn’t used by anybody (except the A-list gurus) and, even worse, is supported by very few browsers or screen readers.

As Matthew Thomas mentioned in the email above, unless there is an immediate visual or behavioural benefit to using an element, most people will ignore it. This is compounded by the angle-brackets mess of HTML. By completely separating design (CSS), behaviour (JS), and structure (HTML) the specification gods have taken away the context that would make it easier for us mere mortals to give our documents a meaningful structure.

That’s without getting into the problems with the syntax itself.

While the separation makes using HTML for documents and ebooks more difficult, it is essential to HTML becoming an app platform, which is obviously now the web’s primary purpose.

(Most websites today are just web apps for delivering ads. They certainly aren’t made with readability in mind.)

There was a long period of time when the markup of most websites was unreadable because they used a mess of nested table tags to render the site. The markup was meaningless and complex. For a few years after that, though, viewing the source of your average website would show relatively clean and nicely structured markup that most people could understand, even without specific knowledge of HTML. Google’s web crawlers loved simple, well-structured documents and so the web filled with them.

Now we’re back to seeing almost the same level of complexity and messiness in most web pages as we saw in the worst days of table-hacking. The semantic elements from HTML5 are largely unused. Those that are used, such as <header> and <footer>, are used incorrectly because people misunderstand what they mean. Every page is riddled with div elements with opaque classes and IDs, nested in a document structure that is more complex than many I saw in the table-layout days.

This escalating complexity is arguably one of the biggest ongoing issues in web development because it makes things like authorship, search engines, discoverability, and automation more difficult than they should be.
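One rough way to put numbers on that kind of messiness is to count div elements against semantic elements and to measure how deeply the markup nests. What follows is only a sketch: the names `MarkupStats` and `markup_stats`, the choice of which elements count as "semantic", and the sample markup are all assumptions made for illustration. It also assumes reasonably well-formed markup, since unclosed elements would inflate the depth figure.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

# An illustrative subset of the structural elements HTML5 introduced.
HTML5_SEMANTIC = {"article", "aside", "footer", "header", "nav", "section"}


class MarkupStats(HTMLParser):
    """Counts elements and tracks the deepest nesting level seen."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag not in VOID:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1


def markup_stats(html):
    """Return div count, semantic-element count, and maximum nesting depth."""
    parser = MarkupStats()
    parser.feed(html)
    parser.close()
    return {
        "divs": parser.counts["div"],
        "semantic": sum(parser.counts[tag] for tag in HTML5_SEMANTIC),
        "max_depth": parser.max_depth,
    }


# Hypothetical sample: three nested divs around a paragraph, plus a footer.
SAMPLE = (
    "<div id='a'><div class='b'><div class='c'>"
    "<p>Hi<br>there</p>"
    "</div></div></div><footer>x</footer>"
)

print(markup_stats(SAMPLE))
```

Numbers like these say nothing about whether the elements are used correctly, which is the harder problem, but they do make the div-to-semantic ratio of a page easy to see.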

You see, if the markup you assign to a piece of content has a specific meaning, you can write code that’s aware of this meaning. You make human meaning machine readable. This is useful if you want to make the text more searchable or if you want blind people to be able to hear it with their screenreaders. If the markup is too complex (both the underlying model and the markup syntax) to use properly, the humans won’t be able to do the markup properly, making the content’s meaning machine-opaque again. HTML5 has a big problem with markup complexity where even A-list developers have spent countless hours debating what the various new semantic elements actually mean.

Hint: They don’t mean what most of us assume they mean. Section, Article, Footer, Header: all of them differ in meaning from what we’d assume from existing practice or a basic understanding of English.
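To make the idea of machine-readable meaning concrete, here is a minimal sketch using Python's standard-library `html.parser`. It pulls the text out of every `<article>` element on a page. The class name `ArticleExtractor`, the helper `extract_articles`, and both sample documents are made up for this illustration; a real aggregator would need to handle far messier input.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the text found inside each top-level <article> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <article> elements we are inside
        self.articles = []   # one text string per top-level <article>
        self._chunks = []    # text pieces of the article being read

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.articles.append(" ".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside an <article>, skipping pure whitespace.
        if self.depth > 0 and data.strip():
            self._chunks.append(data.strip())


def extract_articles(html):
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.articles


# Hypothetical page that uses the semantic elements.
SEMANTIC = """
<header><h1>Site name</h1></header>
<article><h2>First post</h2><p>Hello there.</p></article>
<article><h2>Second post</h2><p>More text.</p></article>
<footer>Copyright</footer>
"""

# The same content as div soup with opaque class names.
DIV_SOUP = """
<div class="c1"><div class="t">Site name</div></div>
<div class="c7"><div class="t2">First post</div><div>Hello there.</div></div>
<div class="c7"><div class="t2">Second post</div><div>More text.</div></div>
<div class="c9">Copyright</div>
"""

print(extract_articles(SEMANTIC))  # ['First post Hello there.', 'Second post More text.']
print(extract_articles(DIV_SOUP))  # []
```

The second call returns nothing even though the content is identical. The code can only find the posts when the author marked them up with an element whose meaning it knows, and it would be equally lost if `<article>` were wrapped around the wrong thing.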

HTML5 is itself complex. Most developers can’t or won’t put in the effort to properly mark up their content semantically. EPUB3 and its ilk add even more complexity, more ‘semantic’ elements and attributes, all of them even more difficult to understand and harder to explain than the basic new semantic elements of HTML5.

Badly implemented complexity, such as in HTML5 and EPUB3, means we get all the pain and difficulty of escalating complexity, but with few of the benefits. Unfortunately, these are formats whose limitations we have to work around and surpass. They are a disadvantage for both the web and the ebook industry. One of the tasks publishing has ahead is to try to neutralise that disadvantage.