HTML 5: Good News / Bad News

Written on Thursday, July 16th, 2009 at 1445 by David
Filed under General Information.

Given the amount of discussion swirling around HTML 5, I had to put in my 2¢ worth. Here are my areas of concern:

Rulebooks

As I understand it, here’s the situation with the current markup languages:

HTML 4 is markup that conforms to the rules of SGML. Because there are optional closing tags, you need an external guide of some sort–specifically, a DTD–to find out if a document is valid.

XHTML 1.0 is the same set of elements and attributes, written to conform to the rules of XML. You can tell if a stand-alone document is well-formed because XML requires all opening tags to have closing tags. You still need some sort of external guide, written in Relax NG or XML Schema, to tell if the document is valid. (See Wikipedia for a discussion of well-formedness and validity.)
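
To make the well-formedness half of that concrete, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree module as the XML parser, and the helper name is_well_formed is made up for this illustration. It only checks well-formedness; validity would still need a schema or DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as XML, i.e. is well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Every opening tag has a matching close (or is self-closed).
closed = "<p>Every tag is closed.<br/></p>"
# The <br> is opened but never closed, which XML does not allow.
unclosed = "<p>The br tag is never closed.<br></p>"

print(is_well_formed(closed))    # True
print(is_well_formed(unclosed))  # False
```

No external guide is consulted here: the parser can decide from the document alone, which is exactly what is not possible with optional closing tags.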

HTML 5 is a set of elements and attributes, written to conform to…well…according to the spec, “This specification defines an abstract language for describing documents and applications, and some APIs for interacting with in-memory representations of resources that use this language….There are various concrete syntaxes that can be used to transmit resources that use this abstract language, two of which are defined in this specification.” The two syntaxes are HTML 5 and XHTML 5.

The good news: I like the idea of an abstract language that can be transmitted in a variety of formats; that’s a clever, flexible approach. The bad news: there is no rulebook. The FAQ says that the HTML 5 “syntax is inspired by the SGML syntax…, bits of XML…, and reality of deployed content on the Web.” There’s no DTD, so, in effect, the spec is the rulebook. This gives HTML 5 an “ad hoc” feel, no matter how well it may have been designed.

Things are better on the XHTML side; since it’s XML based, you can check documents for well-formedness, but there’s no schema for the XHTML serialization. I’d feel a lot better if such a schema were part of the specification.

Note: This doesn’t mean you can’t validate your documents; there is a validator, but its XHTML schema isn’t “official.”

Bad Markup

The good news: the HTML 5 specification tells exactly how to handle improperly nested elements or nesting of flow content inside of phrasing content (for example, <i><h1>Oops</h1></i>). This means that compliant browsers will give consistent results when handed bad markup.
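
As a rough illustration of the kind of nesting error meant here, the sketch below flags flow content that appears inside phrasing content. It is not the specification's repair algorithm and it does not do what a browser does; it only reports the problem. The two element sets are small, hand-picked subsets chosen for the example, and the names NestingChecker and find_nesting_problems are hypothetical. It uses Python's standard html.parser module, which is a tolerant tokenizer rather than a conforming HTML 5 parser.

```python
from html.parser import HTMLParser

# Illustrative subsets only, not the full categories from the specification.
PHRASING_ONLY = {"i", "b", "em", "strong", "span"}
FLOW_ONLY = {"h1", "h2", "p", "div", "ul", "table"}


class NestingChecker(HTMLParser):
    """Record (phrasing ancestor, flow element) pairs that are badly nested."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in FLOW_ONLY:
            # Report the outermost phrasing element that wraps this flow element.
            for ancestor in self.open_tags:
                if ancestor in PHRASING_ONLY:
                    self.problems.append((ancestor, tag))
                    break
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_nesting_problems(markup: str) -> list:
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_nesting_problems("<i><h1>Oops</h1></i>"))  # [('i', 'h1')]
print(find_nesting_problems("<h1><i>Fine</i></h1>"))  # []
```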

The bad news: the specification doesn’t encourage sloppy markup, but by specifying that user agents will fix it, it doesn’t discourage bad markup, either. The penalty for bad markup is paid on the client side when the browser has to spend time repairing the markup.

HTML 5 vs. XHTML 5

The bad news: as I understand it, the plan is to move people to the HTML 5 syntax. The FAQ says that “the trailing slash syntax has been permitted on void elements in HTML in order to ease migration from XHTML 1.0 to HTML5.” But I don’t want to go to the neither-fish-nor-fowl HTML 5 syntax.

The good news: there is a drop-in replacement for the XML parser that will let me use tools like XSLT with HTML5, but given the widespread use of XML tools, why not encourage XHTML 5 instead?
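
The trailing-slash allowance quoted from the FAQ can be seen in a small experiment. This is only a sketch under assumptions: it uses Python's standard xml.etree.ElementTree and html.parser modules as stand-ins for an XML parser and a tolerant HTML parser (neither is the drop-in parser mentioned above, and html.parser is not a conforming HTML 5 parser), and the class name TagCollector is invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of start tags in document order."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# A void element written with the XHTML-style trailing slash.
fragment = "<p>First line<br />second line</p>"

# The XML parser accepts it because <br /> is a self-closed element.
root = ET.fromstring(fragment)
xml_tags = [element.tag for element in root.iter()]

# The HTML parser accepts the same bytes and sees the same elements.
collector = TagCollector()
collector.feed(fragment)
collector.close()

print(xml_tags)              # ['p', 'br']
print(collector.start_tags)  # ['p', 'br']
```

Both parsers report the same two elements, which is what makes it possible to write one document that serves either syntax.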

The Janus Markup

Just as the mythological Janus looked both backwards and forwards, so does HTML 5. The good news: parts of HTML 5 are very forward-looking; specifically the <canvas> element, which is very open-ended. The bad news: many of the other new elements are designed to more efficiently represent the web as it is today. But in three years, when HTML 5 will be at candidate recommendation status, what will the web’s needs be, and what will a web site look like? (Consider: what did web sites look like three years ago?) See this site for an excellent discussion of forwards-looking versus backwards-looking specifications.

Summary

The good news: HTML 5 has once again sparked people’s interest in the direction that markup should take. The bad news: I think HTML 5 is not the right direction. Your mileage may vary.

7 Responses to “HTML 5: Good News / Bad News”

  1. zeldman Says:

    As always, you cut to the heart of an issue and explain it so anyone can understand.

  2. Sam Howat Says:

    I was actually just complaining at how the pissing matches that are going on regarding html 5 are taking away from the valid points of discussion. Thanks for this post!

  3. river brandon Says:

    i think that’s the clearest explanation of the fundamental differences between html and xhtml that i’ve ever read. and it lays out the path being chosen in a very concise manner.

    i find it impossible to disagree with you, and i hope your voice, along with others, can help to change the course towards that of xhtml 5. rules (constraints) are good, and help us to be our best. so much the better when we know what they are.

    thank you.

  4. Ms2ger Says:

    “the spec is the rulebook”: Well, yes, isn’t that what a spec is for? In the end, if a DTD is being used, the spec still defines the DTD. On the other hand, usually the DTD describes the syntax only partially. For example, the target attribute in HTML 4 contains “CDATA” according to the DTD, but the spec further limits the contents to strings that begin with an alphabetic character—but there’s no way to know that from the DTD. So why should the criteria for a document be put in two different places rather than one?

    “Why not encourage XHTML 5 instead?”: Easy. IE doesn’t support it. There is no such thing as XHTML 5 sent as text/html, because that is HTML 5 by definition.

  5. David Says:

    Ms2ger: having a machine-readable schema (and Relax NG would let you specify that the target attribute requires an initial alpha, BTW) gives you the capability to automate validation. If WHATWG doesn’t provide a schema, creating one is left to people outside the group, and whatever schema comes from them might not have the official WHATWG “seal of approval.”

    As for IE, well, it’s IE. What can I say?

  6. HTML 5 Weekly Review #3 | Jeff Siarto Says:

    [...] HTML 5: Good News/Bad News If you’re still a little confused and overwhelmed with everything that’s going on with HTML 5, XHTML and future of the web, David at ODF Tools has written a post that may help clear things up. [...]

  7. Lars Gunther Says:

    I am sorry to say it, but there is a lot of misinformation in this article and some of the comments.

    #1. HTML <= 4 has never been an implementation of SGML where it counts in practice – in browsers. HTML 5 just makes this explicit. And believe me, you would not like to allow all true SGML syntax like <span/content/

    Let me duplicate that if you strip tags:

    <span/content/

    And when Mozilla supported SGML comment rules, and nobody else did, it was just plain confusing.

    Don’t blame the HTML 5 spec for saying what is true. Blame every browser vendor for never having implemented SGML parsing. And blame every developer for having produced markup that would break or having produced CMSs that would be breached if true SGML parsing had been allowed.

    #2: There is a rulebook. There are conformance rules. They are just not expressed in a DTD. The rules are actually so detailed that a DTD is insufficient. As are RelaxNG and XML Schema.

    There is an unofficial validator, primarily based on relaxNG, but with added Java code as well.

    There will also be an official validator, for both serializations. It’s just not there yet, since the spec is still being worked on.

    #3: XHTML is part of the spec. In fact the true differences between HTML and XHTML have never before been so clearly explained. The parsing rules for XHTML are equally detailed.

    http://wiki.whatwg.org/wiki/HTML_vs._XHTML

    #4: Browsers have always handled bad markup. Nothing new about that, except that they will do it the same way and won’t have to reverse engineer each other. The spec is very clear that author requirements (being conformant) and browser requirements (being tolerant) are two different things.

    Are you actually suggesting that browsers should stop being tolerant and become unable to show 90 % of all content that already is out on the Internet? Nobody would use such a browser!

    The spec very clearly discourages bad markup. In fact it goes to even greater length than the current spec does in saying what is good practice and the new conformance rules are harder to meet than the old validation rules.

    #5: Why use HTML instead of XHTML? How about the fact that every major JavaScript library and about 99 % of all minor ones can’t handle XHTML. How about the fact that MSIE can’t? How about the fact that most stylesheets will break?

    The W3C standards have always mandated that even though text/html is not prohibited for XHTML documents, they should (as in rfc-language, are 100 % required to) be able to work when sent out as true XHTML as well. But guess what? 99 % of all supposedly XHTML pages break. Even if they manage to keep out all malformed UTF-8 and forbidden characters (most don’t), even if they manage to nest all tags correctly, keep tag names and attributes lowercase and correctly quote all attribute values (most don’t), even if they correctly avoid all loose ampersands (most don’t), oh yes, and even if they avoid all named entities that are not part of the XML specification – they still will break, since they will look like shite when the CSS malfunctions, and they will not react properly to events when the scripts do the same.

    But if you like, there is nothing in the spec that forces you not to write polyglot documents. I probably will. One reason is to leverage current XML tools server side. Even Henri Sivonen does that! When the new parser becomes ubiquitous the need will abate though, since just about every XML tool will work equally well on HTML.

    See this discussion for some more myth busting:

    http://itpastorn.blogspot.com/2009/06/validation-and-doctype-myths-and.html

    And this one for a realistic appreciation of the necessity of being backwards compatible:

    http://itpastorn.blogspot.com/2009/07/no-backwards-compatibility-xhtml-2-was.html
