Composing Good HTML

Note: This document is available as both a single document (suitable for printing) and a multi-part document (more appropriate to hypertext). There is also a PostScript version available via FTP, at jupiter.willamette.edu, as /outgoing/jtilton/strict-html.ps. These multiple views are automatically generated with a Perl script called "multiview".

The current edition of this document is available online at http://www.willamette.edu/html-composition/strict-html.html.

Introduction

As the Web continues to explode in its own inimitable fashion, it is becoming more and more important to write HTML that conforms to certain guidelines. Specifically, given the current diversity of clients for the Web (and we can only expect to see more!), HTML should look good on any client, and not just on the specific client which the author happens to have access to.

To that end, there are a few solutions. One approach is this one -- documents which point out common errors one might make in the composition of HTML. The other approach is software based -- a "lint"-like program for catching semantic errors in HTML, and perhaps even correcting them.

The thing to bear in mind is that, if you follow these guidelines, your document may not look as good as it possibly could on a particular browser. However, it also will not look ugly on any browser, which is the risk you take by disregarding these recommendations and tweaking your HTML for, say, Mosaic. Unfortunately, Mosaic may render things differently from Lynx, which may render things differently from TkWWW, etc, etc, etc. These guidelines, in essence, should ensure the best fit across the space of all possible browsers, if you get my drift.

This document does not purport to be a style guide, or a beginner's manual to HTML. Fine documents already exist for these purposes.

(Note: This document is fairly stable, but still open to amendment. Please feel free to comment on that which is missing, wrong, right, or silly. Especially, please point out anywhere that I don't follow my own guidelines -- I'll slink back and fix it, I promise! Thanks to everyone who's already done so!)


Good Practices

Things contained in this section are good practices for the generation of any HTML document. Specifically, this would include anything which should routinely be done in the creation of documents for the benefit of both reader and author.

Signing Documents, and Time-Stamps

It is a good idea to sign and date all documents served on the Web, so that people viewing the documents can form some impression of the authority of the document (i.e. how recent it is, and how reliable the information provider is). For example, this document has been signed.

Also, when dating a document, try to avoid ambiguous formats. For example, both the month/day/year and day/month/year formats are used on the Web -- so is "4/2/94" April 2 or February 4? A solution to this is to use the name of the month (or an abbreviation).
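
As a rough sketch, a signature at the foot of a page might look like the following. The name, address, and date are placeholders rather than anything taken from a real page; the only point is that the month is spelled out.

```html
<HR>
<ADDRESS>
<!-- Placeholder author and date; the month name removes the 4/2/94 ambiguity -->
A. N. Author (author@some.site.org), last modified April 2, 1994
</ADDRESS>
```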

Finally, the best way to sign a document is to include a LINK element of type "made" in your HEAD element. For example:

<HEAD>
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>

For an example, look at the HTML source of this document.

Common Errors

This section details common errors in HTML composition that may lead to documents which are not fully device-independent. The behavior resulting from these errors is undefined, so certain browsers may render them as intended, but not all browsers are guaranteed to do so. Therefore, these mistakes should be avoided, even if your browser of choice renders your documents correctly.

Paragraph Break Errors

This is probably the most prevalent kind of error, and is the number one culprit in cases of ugly HTML rendering. If you fix nothing else, fix these! Perhaps the biggest misconception about the <P> element is that it signals an end-of-paragraph, rather than a paragraph break. According to the specification, "<P> is used between two pieces of text which otherwise would be flowed together".

In most cases this is not important -- functionally, the <P> serves as an end-of-paragraph marker. However, in certain contexts, use of <P> should be avoided, such as directly before or after any other element which already implies a paragraph break. To wit, the <P> element should not be placed either before or after the headings, HR, ADDRESS, BLOCKQUOTE, or PRE. (Can I get a ruling on HR? People don't handle it consistently... X Mosaic has no white space before or after, and Lynx appears to put white space after.)

It should also not be placed immediately before or after a list element of any stripe. That is, a <P> should not be used to mark the end-of-text for <LI>, <DT> or <DD>. These elements already imply paragraph breaks.
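
A small sketch of the difference, using made-up content. The first fragment adds <P> where a paragraph break is already implied; the second uses <P> only between two runs of text that would otherwise be flowed together.

```html
<!-- Avoid: P next to elements that already imply a paragraph break -->
<H2>Ingredients</H2>
<P>
<UL>
<LI>Two eggs<P>
<LI>One cup of flour<P>
</UL>
<P>
<HR>

<!-- Prefer: P only between two pieces of running text -->
<H2>Ingredients</H2>
<UL>
<LI>Two eggs
<LI>One cup of flour
</UL>
<HR>
Beat the eggs first.
<P>
Then fold in the flour.
```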

Caveats

Some clarifications on the above might be in order. One is the difficulty a browser has in rendering appropriate white space. While it is true that all of the elements mentioned above imply a paragraph break, this only occasionally means that they also imply white space between sections -- this depends on the browser. So, while you might feel inclined to add a <P> in order to fix white space problems, please think twice and avoid it if you can.

Also, when using the glossary list (DL), please try to avoid using multiple DD's (definitions of terms) in order to provide multiple entries for a term (DT). Instead, use a <P> marker between paragraphs in a definition. The use of a DD (definition) without a matching DT (term) is illegal, although a DT without a DD can be used without dire consequences.
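
The glossary list advice might be sketched as follows. The term and its definition text are invented for illustration.

```html
<!-- Avoid: a second DD used just to start a new paragraph -->
<DL>
<DT>Euphonium
<DD>A brass instrument with a mellow tone.
<DD>It often carries the tenor line in a band.
</DL>

<!-- Prefer: one DD per term, with P between its paragraphs -->
<DL>
<DT>Euphonium
<DD>A brass instrument with a mellow tone.
<P>
It often carries the tenor line in a band.
</DL>
```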

All clear now?

Character and Entity Reference Errors

Simply put, a character reference and an entity reference are ways to represent information that might otherwise be interpreted as a markup tag. For instance, in order to represent <P> in this text, I had to use &lt;P&gt; in my raw HTML. There are currently five entities for this purpose in HTML, as well as several entities which allow encoding of the ISO Latin-1 Character Set.

The most common error in the use of references is to leave off the trailing semicolon. Also, no additional spaces are needed before or after the entity/character reference.
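
For instance, the two mistakes above and their corrections might look like this in raw HTML. The sentences themselves are made up; the references used (&lt;, &gt;, &amp;, &eacute;, and the numeric form &#233;) are standard ones.

```html
<!-- Avoid: missing semicolons and stray spaces around the references -->
Use the &lt P &gt element sparingly at the caf &eacute .

<!-- Prefer: each reference ends in a semicolon and touches its neighbours -->
Use the &lt;P&gt; element sparingly at the caf&eacute;.

<!-- A character reference gives the same accented letter by number -->
Use the &lt;P&gt; element sparingly at the caf&#233;.
```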

URL Errors

Another misunderstood aspect of HTML is the composition of URLs.

Directory Reference Errors

One grey area involves references to directories. It is possible to request an index of a directory from an HTTP server. The typical response from the server is to either return a pregenerated index document (which is often the document "index.html" in the referenced directory), or to construct an HTML document on the fly which contains a listing of all files in the directory. However, when making such a directory reference, it is important to make sure to have a trailing slash on the URL. That is, if you were to request the index of the directory which this document resides in, you would want to refer to it as http://www.willamette.edu/html-composition/, not as http://www.willamette.edu/html-composition.

Some servers are able to catch these errors, and provide redirection to the proper URL, but it's best to get the URL right in the first place -- notably because not all browsers support transparent redirection.
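
In a link, the two forms would look like this (the link text is only a placeholder):

```html
<!-- Avoid: directory reference without the trailing slash -->
<A HREF="http://www.willamette.edu/html-composition">HTML composition documents</A>

<!-- Prefer: the trailing slash marks this as a directory -->
<A HREF="http://www.willamette.edu/html-composition/">HTML composition documents</A>
```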

Not Using Fully Qualified Domain Names

Problems can arise when the hostnames in URLs aren't fully qualified. In local networks, you can usually refer to your own machines simply by their names -- for instance, here at Willamette we refer to our local WWW server as "www". However, the server's FQDN (fully qualified domain name) is "www.willamette.edu". The FQDN provides enough information that any host, anywhere on the Internet, can find this particular machine. (It's like trying to find all the Vermeers in New York :).

What happens is that an HTML author might construct a link that looks like this:

<A HREF="http://www/~jtilton/metanoia/">Metanoia -- A Change In Spirit</A>

which produces a link to Metanoia -- A Change In Spirit that will only work for people on the same local network as that machine. A correct link would look like this, instead:

<A HREF="http://www.willamette.edu/~jtilton/metanoia/">Metanoia</A>

which would allow all of you who are interested in Metanoia to actually follow the link.

This leads almost directly into:

Improper Use of Relative URLs

Finally, a brief section on relative URLs. It is possible to construct a "relative" URL, and doing so has its advantages. However, relative URLs can also break things.

A relative URL is a URL which doesn't contain all the necessary parts of a "full" URL (scheme, host, path information). There's a large number of things which might fit this description! The browser will try to assume the parts that have been "left out" by using the information from the URL of the document which contains the link. However, not all browsers will make these assumptions in the same way. Here's a short list of what's "safe" and "unsafe" (based on experience, and not on a specification anywhere -- unfortunately).

Safe: Same directory relative URLs
A reference to a document in the same logical directory (such as <A HREF="strict-html-gp.html">Good Practices</A>) is safe. This kind of reference, roughly speaking, contains no "/"'s.
Safe: Same server relative URLs
A reference to a document in the same server (such as <A HREF="/~jtilton/">Eric's Hyplan</A>) is also safe. This kind of reference, roughly speaking, will begin with a "/". (It will also be semi-absolute, in that it starts at the top of that server's directory structure...)
Unclear: Most other kinds of relative URLs
References such as <A HREF="~jtilton/euphonium.html"></A> can be dangerous -- sometimes browsers will interpret that as meaning "go up one directory level, find the directory '~jtilton', and then find 'euphonium.html' in it." And sometimes they won't.

Currently, I don't understand this problem well enough to speak about it. I will try and get a canonical answer when next I have the energy to update this document.

Unsafe: "file://localhost/..."
It's also possible to have a reference to "file://localhost/some/file/pathname". What this does is reference the named file on the local host of whoever is browsing the document. This is why a reference to <A HREF="file://localhost/etc/motd"></A> will display the message of the day on your machine, not the message of the day on my machine. Unless you know what you are doing, these references will really mess up your documents.
(This sub-section isn't written very well, I fear. If anyone has any better copy, I'll gladly put it here instead. -et/April 7, 1994)
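
To make the two "safe" forms concrete, here is a sketch that assumes the containing document lives at the address given at the top of this document. The resolved addresses in the comments follow the usual rules for filling in the missing parts from the containing document's URL; as noted above, not every browser is guaranteed to agree.

```html
<!-- Assumed location of the document containing these links:
     http://www.willamette.edu/html-composition/strict-html.html -->

<!-- Same directory: should resolve to
     http://www.willamette.edu/html-composition/strict-html-gp.html -->
<A HREF="strict-html-gp.html">Good Practices</A>

<!-- Same server: should resolve to http://www.willamette.edu/~jtilton/ -->
<A HREF="/~jtilton/">Eric's Hyplan</A>

<!-- When in doubt, spell out the full URL -->
<A HREF="http://www.willamette.edu/~jtilton/euphonium.html">the euphonium</A>
```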

Missing Quotes in Start Tags

One common error that I used to make all the time (I use Marc Andreessen's html-mode.el for Emacs these days -- I had to learn Emacs, but now it's so much easier to write HTML!) was to leave off a quote in my start tags. For example, this reference to the euphonium, king of instruments should look like:

<A HREF="http://www.willamette.edu/~jtilton/euphonium.html">

but I would often use

<A HREF="http://www.willamette.edu/~jtilton/euphonium.html>

instead. I suppose by the end of that huge URL, I'd forgotten it was supposed to be quoted. The behaviour of browsers upon encountering this varies -- some display a proper link, but you can't follow it, while others actually eat up huge portions of the following text, thinking it to be part of the URL.

Missed End Tags

Many of the HTML elements contain information within them. For example, <em>emphasized text</em> would be rendered as emphasized text. There is a start tag (<EM>), some content (which may include text, and in some cases, other nested elements), and an end tag (</EM>, indicated by the </). A common mistake is to miss the / in the end tag. All elements (except empty elements, see next paragraph) must be terminated by an end tag -- otherwise, undefined behavior may occur.

Some HTML elements may be empty, such as <P> and <HR> (CERN provides an extended discussion on element content). If this is the case, there is no need for an end tag.
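
A short illustration of the missed-slash mistake, with invented text:

```html
<!-- Avoid: the / is missing, so a second EM starts instead of the first one ending -->
This is <EM>important<EM> and this is not.

<!-- Prefer: start tag, content, end tag -->
This is <EM>important</EM> and this is not.

<!-- Empty elements such as HR take no end tag -->
<HR>
```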

Things to Avoid

This section concentrates on mistakes in HTML authoring that are more problems of aesthetics than problems of device-independence.

Mixing HEAD and BODY Elements

The section on HTML elements in the HTML specification indicates that HTML documents should not mix those elements which belong in the HEAD of a document with those which belong in the BODY. The urgency of this suggestion is unclear, but it does make a certain amount of common sense for readability of HTML code, and for conformance with possible future browsers which may not support the mixing of these elements. Essentially, mixing them loses serious style points >=).
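
A minimal skeleton that keeps the two groups apart might look like this. The title, heading, and address are placeholders.

```html
<HTML>
<HEAD>
<!-- Information about the document goes here -->
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>
<BODY>
<!-- Everything meant to be rendered goes here -->
<H1>This is my Heading</H1>
The text of the document.
</BODY>
</HTML>
```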

Using White Space Around Element Tags

In general, the use of white space around element tags should be avoided. If white space immediately follows a start tag, for example, the style changes implied by that element may be applied to the initial space, as well. For instance, <A HREF="http://www.willamette.edu/~jtilton/"> CZeCh THIZ 0uT </A> would be rendered as CZeCh THIZ 0uT . On some browsers, there may be white space around the anchor, which adds unwanted unsightliness to the rendering, and may lessen the impact of the document. (This comment really applies to white space immediately following start tags, and immediately preceding end tags).
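
Side by side, the form to avoid and the preferred form:

```html
<!-- Avoid: spaces inside the anchor may be rendered as part of the link -->
Visit <A HREF="http://www.willamette.edu/~jtilton/"> Eric's Hyplan </A>.

<!-- Prefer: keep the spaces outside the tags -->
Visit <A HREF="http://www.willamette.edu/~jtilton/">Eric's Hyplan</A>.
```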

Heading Usage

The HTML specification points out that a heading should not be more than one level below the heading which preceded it. That is, <H3> should not follow <H1>, etc.

Also, it is pointed out that "a heading element implies all the font changes, paragraph breaks before and after, and white space (for example) necessary to render the heading". Extra highlighting elements are discouraged, therefore.
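
Both heading recommendations can be sketched together, here using this document's own section names as sample text:

```html
<!-- Avoid: H3 directly under H1, and extra highlighting inside a heading -->
<H1>Composing Good HTML</H1>
<H3><B>Paragraph Break Errors</B></H3>

<!-- Prefer: step down one level at a time and let the heading style itself -->
<H1>Composing Good HTML</H1>
<H2>Common Errors</H2>
<H3>Paragraph Break Errors</H3>
```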

Meaningless Link Text

When creating documents, make sure that your links are meaningful -- that is, that they avoid online-specific references, and that they don't detract from readability. The text of your links should flow well in the context of the rest of your text (especially avoid the click here syndrome!), and your text should also be able to stand alone as a printable document.

In other words, avoid using sentences like, "You can find out more information about cows by clicking here". (This is also bad because it refers to "clicking", which assumes that everyone is using a mouse with their browser!) A much better alternative is "More information about cows is available."
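
In raw HTML the two versions might look like this. The address is a made-up placeholder, not a real page about cows.

```html
<!-- Avoid: the link text says nothing about its target -->
You can find out more information about cows by clicking
<A HREF="http://some.site.org/cows.html">here</A>.

<!-- Prefer: the link text describes its target and still reads well in print -->
<A HREF="http://some.site.org/cows.html">More information about cows</A>
is available.
```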

Physical vs. Logical Character Emphasis

Since HTML (and also SGML) is designed to be a device independent language for describing the content of documents, most of the elements within it aren't intended to give direct control to the author over how the final page layout will look. The major exceptions to this are in the character highlighting elements.

There are two types of character highlighting elements -- physical and logical. The physical styles involve things like "italic font", "boldface", etc; while the logical styles are things like "emphasis", "citation", "strong", etc. It is strongly recommended that you employ the logical styles rather than the physical styles in your documents. Using <I></I> to render text in italics will only be effective on those browsers which are capable of displaying italics -- which not all browsers are guaranteed to be. It is far better to encode semantic content -- to describe things in terms of logical styles -- and then allow the browser to display that semantic structure as best it can, given its display capabilities.

So, instead of

<I>italics</I>

you might use <EM>emphasized</EM>, or a <CITE>citation</CITE>, and instead of

<B>bold</B>

you might use <STRONG>strong</STRONG>.

This also leaves the possibilities open in the future for more sophisticated uses of these semantic renderings, which have much more inherent meaning than font styles like bold or italic.

(Unfortunately, the jury is still out to lunch on this one. One argument against logical character styles is that it turns out to be a bottomless pit, attempting to define logical styles for every possibility. Physical styles, combined with the context of the text in which they are placed, seem to provide a much richer set without a huge number of tags. Oh, well. Use logical styles when you can, though.)

Deprecated and Obsolete Elements

This section lists elements of HTML whose use should be avoided, whether because the element is now obsolete, or because the element is being deprecated (i.e. still supported, but its use is not recommended and the element may eventually become obsolete).

Obsolete Elements

Several elements of HTML are obsolete, including PLAINTEXT, XMP, LISTING, HPx, and COMMENT. The first three should be replaced with PRE; HP (highlighted phrase) should be replaced with the character highlighting elements; and COMMENT should be replaced with <!-- blah blah blah -->, the SGML comment characters.
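
As a sketch of two of these replacements, with invented content: note that, unlike XMP, the contents of PRE are still read as markup, so characters such as < and > have to be written as entity references.

```html
<!-- Obsolete -->
<XMP>
<P> marks a paragraph break.
</XMP>
<COMMENT>Revise this example later.</COMMENT>

<!-- Replacement: PRE with entity references, and an SGML comment -->
<PRE>
&lt;P&gt; marks a paragraph break.
</PRE>
<!-- Revise this example later. -->
```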

Deprecated Elements

(Give me some time to fill this in. Like until who knows when? >=)

For More Information

There already exist documents on the Web which address this same topic, and perhaps in more detail. For definitive reference information you may wish to check the HTML specification from CERN. For a more detailed discussion of HTML composition style, you should also check the Style Guide (especially the section on device-independent formatting), which is also from CERN.

If you're looking for a good document for learning the basics of HTML, you will want to check out the Beginner's Guide to HTML, from NCSA.

Acknowledgements

I'd like to thank all of you who have visited this document and commented on it, suggesting fixes, clarifications, and even new sections. You know who you are (even if I managed to lose your addresses in the flood of information)! It is, in some senses, still a work in progress and is always amenable to suggestion, modification, and repair. I appreciate your help!

Last modified: Feb 9, 1995

James "Eric" Tilton, HTML Guru Wannabee, jtilton@willamette.edu