Composing Good HTML
Note: This document is available as both a single document (suitable for
printing) and a multi-part document
(more appropriate to hypertext). There is also a PostScript version
available via
FTP, at jupiter.willamette.edu, as
/outgoing/jtilton/strict-html.ps. These multiple views are
automatically generated with a Perl script called "multiview".
The current edition of this document is available online at http://www.willamette.edu/html-composition/strict-html.html.
As the Web continues to explode in its own inimitable fashion, it is
becoming more and more important to write HTML that conforms to
certain guidelines. Specifically, with the current diversity of
clients for the Web (and we can only expect to see more!), it's become
important to write HTML that will look good on any
client, and not just on the specific client which the author may have
access to.
To that end, there are a few solutions. One approach is this one --
documents which point out common errors one might make in the
composition of HTML. The other approach is software based -- a
"lint"-like program for catching semantic errors in HTML, and perhaps
even correcting them.
The thing to bear in mind is that, if you follow these guidelines,
your document may not look as good as it possibly could on a particular
browser. However, it also will not look ugly on any browser, which is
the risk you take by disregarding these recommendations and tweaking
your HTML for, say, Mosaic. Unfortunately, Mosaic may render things
differently from Lynx which may render things differently from TkWWW,
etc, etc, etc. These guidelines, in essence, should ensure the
best fit across the space of all possible browsers, if you
get my drift.
This document does not purport to be a style guide,
or a beginner's manual to HTML. Fine documents
already exist for these purposes.
(Note: This document is fairly stable, but still open to
amendment. Please feel free to comment on that which is missing,
wrong, right, or silly. Especially, please point out
anywhere that I don't follow my own guidelines -- I'll slink back and
fix it, I promise! Thanks to everyone who's
already done so!)
Things contained in this section are good practices for the generation
of any HTML document. Specifically, this would include anything which
should routinely be done in the creation of documents for the benefit
of both reader and author.
It is a good idea to sign and date all documents served on the Web, so
that people viewing the documents can form some impression of the
authority of the document (i.e. how recent it is, and how reliable the
information provider is). For example, this
document has been signed.
Also, when dating a document, try to avoid ambiguous formats. For
example, both the month/day/year and day/month/year format are used on
the web -- so is "4/2/94" April 2 or February 4? A solution to this
is to use the name of the month (or an abbreviation).
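As a minimal sketch of the month-name approach (the helper name `unambiguous_date` and the exact output format are my own assumptions, not something this document prescribes), a date could be formatted in Python like this:

```python
from datetime import date

# English abbreviations are spelled out here so the result does not
# depend on the locale settings of the machine running the script.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def unambiguous_date(d: date) -> str:
    """Format a date with the month named, e.g. 'Apr 2, 1994'."""
    return f"{MONTH_ABBR[d.month - 1]} {d.day}, {d.year}"

print(unambiguous_date(date(1994, 4, 2)))   # Apr 2, 1994
print(unambiguous_date(date(1994, 2, 4)))   # Feb 4, 1994
```

Either reading of "4/2/94" now comes out in a form that cannot be mistaken for the other.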
Finally, the best way to sign a document is to include a LINK
element of type "made" in your HEAD
element. For example:
<HEAD>
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>
For an example, look at the HTML source of this document.
This section details common errors in HTML composition that may lead
to documents which are not fully device-independent. The behavior of
browsers on these errors is undefined, so certain browsers may render
them as intended, but not all browsers are guaranteed to do
so. Therefore, these mistakes should be avoided, even if
your browser of choice renders your documents correctly.
This is probably the most prevalent kind of error, and is the number
one culprit in cases of ugly HTML rendering. If you fix nothing else,
fix these!
Perhaps the biggest misconception about the <P> element is that
it signals an end-of-paragraph, rather than a paragraph break. According
to the specification, "<P> is used between two pieces of
text which otherwise would be flowed together".
In most cases this is not important -- functionally, the <P>
serves as an end-of-paragraph marker. However, in certain contexts,
use of <P> should be avoided, such as directly before or after
any other element which already implies a paragraph break. To wit,
the <P> element should not be placed either
before or after the headings,
HR
(can I get a ruling on this? people don't handle HR consistently... X
Mosaic has no white space before or after, and Lynx appears to put
white space after),
ADDRESS,
BLOCKQUOTE,
or PRE.
It should also not be placed immediately before or after a list
element of any stripe. That is, a <P> should not be used to mark the
end-of-text for <LI>, <DT> or <DD>. These elements
already imply paragraph breaks.
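The rule above lends itself to a "lint"-like check. What follows is only a rough sketch in Python: the function name `redundant_paragraph_breaks` is hypothetical, the element list is my reading of the paragraphs above (with UL, OL and DL standing in for "a list element of any stripe"), and a regular expression is a crude substitute for a real HTML parser.

```python
import re

# Elements that, per the text above, already imply a paragraph break.
# UL, OL and DL are an assumed reading of "list element of any stripe".
BLOCK = r"(?:H[1-6]|HR|ADDRESS|BLOCKQUOTE|PRE|UL|OL|DL|LI|DT|DD)"

# A <P> immediately before or after a start or end tag of one of them,
# with nothing but white space in between.
REDUNDANT_P = re.compile(
    rf"<P>\s*</?{BLOCK}\b[^>]*>|</?{BLOCK}\b[^>]*>\s*<P>",
    re.IGNORECASE,
)

def redundant_paragraph_breaks(html_text):
    """Return each stretch of markup where a <P> touches a block element."""
    return [m.group(0) for m in REDUNDANT_P.finditer(html_text)]

print(redundant_paragraph_breaks("Some text.<P>\n<H2>Next</H2>"))
print(redundant_paragraph_breaks("First paragraph.<P>Second paragraph."))
```

The first call reports the `<P>` sitting directly in front of the heading; the second reports nothing, since a `<P>` between two runs of text is exactly what the element is for.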
Some clarifications on the above might be in order. One concerns
the difficulty browsers have in rendering appropriate white space.
While it is true that all of the elements mentioned above imply a
paragraph break, this only occasionally means that they also imply
white space between sections -- this depends on the browser. So,
while you might feel inclined to add a <P> in order to fix white
space problems, please think twice and avoid it if you can.
Also, when using the glossary list (DL),
please try to avoid using multiple DD's (definitions of terms) in
order to provide multiple entries for a term (DT). Instead, use a
<P> marker between paragraphs in a definition. The use of a DD
(definition) without a matching DT (term) is illegal, although a DT
without a DD can be used without dire consequences.
All clear now?
Simply put, a character
reference and an entity
reference are ways to represent information that might otherwise
be interpreted as a markup tag. For instance, in order to represent
<P> in this text, I had to use &lt;P&gt; in
my raw HTML. There are currently five
entities for this purpose in HTML, as well as several entities
which allow encoding of the ISO
Latin-1 Character Set.
The most common error in the use of references is to leave off the
trailing semicolon. Also, no additional spaces are needed before or
after the entity/character reference.
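Both points can be sketched in Python. `html.escape` is part of the standard library; the helper names `escape_markup` and `unterminated_references` are hypothetical, and the regular expression is a heuristic of my own that will also flag a bare "&" followed by letters (as in "AT&T"), which arguably ought to be written as a reference anyway.

```python
import html
import re

# "&name" or "&#digits" that is NOT followed by the closing semicolon.
UNTERMINATED_REF = re.compile(
    r"&(?:#[0-9]+|[A-Za-z][A-Za-z0-9]*)(?![A-Za-z0-9;])"
)

def escape_markup(text):
    """Replace &, < and > with entity references so they display literally."""
    return html.escape(text, quote=False)

def unterminated_references(html_text):
    """List references that are missing their trailing semicolon."""
    return UNTERMINATED_REF.findall(html_text)

print(escape_markup("<P>"))                    # &lt;P&gt;
print(unterminated_references("&lt;P&gt"))     # ['&gt']
```

Running the second function over a document before publishing it is a cheap way to catch the most common slip described above.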
Another misunderstood aspect of HTML is in the composition of
URLs.
One grey area involves references to directories. It is
possible to request an index of a directory from an HTTP server. The
typical response from the server is to either return a pregenerated
index document (which is often the document "index.html" in the
referenced directory), or to construct an HTML document on the fly
which contains a listing of all files in the directory. However, when
making such a directory reference, it is important to make sure to
have a trailing slash on the URL. That is, if you were to
request the index of the directory which this document resides in, you
would want to refer to it as
http://www.willamette.edu/html-composition/, not as
http://www.willamette.edu/html-composition.
Some servers are able to catch these errors, and provide
redirection to the proper URL, but it's best to get the URL right in
the first place -- notably because not all browsers support
transparent redirection.
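One practical consequence, which the text above does not spell out, shows up when relative links inside the returned index are resolved against the URL that was requested. The sketch below uses `urljoin` from Python's standard library, which follows the standard resolution rules; whether any particular browser behaves the same way is an assumption.

```python
from urllib.parse import urljoin

with_slash = "http://www.willamette.edu/html-composition/"
without_slash = "http://www.willamette.edu/html-composition"

# A relative link found in the directory's index document.
link = "strict-html.html"

print(urljoin(with_slash, link))
# http://www.willamette.edu/html-composition/strict-html.html

print(urljoin(without_slash, link))
# http://www.willamette.edu/strict-html.html
```

Without the trailing slash, the last path segment is treated as a file name and dropped, so the link resolves one level too high.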
Problems can arise when the hostnames in URLs aren't fully qualified.
In local networks, you can usually refer to your own machines simply
by their names -- for instance, here at Willamette we refer to our
local WWW server as "www". However, the server's FQDN (fully qualified
domain name) is "www.willamette.edu". The FQDN provides enough
information that any host, anywhere on the Internet, can find this
particular machine. (It's like trying to find all the Vermeers in New
York :).
What happens is that an HTML author might construct a link that looks like
this:
<A HREF="http://www/~jtilton/metanoia/">Metanoia -- A
Change In Spirit</A>
which produces a link to Metanoia -- A Change In
Spirit that will only work for people on the same local network as that
machine. A correct link would look like this, instead:
<A HREF="http://www.willamette.edu/~jtilton/metanoia/">Metanoia</A>
which would allow all of you who are interested in Metanoia to
actually follow the link.
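A crude way to spot such links automatically is to look for host names that contain no dot at all. This is a heuristic sketch only: the function name `has_qualified_host` is hypothetical, and a name with a dot in it is not thereby guaranteed to be a complete FQDN.

```python
from urllib.parse import urlsplit

def has_qualified_host(url):
    """Heuristic: treat a host name with no dot in it as unqualified."""
    host = urlsplit(url).hostname
    return host is not None and "." in host

print(has_qualified_host("http://www/~jtilton/metanoia/"))                 # False
print(has_qualified_host("http://www.willamette.edu/~jtilton/metanoia/"))  # True
```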
This leads almost directly into:
Finally, a brief section on relative URLs. It is possible to
construct a "relative" URL, which gives you the following advantages:
- It's shorter.
- It makes a collection of documents which are linked together more
portable (easier to move from directory to directory, or server to server).
However, relative URLs can also break things.
A relative URL is a URL which doesn't contain all the necessary parts
of a "full" URL (scheme, host, path information). There's a large
number of things which might fit this description! The browser will
try to assume the parts that have been "left out" by using the information
from the URL of the document which contains the link. However, not
all browsers will make these assumptions in the same way. Here's a
short list of what's "safe" and "unsafe" (based on experience, and not
on a specification anywhere -- unfortunately).
- Safe: Same directory relative URLs
- A reference to a document in the same logical directory (such as
<A HREF="strict-html-gp.html">Good
Practices</A>) is safe. This kind of reference, roughly
speaking, contains no "/"'s.
- Safe: Same server relative URLs
- A reference to a document in the same server (such as <A HREF="/~jtilton/">Eric's
Hyplan</A>) is also safe. This kind of reference, roughly
speaking, will begin with a "/". (It will also be semi-absolute, in
that it starts at the top of that server's directory structure...)
- Unclear: Most other kinds of relative URLs
-
References such as <A HREF="~jtilton/euphonium.html"></A>
can be dangerous -- sometimes browsers will interpret that as meaning
"go up one directory level, find the directory '~jtilton', and then
find 'euphonium.html' in it." And sometimes they won't.
Currently, I don't understand this problem well enough to speak about
it. I will try and get a canonical answer when next I have the energy
to update this document.
- Unsafe: "file://localhost/..."
- It's also possible to have a reference to
"file://localhost/some/file/pathname". What this does is reference
the file described on the local host of whoever is browsing the
document. Which is why a reference to <A
HREF="file://localhost/etc/motd"></A> will display the
message of the day on your machine, not the message of the day on my
machine. Unless you know what you are doing, these references will
really mess up your documents.
(This sub-section isn't written very well, I fear. If anyone has any
better copy, I'll gladly put it here instead. -et/April 7, 1994)
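For comparison, here is how the four kinds of reference listed above come out under the standard resolution rules, as implemented by `urljoin` in Python's standard library. The base URL is this document's own address; the helper name `resolve` is hypothetical. This shows what one conforming implementation does, and is not a claim about what any given browser will do.

```python
from urllib.parse import urljoin

# The URL of the document that contains the links.
BASE = "http://www.willamette.edu/html-composition/strict-html.html"

def resolve(reference):
    """Resolve a link found in the BASE document to a full URL."""
    return urljoin(BASE, reference)

# Same directory: only the last path segment is replaced.
print(resolve("strict-html-gp.html"))
# Same server: the whole path is replaced, scheme and host are kept.
print(resolve("/~jtilton/"))
# No leading slash: resolved inside the current directory.
print(resolve("~jtilton/euphonium.html"))
# A full URL with its own scheme ignores the base entirely.
print(resolve("file://localhost/etc/motd"))
```

Note the third case: under these rules the "~jtilton" reference lands inside "/html-composition/", which is probably not what its author meant.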
One common error that I used to make all the time (I use Marc
Andreessen's html-mode.el
for Emacs these days -- I had to learn Emacs, but now it's so much
easier to write HTML!) was to leave off a quote in my start tags. For
example, this reference to the euphonium,
king of instruments should look like:
<A
HREF="http://www.willamette.edu/~jtilton/euphonium.html">
but I would often use
<A
HREF="http://www.willamette.edu/~jtilton/euphonium.html>
instead. I suppose by the end of that huge URL, I'd forgotten it was
supposed to be quoted. The behaviour of browsers upon encountering
this varies -- some display a proper link, but you can't follow it,
while others actually eat up huge portions of the following text,
thinking it to be part of the URL.
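A missing quote is easy to detect mechanically, since a well-formed tag contains an even number of double quotes. The sketch below is a heuristic under that assumption; the function name `tags_with_unbalanced_quotes` is hypothetical, and a ">" inside a quoted attribute value would confuse it.

```python
import re

# Anything from "<" up to the next ">" is treated as one tag.
TAG = re.compile(r"<[^<>]*>")

def tags_with_unbalanced_quotes(html_text):
    """Return tags that contain an odd number of double quotes."""
    return [t for t in TAG.findall(html_text) if t.count('"') % 2 == 1]

good = '<A HREF="http://www.willamette.edu/~jtilton/euphonium.html">'
bad = '<A HREF="http://www.willamette.edu/~jtilton/euphonium.html>'

print(tags_with_unbalanced_quotes(good))   # []
print(tags_with_unbalanced_quotes(bad))    # the offending tag
```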
Many of the HTML
elements contain information within them. For example,
<em>emphasized text</em>
would be rendered as
emphasized text. There is a start tag
(<EM>
), some content (which may include text, and
in some cases, other nested elements), and an end tag
(</EM>
, indicated by the </). A common mistake
is to miss the / in the end tag. All elements (except empty elements,
see next paragraph) must be terminated by an end tag -- otherwise,
undefined behavior may occur.
Some HTML elements may be empty, such as <P> and <HR>
(CERN provides an extended
discussion on element content). If this is the case, there is no
need for an end tag.
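Unterminated elements can be found with a small stack-based check. This is a sketch built on `html.parser` from Python's standard library; the class and function names are hypothetical, and the set of elements that need no end tag is an assumption that starts from the P and HR named above and adds a few others.

```python
from html.parser import HTMLParser

# Elements for which no end tag is expected. P and HR come from the
# text above; the rest of this set is an assumption for illustration.
NO_END_TAG = {"p", "hr", "br", "img", "li", "dt", "dd"}

class UnclosedTagFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []          # start tags still awaiting an end tag

    def handle_starttag(self, tag, attrs):
        if tag not in NO_END_TAG:    # the parser lowercases tag names
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened element with this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

def unclosed_tags(html_text):
    """Return the names of elements that were opened but never closed."""
    finder = UnclosedTagFinder()
    finder.feed(html_text)
    finder.close()
    return finder.open_tags

print(unclosed_tags("<EM>emphasized text</EM>"))   # []
print(unclosed_tags("<EM>emphasized text<EM>"))    # ['em', 'em']
```

The second call shows the mistake described above: with the "/" missing, the would-be end tag is just another start tag, and both are left open.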
This section concentrates on mistakes in HTML authoring that are more problems
of aesthetics than problems of device-independence.
The section on
HTML elements in the HTML
specification indicates that HTML documents should not mix those
elements which belong in the HEAD of
a document with those which belong in the BODY.
The urgency of this suggestion is unclear, but it does make a certain
amount of common sense for readability of HTML code, and for
conformance with possible future browsers which may not support the
mixing of these elements. Essentially, it lacks serious style points >=).
In general, the use of white space around element tags should be
avoided. If white space immediately follows a start tag, for example,
the style changes implied by that element may be applied to the
initial space, as well. For instance, <A
HREF="http://www.willamette.edu/~jtilton/"> CZeCh THIZ 0uT
</A> would be rendered as CZeCh THIZ 0uT . On
some browsers, there may be white space around the anchor, which adds
unwanted unsightliness to the rendering, and may lessen the impact of
the document. (This comment really applies to white space immediately
following start tags, and immediately preceding end tags).
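For anchors, the stray white space can be trimmed mechanically. The following is a minimal sketch with a hypothetical function name, `trim_anchor_text`; it only handles A elements, and it simply deletes the padding rather than moving it outside the anchor, so check that words on either side do not end up glued together.

```python
import re

# Start tag, optional padding, anchor text, optional padding, end tag.
ANCHOR = re.compile(r"(<A\b[^>]*>)\s*(.*?)\s*(</A>)",
                    re.IGNORECASE | re.DOTALL)

def trim_anchor_text(html_text):
    """Remove white space just inside the start and end tags of anchors."""
    return ANCHOR.sub(r"\1\2\3", html_text)

padded = '<A HREF="http://www.willamette.edu/~jtilton/"> CZeCh THIZ 0uT </A>'
print(trim_anchor_text(padded))
# <A HREF="http://www.willamette.edu/~jtilton/">CZeCh THIZ 0uT</A>
```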
The HTML
specification points out that a heading
should not be more than one level below the heading which preceded it.
That is, <H3> should not follow <H1>, etc.
Also, it is pointed out that "a heading element implies all the font
changes, paragraph breaks before and after, and white space (for
example) necessary to render the heading". Extra highlighting
elements are discouraged, therefore.
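The heading-level rule is simple enough to check automatically. Here is a rough sketch, assuming that headings can be found with a regular expression rather than a full parser; the function name `skipped_heading_levels` is hypothetical.

```python
import re

# Matches the start tags <H1> through <H6>, in either case.
HEADING = re.compile(r"<H([1-6])\b", re.IGNORECASE)

def skipped_heading_levels(html_text):
    """Return (previous, current) pairs where a heading level was skipped."""
    levels = [int(n) for n in HEADING.findall(html_text)]
    return [(prev, cur)
            for prev, cur in zip(levels, levels[1:])
            if cur > prev + 1]

print(skipped_heading_levels("<H1>Title</H1><H3>Oops</H3>"))   # [(1, 3)]
```

Moving back up by any number of levels (say, from H3 to H1) is not flagged, since the rule only concerns headings that drop more than one level.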
When creating documents, make sure that your links are meaningful
-- that is, that they avoid
online-specific references, and that they don't
detract from readability. The text of your links should flow well
in the context of the rest of your text (especially avoid the click
here syndrome!), and your text should also be able to stand
alone as a printable
document.
In other words, avoid using sentences like, "You can find out more
information about cows by clicking here". (This
is also bad because it refers to "clicking", which assumes that
everyone is using a mouse with their browser!) A much better
alternative is "More
information about cows is available."
Since HTML (and also SGML) is designed to be a device independent language for
describing the content of documents, most of the elements within it
aren't intended to give direct control to the author over how the
final page layout will look. The major exceptions to this are in the
character
highlighting elements.
There are two types of character highlighting elements -- physical and
logical. The physical styles involve things like "italic font",
"boldface", etc; while the logical styles are things like "emphasis",
"citation", "strong", etc. It is strongly
recommended that you employ the logical styles rather than
the physical styles in your documents. Using <I></I> to
render text in italics will only be effective on those browsers which
are capable of displaying italics -- which not all browsers are
guaranteed to do. It is far better to encode semantic content -- to
describe things in terms of logical styles -- and then allow the
browser to display that semantic structure as best it can, given its
display capabilities.
So, instead of
- <I>italics</I>
- you might use <EM>emphasized</EM>, or a
<CITE>citation</CITE>, and instead of
- <B>bold</B>
- you might use <STRONG>strong</STRONG>.
This also leaves the possibilities open in the future for more
sophisticated uses of these semantic renderings, which have much more
inherent meaning than font styles like bold or italic.
(Unfortunately, the jury is still out to lunch on this one. One
argument against logical character styles is that it turns out to be a
bottomless pit, attempting to define logical styles for every
possibility. Physical styles, combined with the context of the text
in which they are placed, seem to provide a much richer set without a
huge number of tags. Oh, well. Use logical styles when you can, though.)
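If you have existing documents full of physical styles, a first pass at converting them might look like the sketch below. The function name `to_logical` is hypothetical, and the blanket mapping of I to EM is an assumption: as noted above, some italics are really citations, so the output still deserves a read-through.

```python
import re

# Assumed default mapping; <I> may sometimes be better served by <CITE>.
PHYSICAL_TO_LOGICAL = {"i": "EM", "b": "STRONG"}

def to_logical(html_text):
    """Rewrite <I> and <B> start and end tags as <EM> and <STRONG>."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        return f"<{slash}{PHYSICAL_TO_LOGICAL[name.lower()]}>"
    return re.sub(r"<(/?)([IB])>", swap, html_text, flags=re.IGNORECASE)

print(to_logical("<I>italics</I> and <b>bold</b>"))
# <EM>italics</EM> and <STRONG>bold</STRONG>
```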
This section lists elements of HTML whose use should be avoided,
whether because the element is now obsolete, or because the element is
being deprecated (i.e. still supported, but its use is not recommended
and the element may eventually become obsolete).
Several elements of HTML are obsolete,
including PLAINTEXT, XMP, LISTING, HPx, and COMMENT. The first three
should be replaced with PRE; HP
(highlighted phrase) should be replaced with the character
highlighting elements; and COMMENT should be replaced with
<!-- blah blah blah -->
, the SGML comment
characters.
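Finding the obsolete elements in an existing document is a simple search. The sketch below assumes that the "x" in HPx stands for a number, and the function name `obsolete_elements` is hypothetical.

```python
import re

# Start tags of the obsolete elements named above ("HPx" read as HP + digits).
OBSOLETE = re.compile(r"<(PLAINTEXT|XMP|LISTING|HP[0-9]+|COMMENT)\b",
                      re.IGNORECASE)

def obsolete_elements(html_text):
    """Return the upper-cased names of obsolete elements, in document order."""
    return [name.upper() for name in OBSOLETE.findall(html_text)]

print(obsolete_elements("<XMP>x</XMP> <hp1>y</hp1> <PRE>z</PRE>"))
# ['XMP', 'HP1']
```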
(Give me some time to fill this in. Like until who knows when?
>=)
There already exist documents on the Web which address this same
topic, and perhaps in more detail. For definitive reference
information you may wish to check the HTML
specification from CERN. For a more detailed discussion of HTML
composition style, you should also check the Style
Guide (especially the section on
device-independent
formatting), which is also from CERN.
If you're looking for a good document for learning the basics of HTML,
you will want to check out the
Beginner's Guide to HTML, from NCSA.
I'd like to thank all of you who have visited this document and commented
on it, suggesting fixes, clarification, and even new sections. You
know who you are (even if I managed to lose your addresses in the
flood of information)! It is, in some senses, still a work in
progress and is always amenable to suggestion, modification, and
repair. I appreciate your help!
Last modified: Feb 9, 1995
James "Eric" Tilton, HTML Guru Wannabee,
jtilton@willamette.edu