Alert: The following text describes the intentions. The current version of this document is just a collection of random draft sections.
This document is a self-contained, extensive guide to HTML authoring, based on logical structuring. It emphasizes universal accessibility through various browsers as well as search engines. The language used is a subset of HTML 4.0 (Strict version).
It has often been argued that the content provider should only provide the content, such as the text and images, and other people should take care of adding HTML markup. The fallacy here is the idea that the content is a string of characters, to which some markup is then added. Well, it often happens that people convert Ascii files into HTML files. But please notice that this requires the recognition of the structure of the text, so that you can mark some text as heading elements, some paragraphs as block quotations, etc. (If there is a printed version of the text, with bolding and italics and so on, it may be useful here.)
Thus, if plain text is written first by an author and then markup is added by someone else, the person who does the conversion needs to guess the author's intentions as regards the logical structure. Wouldn't it be better to let the author express her or his intentions clearly and unambiguously in a very simple language? Such as writing
<H2>
and
</H2>
around a heading.
Professional (human) editors can often improve texts by adding (or suggesting) headings and emphases as well as rewording the text, deleting less important things, etc. But that's a different thing, and it requires special expertise.
Thus, the author should write the markup at the same time as he or she writes the content. Markup isn't an extra spice added later but an essential ingredient of the food. It describes the structure. By the way, the author need not know everything about HTML. Contrary to popular belief, it is not obligatory to use every kind of element there is in HTML, or even to know them all. :-)
On the other hand, HTML authoring could, and perhaps often should, be separated from some technicalities of Web publishing. It's not difficult to learn to master the basic HTML markup. But what can be really difficult to learn (and do) for people who are not computer professionals is what to do with the HTML file once you've written it. FTP'ing, setting file protections and things like that should perhaps be handled by people to whom they are easy. This phase might involve running the page through a validator, a linter, a link checker, and a spelling checker, fixing any obvious errors detected thereby and discussing with the author when necessary to find out the author's intentions.
Explain relations to HTTP, SGML, XML, CSS, Java, scripting languages.
There can be several levels of structure in an HTML document. The document might divide into sections, which are divided into subsections, etc., until we come to constructs like paragraphs. The paragraphs may contain text-level markup like emphasis on some words. But on top of such nested structures, there is a structure which will be called "cortex" here. (In anatomy, "cortex" refers to the outermost layer of the human brain.)
Before starting to create an HTML document, you should make it clear to yourself why you are going to do it: What is the communicative purpose? What kind of message are you trying to deliver or what kind of interaction would you like to establish? Is there some particular audience for which it will be written? If you find such questions strange or too difficult, perhaps you should read my discussion So you want to create a home page?.
Naturally, answers to such questions can be refined later, and the mode of answering depends on the personal style of the author. Some people like to write things down while others have ideas which might never be formulated verbally. But you should be prepared to write some formulations, since statements of your intentions may constitute a very important part of the document or its so-called metadata.
An HTML document should begin with a so-called document type declaration, which identifies the document type definition (abbr. DTD) for the particular version of HTML used in the file. Although most browsers ignore it, it is crucial when the document is processed by a validator. The declaration is also very important if the document is processed by a general SGML browser, i.e. a program which can display a document written in any language defined using SGML, not just HTML documents.
The document type declaration for HTML 4.0 documents is the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
The reason for using HTML 4.0 is that it contains the very useful LANG attribute, which did not belong to previous versions of HTML (HTML 2.0 and HTML 3.2) and which should be used in all documents to indicate the human language used in them.
Just in case you wonder: The letters EN in the document type declaration really stand for "English", but they refer to the language used when defining the HTML language, not to the language used in your document. Completely different mechanisms, most importantly LANG attributes, are used for specifying the language of a document. Therefore, do not change EN there even if your document is not in English.
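As a hedged illustration of this point, here is a minimal sketch of a document written in Finnish. The Finnish text and the choice of elements are assumptions made for this example only, and the start tag of the HTML element with its LANG attribute is explained later in this document. Notice that the declaration still says EN, while the LANG attribute says fi.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<!-- EN above names the language of the HTML definition; it stays EN -->
<HTML LANG="fi">
<!-- LANG="fi" is what tells that the text itself is in Finnish -->
<TITLE>Tervetuloa</TITLE>
<H1>Tervetuloa</H1>
<P>Tervetuloa!</P>
</HTML>
```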
Having now discussed the document type declaration, we will just assume it's there and use the word "document" to refer to that part of an HTML document which follows that declaration.
HTML element
A document consists of elements. An element is a structured part of a document, such as a heading, a paragraph, or an emphasized sentence or word. Elements can be nested: an element may contain other elements. In fact, the entire document is a single element, an HTML element, which contains everything else. (Notice that here "HTML element" means a specific element with the name HTML while in other contexts "HTML element" might refer to any element in the HTML language.)
You begin a document with
<HTML LANG="lc">
and end it with
</HTML>
Here lc is to be replaced by a two-letter code for the (main) language used in the texts of the document. See below for explanations.
Generally an element consists of a start tag and an end tag and anything between them, the content of the element (which may contain other elements). The tags have the same form as the HTML tags described above: a tag is enclosed within the angle brackets < and >, within which you have first the / sign, if the tag is an end tag, and then a tag name such as HTML. In a start tag, the tag name is optionally followed by one or more attribute specifications (like LANG="en-US"); end tags never carry attributes. An attribute specification consists of an attribute name, an equals sign, and an attribute value. Each attribute has its own set of allowed values. We write all attribute values in quotes, although in principle the quotes might be omitted in some cases.
Tag names and attribute names are case insensitive; e.g. LANG, lang and Lang are completely equivalent as attribute names. We will write tag names and attribute names in upper case letters, since this usually makes it easier to distinguish HTML markup from the text of a document.
For example, consider the following simple fragment of an HTML file:
<P>
<EM>An element may <STRONG>contain</STRONG>
another element.</EM>
Such nesting may occur <SPAN LANG="la">ad infinitum</SPAN>,
in principle.
</P>
Here we have a P element (a paragraph), which contains, in addition to simple pieces of text, an EM element (for emphasis) and a SPAN element; the latter is used just in order to specify, using the LANG="la" attribute, that some words are in Latin. The EM element in turn contains a STRONG element, which is used to give one word even stronger emphasis.
A few elements consist of a start tag only, i.e. neither content nor end tag is needed or allowed. They are called empty elements, which is somewhat misleading; they are not comparable to empty statements in programming languages, for example. One common "empty element" is <BR>, which indicates a line break. It would have been more logical to define things so that division into lines (if specified in HTML at all) is indicated using an element for a line, having a start tag, content, and an end tag. But there are a few deviations from the simple structural model in HTML.
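To make the idea concrete, the following sketch shows BR used between the lines of a short verse; the verse is just sample text chosen for this illustration. Notice that there is no end tag for BR, and that the line breaks typed in the source have no effect by themselves; only the BR elements force a new line.

```html
<P>
Roses are red,<BR>
violets are blue,<BR>
sugar is sweet.
</P>
```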
The language code used as a LANG attribute value is the two-letter code as defined by the ISO 639 standard.
Examples:
ar Arabic,
de German,
el Greek,
en English,
es Spanish,
fi Finnish,
fr French,
he Hebrew,
hi Hindi,
it Italian,
ja Japanese,
nl Dutch,
pt Portuguese,
ru Russian,
sa Sanskrit,
ur Urdu,
zh Chinese.
See document ISO 639 Languages and Dialects, and More by Michel Gélinas for additional information such as alternate names for the languages. See document Language Codes: ISO 639, Microsoft and Macintosh by Unicode for a draft list of language code correspondences between ISO codes, Microsoft codes, and Macintosh codes.
If you use a language to which no language code has been assigned, you can use a code which begins with x-, such as x-klingon. Naturally, you cannot expect programs processing your HTML files to recognize such codes.
It is possible to provide extended language information by appending a hyphen and a subcode to the primary language code mentioned above. Any two-letter subcode is interpreted as a country code according to ISO 3166. (See document Country Codes: ISO 3166, Microsoft and Macintosh by Unicode for a draft list of country code correspondences between ISO codes, Microsoft codes, and Macintosh codes.) For example, en-US means the U.S. version of English, and en-GB means British English. It can be very useful to include a country code for languages where the spellings of words may vary, as in English (e.g. color versus colour), since language information can be utilized by spelling checkers.
Subcodes of other forms can be registered (at IANA), but the registry is actually very small: it only contains subcodes for the two versions of the Norwegian language.
The official recommendation is to write language codes in lower case and country codes in upper case, as in en-GB. But this is a recommendation only; the codes are case-insensitive.
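As a small sketch of how a country subcode might be used, the two paragraphs below carry the same sentence in U.S. and British spelling. The sentences are invented for this illustration; the point is only that a spelling checker which understands LANG values could, in principle, check each paragraph against the right word list.

```html
<P LANG="en-US">The color of the harbor was gray.</P>
<P LANG="en-GB">The colour of the harbour was grey.</P>
```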
The language specification can be used by various programs which process your document for presentation or otherwise, such as spelling checkers, speech synthesizers, and search engines. Although not very widely utilized yet, this feature has great potential in it and should be used in all new documents.
If your document contains text in a language other than the main language, such as a French quotation in an otherwise English document, you can and should indicate that by providing a suitable LANG specification for that part of the document. For instance, you could precede a French quotation with
<Q LANG="fr">
and end it with
</Q>.
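A minimal sketch of such a passage follows, assuming that the surrounding document has been declared to be in English (e.g. with LANG="en" in the HTML start tag). The quotation was picked for this example. No quotation marks are typed around it, since the Q element itself says that the text is a quotation.

```html
<P>As Voltaire wrote, <Q LANG="fr">Il faut cultiver notre jardin</Q>.</P>
```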
For further information on language codes, consult RFC 1766.
Metainformation means information about information. Thus, for an HTML document, metainformation is information about the document, as opposed to information in the document. Although metainformation can be specified outside the document, too, e.g. in so-called HTTP headers, it can be embedded into the HTML document as well.
Metainformation is specified in TITLE, META and LINK elements before the body of a document.
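The following is a rough sketch of how these three kinds of elements might appear together at the start of a document. The title, the description, the author name and the file name index.html are all hypothetical and serve only as placeholders; the META names and the LINK type shown are common choices, not the only possible ones.

```html
<HTML LANG="en">
<!-- Metainformation comes first, before any visible content -->
<TITLE>Growing tomatoes in a cold climate</TITLE>
<META NAME="DESCRIPTION" CONTENT="Practical advice on growing tomatoes outdoors in a short, cool summer.">
<META NAME="Author" CONTENT="A. N. Author">
<LINK REL="Contents" HREF="index.html" TITLE="Table of contents">
<!-- The body of the document starts with the main heading -->
<H1>Growing tomatoes in a cold climate</H1>
<P>The body of the document starts here.</P>
</HTML>
```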
It may sound strange to begin with writing overall titles and summaries. After all, the author might be starting a research project, for instance, and in such cases it is better not to know what the conclusions will be! (Otherwise it wouldn't be research at all.)
However, a summary is not the same as conclusions. For an ongoing research project, a summary tells what the research is about, what is the general approach and methodology, some hypotheses, and so on.
When starting the creation of a Web document, you should always try to write a summary first. The summary can later be refined or even completely changed as many times as needed. But if you can't write a summary at all, you should really do some thinking before starting a Web page creation project at all.
You should write three different summaries at the minimum: an external title, an overall heading, and a summary proper.
For example, the top-level page of a laboratory of a university should have an external title which contains the name of the university at least as an abbreviation, in order to be understandable out of context, too. On the other hand, the overall heading could be just the name of the laboratory, if the page otherwise contains an indication of the context, such as the name or logo of the university which is a link to the main page of the university. The summary should express the major activities of the laboratory, with emphasis on its strongest areas of research and other key issues which may draw potential visitors' attention.
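Technically, you should normally present the external title as the TITLE element, the overall heading as the main heading (H1 element) of the page, and the summary proper both as the CONTENT attribute of a META element with the attribute NAME="DESCRIPTION" and as the first paragraph of the body. For example: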
<TITLE>Low Temperature Laboratory at the Helsinki Univ. of Technology</TITLE>
<META NAME="DESCRIPTION" CONTENT=
"In the Low Temperature Laboratory of the Helsinki University of
Technology, the main fields of research are ultralow temperature physics,
neuromagnetic brain studies, and cryogenic application.">
<H1>Low Temperature Laboratory</H1>
<P>In the Low Temperature Laboratory (LTL) of the
<A HREF="http://www.hut.fi/">Helsinki University of Technology</A>
the main fields of research are
ultralow temperature physics, neuromagnetic brain studies, and cryogenic
application.</P>
The reason for recommending such a multitude of different summaries is that each of them has its own purpose, as described above. In particular, as regards the two presentations of the summary proper, they are useful since some search engines pay attention to the META element while others extract a summary from the beginning of the body of the document. Moreover, a visible summary under the main heading is often very useful to human readers, especially to those who arrive at the page in some other manner than by using search engines.
When plain text is typed into an HTML document, it is to be understood as material to be formatted by a browser (or otherwise processed by a user agent). For example, do not expect text to appear with the same line length and division into lines as you type it. (The PRE element and the TEXTAREA element are the only exceptions.)
The basic rules for typing plain text are the following:
- The characters < and > and & must be typed as &lt; and &gt; and &amp; respectively. Thus, for example, to produce the notation R&D you should type R&amp;D and to produce a<b type a&lt;b.
- To prevent a line break between two words, type &nbsp; (which stands for "no-break space") between them. Notice that you use it instead of a normal space (or linebreak), not in addition to it! For example, to prevent a line break between "principle" and "7" in "principle 7", you would type it as principle&nbsp;7. This is mainly an esthetic thing, and perhaps you find it too irrelevant to care about.
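Put together, the notations &lt; and &gt; and &amp; and &nbsp; might be used as in the following sketch; the sentence is made up for this illustration. A browser would be expected to show it as "Our R&D group showed that a<b holds whenever b>a; see principle 7.", with no line break between "principle" and "7".

```html
<P>Our R&amp;D group showed that a&lt;b holds whenever b&gt;a;
see principle&nbsp;7.</P>
```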
If you only need Ascii characters, you need not bother about other character problems, except that in some cases your keyboard might not be able to produce some of the special characters in Ascii. The Ascii characters are listed in the following:
! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~
(Remember to present & and < and > as explained above.)
If you also need West European national characters such as ä (a umlaut, used e.g. in German and Swedish) and é (e with acute accent, used e.g. in French), you may have difficulties of some sort, partly because they have different internal representations in different computers. In that case you might start from my more technical notes on character issues in HTML.
This section describes the basic structure of HTML tables as well as the use of simple tables. Section Tables will describe various additional features and illustrate them with more complicated examples.
To be written.
Section Simple tables described the basic structure of HTML tables as well as the use of simple tables. This section describes various additional features and illustrates them with more complicated examples.
To be written.
In March 1989, Tim Berners-Lee wrote a proposal, Information Management: A Proposal, which outlined a client-server based hypertext information system. Various drafts for a hypertext markup language for the World Wide Web were written in subsequent years. It seems that the first attempts to write a formal specification were made in late 1992 and early 1993. In June 1993, an Internet draft Hypertext Markup Language (HTML) was published. Later the name "HTML 1.0" has been used to denote, rather vaguely, such early drafts and related practices.
However, no specification labeled "HTML 1.0" was ever approved. The first HTML specification which can be called "standard" in any sense was the HTML 2.0 specification, which became a proposed standard in November 1995. It describes and standardizes the practices of 1994. Conceivably, by November 1995 the discussion and implementation of new features was directed elsewhere.
Various sketchy proposals with many interesting ideas were written, such as HTML+ and HTML 3.0. In particular, an extensive and detailed document on HTML tables was written and released as RFC 1942 in May 1996. Some features were taken from such documents to the HTML 3.2 specification, which was approved in January 1997, but essentially HTML 3.2 reflects the state of HTML as implemented in popular browsers (like Netscape and Internet Explorer) in early 1996.
HTML 4.0 has a similar history. It is a mixed collection consisting of HTML 3.2, extensions as implemented in popular browsers, and some structural additions taken from old drafts or newer proposals such as the Internationalization of the Hypertext Markup Language document (RFC 2070; dated January 1997). Consequently, the implementation status of HTML 4.0 varies greatly: those ingredients which were essentially taken from popular browsers are supported by them, whereas structural improvements such as the OBJECT element or the extended set of character entity references are rather poorly supported thus far.
For more details, please refer to HTML Overview by Brian Wilson.
Unfortunately, the early history of HTML is poorly documented. Some information about the history of the World Wide Web in general can be found through the About The World Wide Web page of W3C, but it gives little information about HTML development. The HTML 3.0 draft contained an Acknowledgments section (partly edited from the Acknowledgments in the HTML 2.0 specification) with some remarks on the history of HTML. The history archive of W3C contains a very confusing and random-looking collection of documents. On the other hand, the Publication History page at W3C contains a relatively good list of HTML specifications and drafts. Sadly, W3C seems to keep changing its site structure, so these pages might be moved any day.
&name; notations) in HTML 4.0
The elements described in this document form a subset of the elements defined in the HTML 4.0 Specification. Only the start tag of each element is presented here.
| Overall structure | |
|---|---|
| <HTML LANG="langcode"> | for specifying the language of the document |
| <TITLE> | title associated with the document |
| Headings | |
| <H1> | top-level heading |
| <H2> | second-level heading |
| <H3> | third-level heading |
| <H4> | fourth-level heading |
| Blocks of text | |
| <P> | normal paragraph |
| <BLOCKQUOTE> | quotation from external source |
| <ADDRESS> | address info about author |
| <PRE> | preformatted text |
| Lists | |
| <UL> | unordered list |
| <OL> | ordered list |
| <LI> | list item |
| <DL> | definition list |
| <DT> | term in definition list |
| <DD> | definition data for term |
| Classification of phrases (text markup) | |
| <EM> | emphasized text |
| <STRONG> | strongly emphasized text |
| <Q> | quotation |
| <CITE> | citation (title of a book or article or equivalent) |
| <DFN> | occurrence of a term in its definition |
| <CODE> | computer program code or equivalent |
| <SAMP> | sample output from e.g. a computer program |
| <KBD> | text to be typed by a user |
| <I> | text to be presented in italics |
| <SMALL> | text to be presented in a font smaller than normal |
| Hypertext links | |
| <A HREF="URL"> | link to a document |
| <A HREF="URL#name"> | link to a named location in a document |
| <A HREF="#name"> | link to a named location within the same document |
| <A NAME="name"> | names a target location for links |
| Other elements | |
| <IMG SRC="URL" ALT="text"> | image to be embedded |
| <BR> | forced line break |
| <HR> | change of topic (horizontal rule) |
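To show how some of the elements in the table fit together, here is a small sketch of a complete document. All of its text, the anchor names terms and notes, and the author name are invented for this illustration; it is meant as one possible arrangement, not as a template to be followed literally.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML LANG="en">
<TITLE>Sample: a subset of HTML 4.0 at work</TITLE>
<H1>A sample document</H1>
<P>This page has two parts:</P>
<!-- Links to named locations within the same document -->
<UL>
<LI><A HREF="#terms">Terms</A></LI>
<LI><A HREF="#notes">Notes</A></LI>
</UL>
<H2><A NAME="terms">Terms</A></H2>
<DL>
<DT><DFN>Element</DFN></DT>
<DD>A structured part of a document, such as a heading or a paragraph.</DD>
</DL>
<H2><A NAME="notes">Notes</A></H2>
<P>Element names such as <CODE>EM</CODE> are <EM>not</EM> case sensitive.</P>
<HR>
<ADDRESS>Written by A. N. Author</ADDRESS>
</HTML>
```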
An alphabetic list is to be added.
- describe why structured approach is needed; variation of browsers, style sheets, search engines, printing programs, analyzers etc.
- headings and html2ps
- A REL TITLE HREF, using TITLE unless the link text says it all
- prepare for the death of links: provide information for searching, too (eg title, author)
- using TARGET with named window for footnotes; logically, suggesting that the linked resource should be viewed in parallel with the current document
- make your "area" (directory, folder) of HTML documents well-organized from the beginning; use a consistent naming scheme (avoiding names which cause problems in URLs) and divide the material into a hierarchy (of subdirectories)
- notations (HTML 2.0), (HTML 3.2), (HTML 4.0), which indicate which specs a feature belongs to and provide links to the appropriate parts; problem: what if e.g. an element was in 2.0 but its attributes were added in 3.2 and deprecated in 4.0?
- present examples with sample renderings at least occasionally (e.g. Netscape 4.0 with defaults, IE with a nice style sheet, plus Lynx; what about speech samples?).
- char repertoire in anchor names: avoid need for encoding
- refer to http://pw1.netcom.com/~garbl1/writing.html and especially to http://www.cc.columbia.edu/acis/bartleby/strunk/