This document explains how to create a “customized” Document Type Definition (DTD) for a dialect of HTML. The purpose is to make it possible to use a markup validator on your HTML documents even if you intentionally deviate from official HTML specifications. This will help you find typos and other errors in your documents. For information (and my views) on validation in general, see the document “‘HTML validation’ is a good tool, but just a tool”.
This document discusses “classic” versions of HTML, which are based on SGML. For XHTML, which is XML based, the structure of DTDs is somewhat different.
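To make a document refer to your DTD, start it with a document type declaration of the form
<!DOCTYPE HTML SYSTEM "dtdurl">
where dtdurl is the URL of the DTD, for example:
<!DOCTYPE HTML SYSTEM "http://www.cs.tut.fi/~jkorpela/html/tagsoup.dtd">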
For a more detailed introduction, with examples, please refer to Using a Custom DTD by the WDG.
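As a minimal sketch, a complete document that refers to a custom DTD might look like the following. The URL is a placeholder, assumed here for illustration only; in practice you would point it at your own copy of a DTD.

```sgml
<!DOCTYPE HTML SYSTEM "http://www.example.com/dtds/my-html.dtd">
<html lang="en">
<head>
<title>A document checked against a custom DTD</title>
</head>
<body>
<p>The validator reads the DTD from the URL given in the declaration above.</p>
</body>
</html>
```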
In principle, a DTD is SGML code, and its Internet media type would best be declared as an SGML type (RFC 1874 registers text/sgml and application/sgml). In practice, this confuses some browsers, such as Internet Explorer, when someone tries to open a DTD directly in a browser, so text/plain might be a better choice. And that’s the choice I’ve made.
You can start from the HTML 4.01 Strict DTD with comments removed, or maybe the HTML 4.01 Transitional DTD with comments removed. Removing the comments makes editing easier. Comments can be useful when reading a DTD, but they have no impact on validation.
At the simplest, you might remove something because you have decided, or have been told, not to use some HTML 4.01 Strict features. You might decide that HTML 4.01 Strict is not strict enough for you, or you might want to avoid some of its constructs on a particular page or site. Using a restricted DTD to check that such principles are obeyed is particularly useful when working with large old documents that may contain all kinds of markup.
For example, if you decide not to use the button element, it is sufficient to remove its name and the preceding vertical bar “|” from the following declaration in the DTD:
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
It is not necessary to remove the !ELEMENT and !ATTLIST declarations of the element itself, though they can, of course, be removed too. The point is that by removing the only reference to the element in other elements’ declarations you make it impossible to use the element validly.
Similarly, to disallow an attribute on an element, simply remove the corresponding line from the !ATTLIST declaration for the element.
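For instance, to disallow the cite attribute on the q element, you could shorten its attribute list as sketched below. The choice of q and cite is just an illustration, and the two declarations are alternatives (before and after), not meant to appear together in one DTD.

```sgml
<!-- Before: as in HTML 4.01 Strict, with the original comments removed -->
<!ATTLIST Q
  %attrs;
  cite        %URI;          #IMPLIED
  >

<!-- After: the line for cite has been removed -->
<!ATTLIST Q
  %attrs;
  >
```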
If you wish to make an attribute required, just search for its definition in an !ATTLIST declaration and replace #IMPLIED by #REQUIRED. Beware that the same attribute might be defined for different elements and hence appear in different !ATTLIST declarations. There might be other complications, too.
For example, assume you wish to make lang required on the html element, a good move that supports accessibility principles. The attribute list of html is, however, defined as follows:
<!ATTLIST HTML %i18n;>
<!ENTITY % i18n "lang %LanguageCode; #IMPLIED dir (ltr|rtl) #IMPLIED ">
If you just changed the definition of the lang attribute in the entity so that it has #REQUIRED instead of #IMPLIED, you would make the attribute obligatory for all elements whose attribute lists use the %i18n; entity. That would not make sense, of course. As a simple solution, you could rewrite the !ATTLIST declaration for the html element as follows (dispensing with the entity reference, which is really just an auxiliary notation):
<!ATTLIST HTML lang %LanguageCode; #REQUIRED dir (ltr|rtl) #IMPLIED >
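As a sketch of the effect, a document checked against the modified DTD would need to begin along the following lines. As far as I know, SGML does not allow a start tag to be omitted when the element has a required attribute, so the html start tag would have to be written out. The language code en and the dtdurl placeholder are just examples.

```sgml
<!DOCTYPE HTML SYSTEM "dtdurl">
<html lang="en">
<head>
<title>Example</title>
</head>
<body>
<p>Some text.</p>
</body>
</html>
```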
You can also make start and end tags required when they are omissible according to official HTML specifications. For example, the reason why you can omit </p> tags (and let browsers infer the end of a paragraph from the start of an element that may not appear inside a p element) is the declaration
<!ELEMENT P - O (%inline;)*>
Here the hyphen ‘-’ indicates that the start tag is not omissible, whereas the letter ‘O’ indicates that the end tag is omissible. If you replace ‘O’ by ‘-’, the end tag becomes required. The following list contains all the element declarations in HTML 4.01 Strict that permit start or end tag omission, except for the elements with EMPTY declared content (which cannot have an end tag at all):
<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL)>
<!ELEMENT P - O (%inline;)*>
<!ELEMENT DT - O (%inline;)*>
<!ELEMENT DD - O (%flow;)*>
<!ELEMENT LI - O (%flow;)*>
<!ELEMENT OPTION - O (#PCDATA)>
<!ELEMENT THEAD - O (TR)+>
<!ELEMENT TFOOT - O (TR)+>
<!ELEMENT TBODY O O (TR)+>
<!ELEMENT COLGROUP - O (COL)*>
<!ELEMENT TR - O (TH|TD)+>
<!ELEMENT (TH|TD) - O (%flow;)*>
<!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
<!ELEMENT HTML O O (%html.content;)>
Thus, these are the declarations that you need to consider if you wish to make end tags (and start tags) required more strictly than in HTML 4.01 Strict.
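For example, a minimal sketch of making the end tag of p required would be to change its declaration as follows:

```sgml
<!-- End tag of P made required: the O for the end tag replaced by a hyphen -->
<!ELEMENT P - - (%inline;)*>
```

With this declaration, markup like <p>First<p>Second should be reported as an error, since the second <p> start tag appears before the first paragraph has been closed, and p is not allowed inside p.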
In order to add elements or attributes, you need to know a little bit more about SGML. (You might wish to check my short list of links to SGML material.) But you can largely just imitate the declarations in official HTML DTDs. However, you need to remember that to add an element, three changes are needed:
an !ELEMENT declaration, which specifies the name of the element, the omissibility of start and end tags, and the content model (i.e., a syntactic description of the contents of the element)
an !ATTLIST declaration, which specifies the possible (and perhaps required) attributes; in the rare case of an element that takes no attributes, this declaration is omitted
a reference to the element in the content model of some other element’s !ELEMENT declaration, so that the new element is allowed in a document in the first place.
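The three changes can be sketched as follows for a hypothetical addition of nobr as an inline element to HTML 4.01 Strict. The content model, the attribute set and the choice of the %special; entity as the place to allow the element are assumptions made for illustration, not a description of how any particular DTD does it.

```sgml
<!-- 1. The element itself: both tags required, inline content -->
<!ELEMENT NOBR - - (%inline;)*>

<!-- 2. Its attributes: here just the common attribute set -->
<!ATTLIST NOBR %attrs;>

<!-- 3. A reference that allows it wherever inline elements are allowed;
     this replaces the existing declaration of the entity -->
<!ENTITY % special
   "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO | NOBR">
```

The modified entity declaration stays where the original one is, near the start of the DTD, whereas the !ELEMENT and !ATTLIST declarations must come after the declarations of the %inline; and %attrs; entities that they use.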
It might be useful to include a declaration like the following in order to name the version of HTML you are using:
<!ENTITY % HTML.Version "HTML 4.01 Restricted">
Naturally, you would replace HTML 4.01 Restricted by the name of your HTML version, for example HTML 4.01 Extended or HTML for ACME. In addition to being potentially useful as documentation, this will make some validators give their reports in a better form, since they include the markup language’s name as defined by %HTML.Version in their reports. The W3C validator would then say e.g. “This page is not Valid HTML 4.01 Extended!”, which might be better than saying just “This page is not Valid !”. This is debatable, though, since it might be argued that a validator should really just say whether a document is valid or not.
The SGML declaration for HTML defines a parameter called GRPCNT, with value 64, specifying the maximum number of tokens in a group. This restricts, in particular, the number of different inline elements, since the names of these elements form a group of tokens. This is especially serious since the total number of those tokens is so large in HTML 4.01 Transitional that adding even one element exceeds the limit.
You can often avoid this problem by removing at least as many elements as you add. For example, some of the existing elements are of hardly any use in modern documents.
The W3C validator enforces the GRPCNT limit, and this is not likely to change. (See the note on custom DTD support by Terje Bless in the www-validator list.) But you can use a validator that has an essentially larger value for the GRPCNT parameter.
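For reference, the limit is set in the QUANTITY section of the SGML declaration. The excerpt below is an abridged sketch of that section as given in the HTML 4.01 specification; check the specification itself for the exact text.

```sgml
QUANTITY SGMLREF
         -- other quantities omitted --
         GRPGTCNT 150
         GRPCNT   64
```

A validator that uses a different SGML declaration, with a larger GRPCNT value, would accept longer groups.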
My tagsoup DTD contains HTML 4.01 Transitional and a collection of more or less commonly used extensions, namely:
for body, the attributes for setting margins (different attributes for different elements, see Marginal issues in Web page design)
the attributes framespacing (commonly used to remove borders between frames) and bordercolor; this is actually relevant for a frameset document only but technically implemented here
the galleryimg attribute for affecting IE’s odd behavior
the wrap attribute, with values including hard; see notes on wrapping in textareas
bgsound, either in the head element or as inline markup (empty element)
blink as inline markup (inline content allowed)
embed as inline markup (empty element)
keygen as inline markup (empty element)
listing as block element with CDATA content
marquee as inline markup (inline content allowed)
nobr as inline markup (inline content allowed)
noembed as block element
xmp as block element with CDATA content.
The elements listing and xmp are not properly described (or describable) in the DTD, since their original idea was that no markup other than the element’s own end tag is recognized. The DTD describes the content as CDATA, but this means that all end tags are recognized.
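To illustrate, a declaration with CDATA content might look like the sketch below. This is an assumption about the general form, not a quotation from the tagsoup DTD.

```sgml
<!ELEMENT XMP - - CDATA>
```

With such a declaration, a validator would treat the </b> in <xmp><b>bold</b></xmp> as an end tag and report an error, although browsers that support xmp display the content as literal text.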
I have omitted some nonstandard elements that have been used to some extent, such as layer and nolayer. Although there is some descriptive documentation about them, it is sketchy, and the elements are partly difficult to describe in a DTD. Moreover, these elements, unlike some other nonstandard elements, are hardly interesting in practical authoring even if you are looking for special effects, unless you consider outdated browsers. The layer element was supported by Netscape 4, but modern versions of Netscape ignore it, and most other browsers never recognized it. Some elements, like multicol, would be interesting if support for them were not so limited.
I have not modified the DTD to allow very common tagsoup like the use of font markup around tables and other blocks. To describe such a soup in a DTD would probably mean that the syntactic distinction between inline elements and block elements is mostly removed.
Neither does the DTD include all commonly used extensions to attributes in elements that are themselves standard. Moreover, the attributes and other properties of the nonstandard elements included may vary quite a lot. That’s part of their being nonstandard.
For example, browsers that support elements such as nobr may well let you put blocks inside them, too. I have, however, defined them as inline elements in my tagsoup DTD, since normally there is little point in using them for anything but small fragments of text (even if you think there is some point in using them in the first place).
I have also prepared a frameset DTD, which essentially just “calls” the tagsoup DTD after defining a suitable entity so that the frameset alternative is picked up:
<!ENTITY % HTML.Frameset "INCLUDE">
<!ENTITY % HTML4.dtd SYSTEM "http://www.cs.tut.fi/~jkorpela/html/tagsoup.dtd">
%HTML4.dtd;
If you use frames, it might be a good idea to change the declaration
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES?)>
by removing the question mark. This would make the noframes element required, reminding authors of the recommendation to include alternate content for browsers and other user agents that do not process frames.
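After the change, the declaration would read as follows (a sketch of the edited line only):

```sgml
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES)>
```

Note that the content model applies to nested frameset elements as well, so each of them would then need a noframes element, too.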
If you wish to use the above-mentioned DTDs, it is better that you copy them and refer to your copy in your document type declarations. I may change the DTDs, perhaps adding some nonstandard elements or attributes.