Fix Up Your HTML with HTML Tidy and .NET
When standards change, your development efforts must often change with them. But change doesn't always have to be painful. If you're trying to upgrade your HTML pages to the latest standards, fix unclosed tags, find and fix deprecated features, and format all your Web pages consistently, HTML Tidy is just what the doctor ordered.  

You may never have heard of it, but HTML Tidy isn't new. HTML Tidy started out as a free application and is now open source. It was originally written in C as a command-line executable by W3C employee Dave Raggett, before being taken over as an open source initiative in 2000. Somewhat characteristically of open source efforts, it has managed to shun the limelight, yet an ever-increasing number of Web professionals rely on it daily to get their jobs done.


The principal reason it's so popular is that it combines syntactic, semantic, and stylistic advice in a single, highly configurable library. This means it can do more than simply fix unclosed or badly nested tags; it also understands document structure well enough to perform intelligent contextual cleanup, such as culling empty paragraphs, removing duplicate attributes, or inlining blocks of text. Marry all this with W3C's recommendations for Web site accessibility and doctype compliance, add a basic understanding of browser differences, throw in a cautionary dollop of really verbose markup from your "smart" HTML export program of choice, and you end up with a package that can not only fix many of your stylistic faults but also tell you in plain English how to become a better HTML coder. The fact that it exports fully standards-compliant XHTML is just icing on the cake.

HTML Tidy's Genesis
Today HTML Tidy exists in many forms. The C library on which it is based has now been ported to most major operating systems (Windows, various flavors of Unix, BSD, MacOS, and DOS), as well as some minor ones (like the Atari 520ST's GEM OS and the Amiga's OS3). You can download C++, Java, Delphi, Pascal, Perl, Python, and COM wrappers, and there's even a FrontPage 2000 plug-in.

There are currently two GUI implementations of HTML Tidy on the Windows platform: TidyGUI (see Figure 1) and TidyUI (see Figure 2).

 
Figure 1. HTML TidyGUI: The TidyGUI application provides access to a wealth of features and options.
 
Figure 2. The TidyUI Application: TidyUI is considerably more polished than TidyGUI.

Of the two, TidyUI is the more polished, with a lot of well-thought-out features (for example, the ability to tab between "tidied" and "original" documents to perform selective cut-and-paste, or to preview the revised document in situ). Still, both expose a wealth of configurable options, some obvious and some less so. Don't be disconcerted by the more esoteric ones; beyond their graphical interfaces, each program is really just a wrapper around the HTML Tidy library. Therefore, though each program may describe things differently, they end up doing the same things in the same way.

What neither can do, though, is run unattended to convert a set of HTML files into XHTML. Accordingly, I thought a simple batch converter might make a useful standalone project. It could offer a few basic options but generally run with a usable set of defaults, and in the process I could find out just how much work was involved in redeploying the library under .NET.
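To make the batch-conversion idea concrete before moving to .NET, here is a minimal sketch in Python that shells out to the command-line `tidy` executable rather than wrapping the library. It is not the NETTidy application this article builds. The function names (`build_tidy_command`, `plan_batch`, `convert_batch`), the `.xhtml` output extension, and the chosen defaults are all assumptions made for illustration. The `-quiet`, `-asxhtml`, and `-o` switches are the ones listed by current command-line builds of Tidy; run `tidy -help` to confirm what your build accepts.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple

HTML_SUFFIXES = {".htm", ".html"}


def build_tidy_command(source: Path, dest: Path, tidy_exe: str = "tidy") -> List[str]:
    """Build the argument list for one HTML-to-XHTML conversion (nothing is run)."""
    # Assumed defaults: quiet output, convert to XHTML, write to a separate file.
    return [tidy_exe, "-quiet", "-asxhtml", "-o", str(dest), str(source)]


def plan_batch(in_dir: Path, out_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every HTML file under in_dir with a mirrored .xhtml path under out_dir."""
    pairs = []
    for source in sorted(in_dir.rglob("*")):
        if source.is_file() and source.suffix.lower() in HTML_SUFFIXES:
            dest = out_dir / source.relative_to(in_dir).with_suffix(".xhtml")
            pairs.append((source, dest))
    return pairs


def convert_batch(in_dir: Path, out_dir: Path, tidy_exe: str = "tidy") -> Dict[Path, int]:
    """Run tidy unattended over a directory tree; return each file's exit code."""
    results = {}
    for source, dest in plan_batch(in_dir, out_dir):
        dest.parent.mkdir(parents=True, exist_ok=True)
        completed = subprocess.run(
            build_tidy_command(source, dest, tidy_exe),
            capture_output=True,
            text=True,
        )
        # Tidy exits with 0 when clean, 1 when it issued warnings, 2 on errors.
        results[source] = completed.returncode
    return results
```

Calling `convert_batch(Path("site"), Path("site-xhtml"))` would process every `.htm` and `.html` file under a hypothetical `site` directory, provided a `tidy` executable is on the path. Keeping the planning step separate from the execution step means the file mapping can be checked without Tidy installed.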
