Fix Up Your HTML with HTML Tidy and .NET

You may never have heard of it, but HTML Tidy isn’t new. It started life as a free application and is now open source. W3C employee Dave Raggett originally wrote it in C as a command-line executable, and it was taken over as an open source initiative in 2000. Somewhat characteristically of open source efforts, it has managed to shun the limelight, yet an ever-increasing number of Web professionals rely on it daily to get their jobs done.

The principal reason it’s so popular is that it combines syntactic, semantic, and stylistic advice in a single, highly configurable library. This means it can do more than simply fix unclosed or badly nested tags; it also understands document structure well enough to perform intelligent contextual cleanup, for example, culling empty paragraphs, removing duplicate attributes, or inlining blocks of text. Marry all this with W3C’s recommendations for Web site accessibility and doctype compliance, add a basic understanding of browser differences, throw in a cautionary dollop of really verbose markup from your “smart” HTML export program of choice, and you end up with a package that can not only fix many of your stylistic faults but can even tell you in plain English how to become a better HTML coder. The fact that it exports fully standards-compliant XHTML is just icing on the cake.

HTML Tidy’s Genesis
Today HTML Tidy exists in many forms. The C library on which it was based has now been ported to most major operating systems (Windows, various flavors of Unix, BSD, MacOS, and DOS), as well as some minor ones (like the Atari 520ST’s GEM o/s and the Amiga’s OS3). You can download C++, Java, Delphi, Pascal, Perl, Python, and COM wrappers, and there’s even a FrontPage 2000 plug-in.

There are currently two GUI implementations of HTML Tidy on the Windows platform: TidyGUI (see Figure 1) and TidyUI (see Figure 2).

Invoking HTML Tidy from .NET
There are basically two sensible ways to get HTML Tidy working on the .NET platform. You can either use P/Invoke to call directly into the DLL, or you can import one of the COM wrapper classes and use .NET’s interop capabilities to do the rest. I would personally recommend the COM wrapper, principally because .NET maps all the data types for you, and Charles Reitzel has already done the legwork required to turn the C-based DLL into manageable COM objects. You may also come across references to his “.NET wrapper,” but you can ignore those, because Visual Studio will generate a .NET wrapper automatically. If you prefer to create one yourself, follow this procedure:

  1. Download TidyATL.zip from http://users.rcn.com/creitzel/tidy.html
  2. Extract it somewhere on your hard drive (e.g. c:\libraries\TidyATL)
  3. Register TidyATL.dll using regsvr32.exe (issue the command regsvr32 TidyATL.dll at a command prompt, using appropriate paths). If you get a registration error, it’s probably because you either don’t have or haven’t registered ATL.dll (not likely on Windows 2000/XP)
  4. Open your Visual Studio project and add a reference to TidyATL.dll. Visual Studio will automatically generate a .NET wrapper for the COM DLL called Interop.TidyATL.dll, which it will place in your project directory. You could equally well have generated this wrapper explicitly with the command “tlbimp TidyATL.dll /out:XXX.dll” where XXX.dll is the name you want to give your exported .NET wrapper, and then added a reference to XXX.dll rather than to TidyATL.dll.
  5. Optionally, add a “using TidyATL;” (C#) or “Imports TidyATL” (VB.NET) directive to the top of each file where you want to use TidyATL.

If you prefer to let Visual Studio perform the steps, the sample NETTidy project includes TidyATL.dll and the relevant configuration, so all you need to do is open it up in Visual Studio and register TidyATL.dll (perform step 3 of the procedure).

Author’s Note: Though the COM wrapper is the easiest way to get at HTML Tidy’s functionality, it potentially suffers from versioning problems, since TidyATL.dll is statically linked to a particular build of the HTML Tidy library (at the time of writing, this was February 2003). But the source code to TidyATL is available, so you can rebuild it at leisure against any version of the HTML Tidy library you like.

Using TidyATL’s API
TidyATL exposes two top-level interfaces: IDocument and IDocumentEvents. The IDocument interface provides access to an automation-compliant version of HTML Tidy’s “document” object, which in turn exposes high level methods to load, configure, parse, reformat and save HTML, XHTML and XML-based documents. The IDocumentEvents interface is an event sink used to provide asynchronous diagnostic information to the caller.

To start, instantiate a Tidy.Document object:

   Tidy.Document doc = new Tidy.Document();

If you don’t care about diagnostic messages, you can ignore the IDocumentEvents interface. If you want to see such messages though, you should next declare and instantiate an appropriate delegate:

   // Set up an events callback
   doc.OnMessage += new
      Tidy.IDocumentEvents_OnMessageEventHandler
      (TidyDiagnostics);

   public void TidyDiagnostics(TidyATL.TidyReportLevel
      level, int line,
      int col, string message)
   {
      // Handle the callback message here
      ...
   }

It’s now simply a matter of reading in a file, setting some processing options, modifying the file according to those options, and then saving the result:

   // Read in the file
   int err_code = doc.ParseFile(file);

   if (err_code < 0)
      throw new Exception("Unable to parse file: " +
         file);

   // Choose "XHTML output" selection (as an example)
   doc.SetOptBool(TidyATL.TidyOptionId.TidyXhtmlOut, 1);

   // Set option to indent blocks automatically
   doc.SetOptInt(TidyATL.TidyOptionId.TidyIndentContent,
      2);

   // Set indent to 4 chars (as an example)
   doc.SetOptInt(TidyATL.TidyOptionId.TidyIndentSpaces,
      4);

   // Clean and repair the parsed document
   err_code = doc.CleanAndRepair();

   if (err_code < 0)
      throw new Exception(
         "Unable to clean/repair file: " + file);

   // Report diagnostics through the OnMessage callback
   err_code = doc.RunDiagnostics();

   if (err_code < 0)
      throw new Exception(
         "Unable to run diagnostics on file: " + file);

   // Commit tidied file
   doc.SaveFile(file);

As you can see, it takes very little code to get the HTML Tidy library up and running. What's missing from the example is the enormous enumeration of processing options accessible via TidyATL.TidyOptionId. Many of these are self-explanatory, although unfortunately it looks as if Mr. Reitzel didn't have time to document these options with helpstring attributes, meaning that IntelliSense leaves you on your own.

Your best resource for finding out what each option does and what parameter (if any) it takes is the somewhat terse Doxygen-based library documentation for the original C header files, or the Quick Reference guide. Note that some options are mutually incompatible. For example, if you set both TidyDoctypeMode and TidyXmlOut to TidyYesState (1), you won't get a doctype declaration produced, but if you instead request TidyXhtmlOut or TidyHtmlOut, you will. That said, I didn't come across too many surprises; I recommend that you experiment until you get the results you want.

Designing the NETTidy Application
The overall goal of this project was to redeploy the HTML Tidy library as a no-frills batch converter. Some subsidiary goals helped keep things simple.

The first was to choose a subset of HTML Tidy's configuration options in order to perform some small but specific task as completely as possible. For example, there are all kinds of options to replace ampersands and quotation marks, wrap particular markup sections in specific ways, or interpret specific markup tags. But I decided to stick with the options that control the horizontal layout of the page: namely, anything related to indentation, block specification, column width, and tab size. I also threw in a couple of "smart" options simply because they're so useful: one that removes the guff from HTML documents exported from Word 2000, and another that replaces "font" and "center" tags with stylesheet directives. Fundamentally though, NETTidy remains an application for editors who want to format code blocks to specified widths with minimal fuss.

The second was to make NETTidy as fault-tolerant as possible. Case in point: I originally added a set of radio buttons that let you choose between HTML, XHTML, and XML output. I thought this was a good idea, but I quickly decided against it after accidentally selecting "XML" output and running a set of HTML files through the converter. Admittedly I ended up with perfectly good XML, so perhaps it seems a bit churlish to complain. But while Internet Explorer understood my original HTML-based site, it couldn't make head or tail of my XML-based one; worse, converting the XML back to HTML was non-trivial, and I hadn't made a backup. So to prevent scenarios like this from happening again, I removed the buttons and instead hardwired the following rules: "If a file's extension is .HTM or .HTML, convert its content to XHTML, but if it's .XML, stick with XML." Yes, this does mean you can't explicitly request an XHTML to XML transformation, but I can't actually see why you'd want one. Feel free to change the code to suit your needs if you have esoteric requirements.
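The hardwired rule above can be sketched in a few lines. This is only an illustration of the idea, not NETTidy's actual source: the names OutputFormat, OutputRules, and ForFile are hypothetical, and the "Unsupported" case is an assumption about how other extensions might be treated.

```csharp
using System;
using System.IO;

// Hypothetical names: not taken from the NETTidy source.
public enum OutputFormat { Xhtml, Xml, Unsupported }

public static class OutputRules
{
    // Picks the output format from the file extension alone,
    // so the user can never request XML for an HTML file by accident.
    public static OutputFormat ForFile(string path)
    {
        if (path == null)
            throw new ArgumentNullException("path");

        // Compare case-insensitively: ".HTM" and ".htm" are the same rule.
        string ext = Path.GetExtension(path).ToLowerInvariant();

        switch (ext)
        {
            case ".htm":
            case ".html":
                return OutputFormat.Xhtml;   // HTML in, XHTML out
            case ".xml":
                return OutputFormat.Xml;     // XML stays XML
            default:
                return OutputFormat.Unsupported; // assumption: skip other files
        }
    }
}
```

The caller would then set either TidyXhtmlOut or TidyXmlOut on the document depending on the result, and skip any file reported as Unsupported.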

There was still the issue of my mangled HTML files, though. So I added a few lines of code to back up files to the temp directory before NETTidy gets its claws into them. This way, in the worst case scenario, you can simply copy them back if you change your mind or NETTidy's results don't meet your needs.
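A backup step of this kind might look roughly like the following. It is a minimal sketch under stated assumptions rather than the code NETTidy ships with: the BackupHelper class, the BackupToTemp method, and the "NETTidyBackup" folder name are all hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper: not taken from the NETTidy source.
public static class BackupHelper
{
    // Copies the file into a folder under the user's temp directory
    // and returns the full path of the backup copy.
    public static string BackupToTemp(string sourcePath)
    {
        // "NETTidyBackup" is an assumed folder name chosen for this sketch.
        string backupDir = Path.Combine(Path.GetTempPath(), "NETTidyBackup");
        Directory.CreateDirectory(backupDir); // no-op if it already exists

        string backupPath = Path.Combine(backupDir, Path.GetFileName(sourcePath));

        // Overwrite any older backup of a file with the same name.
        File.Copy(sourcePath, backupPath, true);
        return backupPath;
    }
}
```

Call BackupToTemp(file) immediately before doc.ParseFile(file). Note that this sketch keys backups on the file name only, so two files with the same name in different directories would overwrite each other's backup; a real implementation would probably mirror the directory structure as well.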

Persisting UI Preferences
Finally, it seemed like a good idea to persist whatever options you had chosen between sessions. The obvious place to do this was in the application's .config file, and so I took a cue from the article "How to Make Your .NET Windows Forms Configuration Files Dynamic," by Russell Jones, DevX's Executive Editor. This meant side-stepping the System.Configuration namespace and accessing the file directly as XML. As a result, I wasn't actually obliged to enforce the app.config file's traditional format, but I chose to keep with it anyway, just for good measure. The application serializes preferences (combo box selections, radio button, and checkbox states) to and from the file, using an XPath query to flatten the information as follows:

   // Get the node representing the tab setting
   node = doc.DocumentElement.SelectSingleNode(
      "//@value[parent::add/@key='tab']");

   // Serialize the txtTab text box to the
   // config file ...
   node.Value = txtTab.Text;

   // ... or deserialize the txtTab text
   // box from the config file
   txtTab.Text = node.Value;

Configuration properties are located within "add" nodes as follows:

   <add key="tab" value="..." />
   ... etc.

This has the effect of pulling out the "value" attribute associated with the "tab" attribute within a node of type "add," wherever it appears within the document. To avoid unhelpful error messages if you fail to deploy the config file together with the application, I've also added a fallback: If there's no config file, the application simply doesn't preserve preferences.
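To make the query and the fallback concrete, here is a minimal, self-contained sketch. The Preferences class and its Read and Write methods are hypothetical names invented for illustration, and the sketch assumes that keys are simple identifiers that contain no quote characters.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: not taken from the NETTidy source.
public static class Preferences
{
    // Builds the same kind of XPath query shown earlier, for any key.
    // Assumption: the key contains no quote characters.
    private static string QueryFor(string key)
    {
        return "//@value[parent::add/@key='" + key + "']";
    }

    // Returns the stored value, or the fallback if the file or key is missing.
    public static string Read(string configPath, string key, string fallback)
    {
        if (!File.Exists(configPath))
            return fallback; // no config file: carry on without preferences

        XmlDocument doc = new XmlDocument();
        doc.Load(configPath);
        XmlNode node = doc.DocumentElement.SelectSingleNode(QueryFor(key));
        return node != null ? node.Value : fallback;
    }

    // Stores the value; returns false if there was nowhere to store it.
    public static bool Write(string configPath, string key, string value)
    {
        if (!File.Exists(configPath))
            return false; // preferences are simply not preserved

        XmlDocument doc = new XmlDocument();
        doc.Load(configPath);
        XmlNode node = doc.DocumentElement.SelectSingleNode(QueryFor(key));
        if (node == null)
            return false;

        node.Value = value;
        doc.Save(configPath);
        return true;
    }
}
```

With this in place, loading a preference becomes txtTab.Text = Preferences.Read(path, "tab", "4"), where "4" is an arbitrary default chosen for the example.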

Tricks with the TreeView Control
You'll also notice that I've subclassed TreeNode to derive a class named StateTreeNode, which exposes a single Boolean property called EverOpened. This was to make browsing for a directory of files to convert more efficient. When you start the application, the tree is initially populated through a call to System.Environment.GetLogicalDrives(). But to give graphical feedback of which drives contain subdirectories (and can therefore be "expanded" in the tree), you need to go a level deeper. The application refers to EverOpened when you expand a node in the tree, to determine whether it has previously shown you that node's list of subdirectories, or whether it needs to go off to the drive in question and physically retrieve them. Hitting "F5" resets the flag, forcing the currently selected node to recalculate its immediate subdirectory structure.

You can see this in action by using the TreeView to browse to some location on your hard drive, and then adding or removing a subdirectory. Upon hitting F5, NETTidy picks up your changes and refreshes the display, just like Windows Explorer. The TreeView indexes folders on your hard drive on demand, rather than having to build a complete view in advance, which improves the application's responsiveness and reduces its startup time. It also means nodes within the tree retain their individual expanded or collapsed state until you explicitly refresh them, reducing flicker and making browsing easier. I personally work with TreeView controls a lot, and have found this subclassing technique invaluable, so feel free to reuse it in your own projects.
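The subclassing technique can be sketched as follows, assuming a Windows Forms project. Only the StateTreeNode class name and its EverOpened property come from the description above; the DirectoryPath property, the LazyTree helper, and its method names are hypothetical additions for illustration, and NETTidy's own implementation may differ.

```csharp
using System;
using System.IO;
using System.Windows.Forms;

// A TreeNode that remembers whether its subdirectories were ever listed.
public class StateTreeNode : TreeNode
{
    public StateTreeNode(string text, string directoryPath) : base(text)
    {
        DirectoryPath = directoryPath;
    }

    // Hypothetical property: the directory this node represents.
    public string DirectoryPath { get; private set; }

    public bool EverOpened { get; set; }
}

// Hypothetical helper: not taken from the NETTidy source.
public static class LazyTree
{
    // Replaces the node's children with its immediate subdirectories.
    public static void LoadChildren(StateTreeNode node)
    {
        node.Nodes.Clear();
        try
        {
            foreach (string dir in Directory.GetDirectories(node.DirectoryPath))
                node.Nodes.Add(new StateTreeNode(Path.GetFileName(dir), dir));
        }
        catch (IOException) { }                 // e.g. drive not ready
        catch (UnauthorizedAccessException) { } // e.g. protected folder
    }

    // Call from the TreeView's BeforeExpand handler. On first expansion,
    // goes one level deeper so each child shows whether it can be expanded.
    public static void OnBeforeExpand(StateTreeNode node)
    {
        if (node.EverOpened)
            return; // already indexed: do not touch the disk again

        foreach (StateTreeNode child in node.Nodes)
            LoadChildren(child);
        node.EverOpened = true;
    }

    // Call when F5 is pressed: resets the flag and re-reads the disk.
    public static void Refresh(StateTreeNode node)
    {
        node.EverOpened = false;
        LoadChildren(node);
        OnBeforeExpand(node);
    }
}
```

To wire this up, the BeforeExpand handler would cast e.Node to StateTreeNode and pass it to LazyTree.OnBeforeExpand, and a KeyDown handler would call LazyTree.Refresh on the selected node when e.KeyCode is Keys.F5.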

A Word of Caution
HTML Tidy is not a panacea for solving all your markup problems, and you should be prepared for the fact it may change working HTML into reformed HTML or XHTML that no longer "works." This is usually because the "working HTML" in question does not in fact comply with its doctype (explicit or implied), but your particular browser produces what appears to be "correct" behavior anyway. For example, I've been guilty of nesting TABLE tags within SPAN tags. According to the HTML 4.0 Transitional doctype I've been using, that isn't permissible, but in Internet Explorer 6 I end up with the effect I want all the same. However, if I were to update my doctype to XHTML 1.0, my tables would no longer position "correctly." While you can generally rely on HTML Tidy to alert you to potential problems like this, its resolutions may not always make immediate sense if you don't appreciate the logic behind its decisions. In this case, it took the following source:

               
Test

Then HTML Tidy rendered the output as follows:

                     
Test

Duplicating the SPAN tag might look like an error, but in fact the only sensible way to fix the illegal nesting is to close the SPAN before the table, and then reopen it again afterwards. HTML Tidy's diagnostics, sent to NETTidy's output panel, explain what it's done:

   TidyWarning: (6, 1): missing </span> before <table>
   TidyWarning: (10, 4): inserting implicit <span>

Despite such minor problems, in the final assessment, HTML Tidy is a powerful API for parsing, altering, and formatting HTML, and it continues to be developed and refined. As you've seen, it's easy to incorporate it into your .NET projects, and it's worth downloading and using for its diagnostics alone. NETTidy leverages only a little of its power; there's a lot still left there under the hood, so I encourage you to use it as a springboard for further development in your own projects.
