Fix Up Your HTML with HTML Tidy and .NET : Page 2

When standards change, your development efforts must often change with them. But change doesn't always have to be painful. If you're trying to upgrade your HTML pages to the latest standards, fix unclosed tags, find and fix deprecated features, and format all your Web pages consistently, HTML Tidy is just what the doctor ordered.





Invoking HTML Tidy from .NET
There are basically two sensible ways to get HTML Tidy working on the .NET platform. You can either use P/Invoke to call directly into the DLL, or you can import one of the COM wrapper classes and use .NET's interop capabilities to do the rest. I would personally recommend the COM wrapper, principally because .NET maps all the data types for you, and Charles Reitzel has already done the legwork required to turn the C-based DLL into manageable COM objects. You may also come across references to his ".NET wrapper," but you can ignore those, because Visual Studio will generate a .NET wrapper automatically. If you prefer to create one yourself, follow this procedure:

  1. Download TidyATL.zip from http://users.rcn.com/creitzel/tidy.html
  2. Extract it somewhere on your hard drive (e.g. c:\libraries\TidyATL)
  3. Register TidyATL.dll using regsvr32.exe (issue the command regsvr32 TidyATL.dll at a command prompt, using appropriate paths). If you get a registration error, it's probably because ATL.dll is missing or has not been registered (not likely on Windows 2000/XP).
  4. Open your Visual Studio project and add a reference to TidyATL.dll. Visual Studio will automatically generate a .NET wrapper for the COM DLL called Interop.TidyATL.dll, which it will place in your project directory. You could equally well have generated this wrapper explicitly with the command "tlbimp TidyATL.dll /out:XXX.dll" where XXX.dll is the name you want to give your exported .NET wrapper, and then added a reference to XXX.dll rather than to TidyATL.dll.
  5. Optionally, add a "using TidyATL;" (C#) or "Imports TidyATL" (VB.NET) directive to the top of each file where you want to use TidyATL.
If you prefer to let Visual Studio perform the steps, the sample NETTidy project includes TidyATL.dll and the relevant configuration, so all you need to do is open it up in Visual Studio and register TidyATL.dll (perform step 3 of the procedure).

Author's Note: Though the COM wrapper is the easiest way to get at HTML Tidy's functionality, it potentially suffers from versioning problems, since TidyATL.dll is statically linked to a particular build of the HTML Tidy library (at the time of writing, this was February 2003). But the source code to TidyATL is available, so you can rebuild it at leisure against any version of the HTML Tidy library you like.
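
For comparison, here is a minimal sketch of the other route mentioned above, calling the HTML Tidy DLL directly through P/Invoke. It is not the approach this article recommends, and it rests on assumptions: that you have a build of the library named libtidy.dll on the DLL search path (the file name is hypothetical and depends on your build), and that it exports the standard C entry points tidyCreate, tidyParseFile, tidyCleanAndRepair, tidySaveFile and tidyRelease. The NativeTidy class and its TidyFile method are illustrative names, not part of any library.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeTidy
{
    // Assumption: the HTML Tidy library was built as "libtidy.dll".
    private const string TidyDll = "libtidy.dll";

    // TidyDoc is an opaque pointer in the C API, so IntPtr stands in for it.
    [DllImport(TidyDll)]
    private static extern IntPtr tidyCreate();

    [DllImport(TidyDll)]
    private static extern void tidyRelease(IntPtr doc);

    [DllImport(TidyDll, CharSet = CharSet.Ansi)]
    private static extern int tidyParseFile(IntPtr doc, string path);

    [DllImport(TidyDll)]
    private static extern int tidyCleanAndRepair(IntPtr doc);

    [DllImport(TidyDll, CharSet = CharSet.Ansi)]
    private static extern int tidySaveFile(IntPtr doc, string path);

    // Hypothetical helper: tidy inputPath and write the result to outputPath.
    public static void TidyFile(string inputPath, string outputPath)
    {
        IntPtr doc = tidyCreate();
        try
        {
            // The C functions report severe failures as negative values.
            if (tidyParseFile(doc, inputPath) < 0)
                throw new Exception("Unable to parse file: " + inputPath);
            if (tidyCleanAndRepair(doc) < 0)
                throw new Exception("Unable to clean/repair file: " + inputPath);
            if (tidySaveFile(doc, outputPath) < 0)
                throw new Exception("Unable to save file: " + outputPath);
        }
        finally
        {
            // Always free the native document, even if a step failed.
            tidyRelease(doc);
        }
    }
}
```

The trade-off is visible even in this small sketch: every function needs a hand-written declaration with the right data types, which is the work the COM wrapper does for you.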

Using TidyATL's API
TidyATL exposes two top-level interfaces: IDocument and IDocumentEvents. The IDocument interface provides access to an automation-compliant version of HTML Tidy's "document" object, which in turn exposes high level methods to load, configure, parse, reformat and save HTML, XHTML and XML-based documents. The IDocumentEvents interface is an event sink used to provide asynchronous diagnostic information to the caller.

To start, instantiate a Tidy.Document object:

Tidy.Document doc = new Tidy.Document();

If you don't care about diagnostic messages, you can ignore the IDocumentEvents interface. If you want to see such messages though, you should next declare and instantiate an appropriate delegate:

// Set up an events callback
doc.OnMessage += new Tidy.IDocumentEvents_OnMessageEventHandler(TidyDiagnostics);

public void TidyDiagnostics(TidyATL.TidyReportLevel level, int line, int col, string message)
{
    // Handle the callback message here
    ...
}

It's now simply a matter of reading in a file, setting some processing options, modifying the file according to those options, and then saving the result:

// Load and parse the file
int err_code = doc.ParseFile(file);
if (err_code < 0)
    throw new Exception("Unable to parse file: " + file);

// Choose "XHTML output" selection (as an example)
doc.SetOptBool(TidyATL.TidyOptionId.TidyXhtmlOut, 1);

// Set option to indent blocks automatically
doc.SetOptInt(TidyATL.TidyOptionId.TidyIndentContent, 2);

// Set indent to 4 chars (as an example)
doc.SetOptInt(TidyATL.TidyOptionId.TidyIndentSpaces, 4);

// Clean and repair the parsed document
err_code = doc.CleanAndRepair();
if (err_code < 0)
    throw new Exception("Unable to clean/repair file: " + file);

// Report remaining problems through the OnMessage callback
err_code = doc.RunDiagnostics();
if (err_code < 0)
    throw new Exception("Unable to run diagnostics on file: " + file);

// Commit tidied file
doc.SaveFile(file);
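
The pieces above can be combined into one small class. The following is a sketch rather than a definitive implementation: the TidyRunner class, its TidyToFile method and the Check helper are hypothetical names introduced for illustration. The TidyATL type names are copied exactly as they appear in the snippets above, so you may need to adjust the namespace to match whatever your generated interop assembly exposes. Unlike the example above, it saves to a separate output path so the original file is left untouched.

```csharp
using System;
using System.Collections.Generic;

public class TidyRunner
{
    // Diagnostic messages gathered from the OnMessage callback.
    private readonly List<string> messages = new List<string>();

    public IList<string> Messages
    {
        get { return messages; }
    }

    // Hypothetical helper: tidy inputPath and write the result to outputPath.
    public void TidyToFile(string inputPath, string outputPath)
    {
        Tidy.Document doc = new Tidy.Document();

        // Hook up the event sink before doing any work,
        // so that no diagnostic messages are missed.
        doc.OnMessage +=
            new Tidy.IDocumentEvents_OnMessageEventHandler(OnTidyMessage);

        Check(doc.ParseFile(inputPath), "parse", inputPath);

        // Same example options as in the snippet above.
        doc.SetOptBool(TidyATL.TidyOptionId.TidyXhtmlOut, 1);
        doc.SetOptInt(TidyATL.TidyOptionId.TidyIndentContent, 2);
        doc.SetOptInt(TidyATL.TidyOptionId.TidyIndentSpaces, 4);

        Check(doc.CleanAndRepair(), "clean/repair", inputPath);
        Check(doc.RunDiagnostics(), "run diagnostics on", inputPath);

        doc.SaveFile(outputPath);
    }

    // Stores each diagnostic as a single readable line.
    private void OnTidyMessage(TidyATL.TidyReportLevel level,
                               int line, int col, string message)
    {
        messages.Add(level + " at line " + line + ", column " + col
                     + ": " + message);
    }

    // Negative return codes are treated as failures, as in the article.
    private static void Check(int errCode, string action, string path)
    {
        if (errCode < 0)
            throw new Exception("Unable to " + action + " file: " + path);
    }
}
```

A caller would create a TidyRunner, call TidyToFile with an input and an output path, and then inspect Messages to see what HTML Tidy reported.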

As you can see, it takes very little code to get the HTML Tidy library up and running. What's missing from the example is the enormous enumeration of processing options accessible via TidyATL.TidyOptionId. Many of these are self-explanatory, although unfortunately it looks as if Mr. Reitzel didn't have time to document these options with helpstring attributes, meaning that IntelliSense leaves you on your own.

Your best resource for finding out what each option does and what parameter (if any) it takes is the somewhat terse Doxygen-based library documentation for the original C header files, or the Quick Reference guide. Note that some options are mutually incompatible. For example, if you set both TidyDoctypeMode and TidyXmlOut to TidyYesState (1), no doctype declaration is produced, but if you instead request TidyXhtmlOut or TidyHtmlOut, you do get one. That said, I didn't come across too many surprises; I recommend that you experiment until you get the results you want.
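
One way to carry out that kind of experiment is to run the same input through several option settings and compare the results side by side. The sketch below does so for the three output-format options named above. The OptionExperiment class and the output file naming scheme are hypothetical, the type names follow the earlier snippets, and it assumes SetOptBool accepts any of these option ids in the same way it accepts TidyXhtmlOut.

```csharp
using System;

public static class OptionExperiment
{
    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: OptionExperiment <input file>");
            return;
        }
        string input = args[0];

        // The output-format options mentioned in the text.
        TidyATL.TidyOptionId[] formats =
        {
            TidyATL.TidyOptionId.TidyXmlOut,
            TidyATL.TidyOptionId.TidyXhtmlOut,
            TidyATL.TidyOptionId.TidyHtmlOut
        };

        foreach (TidyATL.TidyOptionId format in formats)
        {
            // Use a fresh document per run so that options
            // set in one iteration cannot leak into the next.
            Tidy.Document doc = new Tidy.Document();

            if (doc.ParseFile(input) < 0)
                throw new Exception("Unable to parse file: " + input);

            doc.SetOptBool(format, 1);

            if (doc.CleanAndRepair() < 0)
                throw new Exception("Unable to clean/repair file: " + input);

            // Hypothetical naming scheme, e.g. page.html.TidyXmlOut.out
            string output = input + "." + format + ".out";
            doc.SaveFile(output);
            Console.WriteLine("Wrote " + output);
        }
    }
}
```

Comparing the generated files in a diff tool shows directly which settings emit a doctype declaration and which do not.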
