The granddaddy of HTML tools, with support for modern standards
...The HTML Tidy library, libtidy, is incorporated into many applications and projects. It offers an extensive API for reading and parsing HTML from a file or buffer into a DOM-like node tree, along with cleaning, diagnostic, and other services.