HTMLDOM cannot correctly parse HTML elements without a closing tag
HTML parser which can be used for screen-scraping applications
Brought to you by:
bhimsen92
htmldom cannot parse elements that have no closing tag.
For example:
<AREA SHAPE="RECT" COORDS="2,2,95,30" HREF="../index.shtml" alt="Home">
According to the HTML standard this is acceptable (AREA is a void element and takes no closing tag), but htmldom fails to parse it correctly.
A complete script that reproduces the problem:
from htmldom import htmldom
dom = htmldom.HtmlDom().createDom("""
<body>
<MAP NAME="top_nav_map">
<AREA SHAPE="RECT" COORDS="2,2,95,30" HREF="../index.shtml" alt="Home">
<AREA SHAPE="RECT" COORDS="99,2,220,30" HREF="../Components.shtml" alt="Components">
<AREA SHAPE="RECT" COORDS="224,2,319,30" HREF="../HardwareMain.shtml" alt="Hardware">
<AREA SHAPE="RECT" COORDS="324,2,402,30" HREF="../Boards.shtml" alt="Boards">
<AREA SHAPE="RECT" COORDS="406,2,477,30" HREF="../BooksMain.shtml" alt="Books">
<AREA SHAPE="RECT" COORDS="482,2,535,30" HREF="../Kits.shtml" alt="Kits">
</MAP>
<h1>Hello</h1>
</body>
""")
body = dom.find("body")
print(body.html())
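This code prints the following. Note the spurious </area> closing tags, and that <h1> ends up nested inside AREA elements: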
<body>
<map NAME="top_nav_map">
<area COORDS="2,2,95,30" SHAPE="RECT" alt="Home" HREF="../index.shtml">
<area COORDS="99,2,220,30" SHAPE="RECT" alt="Components" HREF="../Components.shtml">
<area COORDS="224,2,319,30" SHAPE="RECT" alt="Hardware" HREF="../HardwareMain.shtml">
<area COORDS="324,2,402,30" SHAPE="RECT" alt="Boards" HREF="../Boards.shtml">
<area COORDS="406,2,477,30" SHAPE="RECT" alt="Books" HREF="../BooksMain.shtml">
<area COORDS="482,2,535,30" SHAPE="RECT" alt="Kits" HREF="../Kits.shtml">
</area>
<h1>
Hello
</h1>
</area>
</area>
</area>
</area>
</area>
</map>
</body>
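For comparison, here is a minimal sketch of the nesting I would expect, written with Python's standard html.parser instead of htmldom. It is not htmldom's code and not a proposed patch; the names VOID_ELEMENTS, OutlineParser and outline are made up for this illustration, and the set of void elements is the list from the HTML standard as I understand it. The idea is only that a void element such as AREA must never become the parent of what follows it.

```python
from html.parser import HTMLParser

# Assumption: void elements as listed in the HTML standard (no closing tag allowed).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Record (depth, tag) pairs without ever descending into a void element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already lowercased.
        self.nodes.append((self.depth, tag))
        if tag not in VOID_ELEMENTS:
            self.depth += 1  # only non-void elements can have children

    def handle_endtag(self, tag):
        # Ignore a stray closing tag for a void element, e.g. </area>.
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


def outline(html):
    """Return the list of (depth, tag) pairs for the given markup."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    sample = ('<body><MAP NAME="top_nav_map">'
              '<AREA SHAPE="RECT" COORDS="2,2,95,30" HREF="../index.shtml" alt="Home">'
              '<AREA SHAPE="RECT" COORDS="482,2,535,30" HREF="../Kits.shtml" alt="Kits">'
              '</MAP><h1>Hello</h1></body>')
    for depth, tag in outline(sample):
        print("  " * depth + tag)
```

With this approach both AREA elements are siblings inside MAP, and h1 is a direct child of body, which is the structure I would expect htmldom to produce as well.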
Anonymous