From: Larry W. V. <lv...@ca...> - 2002-01-22 16:10:45
Hello,

I have cobbled together a small script that demonstrates a technique I am trying. The goal: given a 'page' of HTML, I would like to print out the text portion of the page. Here's my script and the output I have so far:

    $ cat trial.tcl
    package require http
    package require htmlparse
    package require struct
    package require textutil
    package require log

    proc treeprint {t n} {
        set tp [$t get $n -key type]
        set d [$t depth $n]
        set idx ""
        catch {set idx [$t index $n]}
        incr d $d
        incr d $d
        switch -exact -- $tp {
            a {
                log::log debug "[textutil::strRepeat " " $d]$idx $tp ([$t get $n -key data]...)"
            }
            PCDATA {
                log::log debug "[textutil::strRepeat " " $d]$idx $tp ([string range [$t get $n -key data] 0 20]...)"
            }
            default {
                log::log debug "[textutil::strRepeat " " $d]$idx $tp"
            }
        }
    }

    set html {<HTML><BODY><hr>test é<br>data</BODY></HTML>}

    # Convert the resulting html
    set html2 [::htmlparse::mapEscapes $html]

    ::struct::tree::tree t
    ::htmlparse::2tree $html t
    ::htmlparse::removeVisualFluff t
    ::htmlparse::removeFormDefs t

    puts $html
    puts $html2

    # what do I put here to see the resulting $t ?
    t walk root -command {treeprint %t %n}

    $ /usr/tcl84/bin/tclsh trial.tcl
    <HTML><BODY><hr>test é<br>data</BODY></HTML>
    <HTML><BODY><hr>test é<br>data</BODY></HTML>
    debug root
    debug 0 body
    debug 0 hr
    debug 0 PCDATA (test é...)
    debug 1 br
    debug 2 PCDATA (data...)

So I am getting pretty close. Can someone give me a tip as to how I could get just those PCDATA types and then get out just the text information?

--
Never apply a Star Trek solution to a Babylon 5 problem.
Larry W. Virden <mailto:lv...@ca...> <URL: http://www.purl.org/NET/lvirden/>
Even if explicitly stated to the contrary, nothing in this posting should be construed as representing my employer's opinions.
-><-