Very interesting reading. Thanks for sharing this info. At one point last
year I was proposing a content scraping toolkit to a potential client based
on Tidy -> XHTML -> XSLT -> Data. There is a Perl package that uses this
approach.
It sounds like you use "disposable" TidyDoc objects: use one once and throw
it away. IMO, this approach favors the static lookup tables for two
reasons: a) reduced startup time (no hash table creation) and b) reduced
memory footprint. I.e., each object can literally share the static lookup
tables in memory, which makes the code CPU-cache-friendly.
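For clarity, by "disposable" I mean roughly this usage pattern (a sketch
against the TidyLib calls as I read them in tidy.h/buffio.h; untested):

  #include <tidy.h>
  #include <buffio.h>
  #include <stdio.h>

  int main( void )
  {
      const char *html = "<title>test</title><p>hello";
      TidyDoc tdoc = tidyCreate();     /* fresh object; shares static tables */
      TidyBuffer out;
      tidyBufInit( &out );
      if ( tidyParseString(tdoc, html) >= 0 &&
           tidyCleanAndRepair(tdoc) >= 0 &&
           tidySaveBuffer(tdoc, &out) >= 0 )
          printf( "%s", out.bp );
      tidyBufFree( &out );
      tidyRelease( tdoc );             /* ...and throw it away */
      return 0;
  }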
Still, as you say, the balance does depend on the content itself: document
size, mix of elements, etc. When you get to that point, do you mind
sharing the results of your profiling? My prediction: I/O is the biggest
factor by far.
Also, have you tried building with Björn's character set handling
extensions turned on? Grep for TIDY_ICONV_SUPPORT and
TIDY_WIN32_MLANG_SUPPORT. It appears that the WIN32 side of things is
pretty much there; not sure about ICONV. You should be able to use these
as a model to write an ICU adapter.
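For what it's worth, the iconv calling sequence is small enough to serve as
the template (a sketch, error handling elided; an ICU adapter would mirror
the same open/convert/close shape):

  #include <iconv.h>

  /* convert a chunk from a sniffed charset to UTF-8 */
  size_t to_utf8( const char *from_cs, char *in, size_t inlen,
                  char *out, size_t outlen )
  {
      iconv_t cd = iconv_open( "UTF-8", from_cs );
      size_t rc;
      if ( cd == (iconv_t)-1 )
          return (size_t)-1;           /* unknown charset */
      rc = iconv( cd, &in, &inlen, &out, &outlen );
      iconv_close( cd );
      return rc;
  }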
off to work,
At 03:22 PM 9/16/2003 +0100, Mark Weaver wrote:
> > Sounds good to me. I can't help but think your allocator idea would
> > appeal to embedded/PDA developers as well.
>Definitely. Custom allocators help code fit into all sorts of odd places,
>and generally make a library a lot easier to use in varied environments.
>I'll plump for that strategy then, and provide a patch shortly.
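>The shape I have in mind is just a small vtable the caller supplies
>(names here are placeholders, not a final API):
>
>  typedef struct _TidyAllocator TidyAllocator;
>  struct _TidyAllocator {
>      void *(*alloc)  ( TidyAllocator *self, size_t nBytes );
>      void *(*realloc)( TidyAllocator *self, void *block, size_t nBytes );
>      void  (*free)   ( TidyAllocator *self, void *block );
>      void  (*panic)  ( TidyAllocator *self, const char *msg );
>  };
>
>Embedded folks could then route everything through a fixed arena and get
>predictable memory behaviour for free.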
> > Question: for server side work, do you have an opinion
> > about name lookups? I.e. using a hash vs. a linear
> > search. In various systems, I have found the setup
> > time/space for creating the hash tables is not always
> > worth the reduced lookup times for lists of fewer than
> > ~1000 items. Since you are dealing with large volumes of
> > content, have you played with this?
>Not yet, I'm still on correctness as opposed to speed. When I get around
>to speed optimisation I tend just to run a profiler over the whole system
>and then see what's consuming the majority of the time for typical
>documents.
>For hashes in general, whether creating a hash is worth it depends heavily
>on the running time of the hash function vs. the comparison function for
>small tables, and on the number of lookups that are performed. It's also
>affected by getting the table size large enough at the start to avoid too
>much rehashing as the table grows. I'd need to collect some data on
>typical documents before I could say for sure what the best strategy here
>would be.
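>To put the trade-off in concrete terms, the baseline a hash has to beat
>is only this (hypothetical types; tidy's real tables differ):
>
>  #include <string.h>
>
>  typedef struct { const char *name; int id; } Tag;
>
>  /* linear scan: zero setup cost, O(n) per lookup -- hard to beat when
>     n is small and the number of lookups is modest */
>  const Tag *lookup_linear( const Tag *tags, size_t n, const char *name )
>  {
>      size_t i;
>      for ( i = 0; i < n; ++i )
>          if ( strcmp(tags[i].name, name) == 0 )
>              return &tags[i];
>      return NULL;
>  }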
>In overall performance terms, the only huge sore thumb I found was
>CheckNodeIntegrity, which is (approximately) quadratic in the number of
>nodes. On smaller (<3Gb) data sets I never saw it actually fail -- it
>looks to me like a debug aid (I compile it out with the appropriate
>#define these days). You can see this behaviour by creating, say, a big
>index page with a huge list of hrefs (e.g. a directory listing from
>apache on a directory with 10k documents).
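>A quick way to cook up such a page, if you want to watch it happen:
>
>  #include <stdio.h>
>
>  /* emit an index page with 10k hrefs; pipe the output into tidy */
>  int main( void )
>  {
>      int i;
>      printf( "<html><head><title>big</title></head><body><ul>\n" );
>      for ( i = 0; i < 10000; ++i )
>          printf( "<li><a href=\"doc%d.html\">doc %d</a></li>\n", i, i );
>      printf( "</ul></body></html>\n" );
>      return 0;
>  }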
> > Also, what is your programming style? Do you a) create
> > and destroy many TidyDoc objects, or b) keep one
> > around and use it over and over? Approach a) is
> > appropriate for multi-threaded systems (one per
> > thread) and b) might be better for a single-threaded
> > approach. In general, I would be interested in hearing
> > about your test bed.
>I do a). The code relies on being able to index documents in parallel, to
>keep the CPU usage up in the face of I/O waits, and to take advantage of
>the extra CPU power available in SMP machines.
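>In outline, each worker owns its own TidyDoc, so nothing is shared
>between threads (a sketch; the real code has work queues and error
>handling):
>
>  #include <pthread.h>
>  #include <tidy.h>
>
>  void *index_worker( void *arg )
>  {
>      const char *html = (const char *)arg;  /* one document's markup */
>      TidyDoc tdoc = tidyCreate();           /* per-thread, disposable */
>      if ( tidyParseString(tdoc, html) >= 0 )
>      {
>          /* ... walk the tree: title, meta tags, links, headings ... */
>      }
>      tidyRelease( tdoc );
>      return NULL;
>  }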
>The test bed is our indexing software, which is mainly focused on
>categorising documents (and automatically spotting new documents that fit
>the bill). The data is collected from websites, so that e.g. job listings
>or news items are picked up automatically. This will be used in "portal"
>sites that have groups of member websites and want to keep the listings
>for their member companies up to date. Because we collect our data in
>this fashion, we get enormous quantities of wacky and varied HTML pages
>to throw at tidy.
>So far, it's coped admirably! We also store the tidied page as XML, which
>can be used to display custom search results very nicely with just a bit
>of XSLT.
>In the indexing code, we use tidy to get a nice tree view of the HTML,
>which makes for easy extraction of information (such as title, meta tags,
>font sizes/headings/emphasis/links), all of which can be used to help with
>ranking the search results. We've also got a preprocessor that scans for
>the character set and then attaches ICU (http://oss.software.ibm.com/icu/)
>to the document if it's something wacky -- I will probably look into
>hooking this directly into tidy at some point to save the preprocessing,
>and bring the benefit of being able to understand virtually any character
>set.
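>The glue for that is small; the preprocessor boils down to something
>like this (a sketch with a hypothetical helper; buffer sizing and
>fallbacks not shown):
>
>  #include <unicode/ucnv.h>
>
>  /* recode a buffer to UTF-8 once the charset has been sniffed */
>  int32_t recode_to_utf8( const char *charset,
>                          const char *src, int32_t srcLen,
>                          char *dst, int32_t dstCap )
>  {
>      UErrorCode status = U_ZERO_ERROR;
>      int32_t n = ucnv_convert( "UTF-8", charset,
>                                dst, dstCap, src, srcLen, &status );
>      return U_FAILURE(status) ? -1 : n;
>  }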
>Character sets unfortunately are a bit of a black art, as most people
>don't seem to understand them properly (the most common error so far is
>iso-8859-1 pages with "charset=utf8" meta tags in them), which is
>something I'm still working on.