[Themis-dev] Another code base

Hi all,

  As Kevin told you earlier, I'm the main dev of another project called UZI.
I won't go into too much detail about UZI; I'll try to describe
what I've done, my choices, and my future commitments.

Starting in Jan 07, I had an idea for a new kind of "office"
software. I started documenting it: the functionality, the
requirements, and so on.
After some time (I could explain what UZI is in private email, as it's
off-topic for Themis), I came to the conclusion that I needed an HTML
renderer.

So I started searching the internet for every renderer source code I
could find.
Obviously, I came across the very common ones:
  - WebKit
  - Gecko

I've also resurrected links to the outsiders:
  - Themis
  - CHTMLViewer
  - Tcl/Tk HTML
  - Links
  - All those based on GLib

I've looked at the sources, and I've drawn the following conclusions:
  - WebKit, cross-platform, dependent on Quartz or, for the KHTML engine,
Qt. C++, easy to understand, quite clean, maintained. Not vector-based
rendering
  - Gecko, cross-platform, dependent on too many libraries, huge. C++,
but using extensions like XPCOM
  - Themis, dependent on the BeOS API, quite clean, not supported anymore.
C++, easy to understand.
  - CHTMLViewer, dependent on MFC, could easily be changed to plain Win32
code, but specific to Win32 anyway. C++ code, but messy.
  - Tcl/Tk HTML, cross-platform, but dependent on Tcl/Tk, which I don't know
  - others...

I wanted UZI to be cross-platform, so I only had 3 choices (actually, 4
if including Themis with some rewriting).
I excluded WebKit because of the licenses of its dependencies, and Gecko
because of its dependencies.
I decided not to learn another programming language, so I excluded the
Tcl/Tk browser too (which is CSS 2 compliant, and passes the Acid2 test BTW).

So I finally started to look more closely into Themis in March 07.
I checked out the code. The boost dependency was very unexpected (as
it doesn't compile everywhere, and slows any compilation by a factor of
at least 2).
Other dependencies were cryptlib and libz. The JavaScript engine looked
like Mozilla's.
I then dug into the modules folder, and started examining the code down
there.
The HTML parser used a lot of heap allocation (new ...), without an obvious
object life cycle (where does the object go, is it shared, and so
on...). Surely the devs relied on boost shared_ptr for deallocation.
The parser also defined its internal constants as char *, so it will
break on Unicode content. I've checked the fetching code,
and it didn't convert to UTF-8, so I thought it was WIP.
The parser was cleanly structured, with every kind of element in a
class hierarchy.
However, the class hierarchy didn't look that obvious to me.

I would have expected an HTMLParser to be an SGMLParser (as HTML is an
application of SGML).
Instead, HTMLParser depends on BeOS code (why?), deals
with OS-specific tasks and message handling (why?),
and in fact doesn't parse anything.

The DOM tree is correct (in fact it's a very clean implementation of the
DOM interfaces).
The renderer was nonexistent (no rendering at all; in fact, it even
lacked the CSS box model algorithm to map DOM nodes to boxes and
reflow them).

As I didn't take the time to understand the structure of the code, creating
a renderer from the Themis project would have required studying it in
detail, and rewriting a lot of BeOS-specific code (probably even
rewriting the hierarchy).

So finally, I decided to roll my own.
I favored compliant and memory-clean code over speed.
I started by implementing the DOM v2.1 interfaces in UZI.
Then a charset-aware SGML parser (even though only UTF-8 and latin1 are
supported for now).
Then I added an HTML parser to produce a DOM tree.
I then added a CSS box model mapper (including CSS type parsing and
structures), and table rendering.
I finally implemented a renderer for all the CSS boxes produced by the
mapper.
I've added a CacheEngine with fast lookup too.

My design included allocators from the beginning (meaning there is no
"new" in the code except in the allocators, and every object is returned
to the allocator when it is no longer needed). This means that I'll be able
to reuse returned objects instead of calling the heap again (a page like
news.google.fr can take up to 12MB of memory just for storing the pictures
and text).
I also added test vectors from the beginning (I'm amazed to see that
there are still so many projects around that don't even test their
code's correctness), so I'm sure the DOM implementation works 100% as
expected. The HTML parser works 100% as expected on VALID documents with
a supported charset.
Testing the CSS mapping isn't conclusive yet (and testing it
automatically is very difficult), so there is no test vector for it,
and, as obvious as it may seem, most of the issues I had (and still
have) come from exactly that part.
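
To illustrate the allocator idea above, here is a minimal sketch. It is
not UZI's actual allocator code: the names PoolAllocator, acquire,
release and TextNode are hypothetical, and it uses the C++ standard
library for brevity even though UZI itself avoids it.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical pooling allocator: the only place that talks to the heap.
// Storage of released objects is kept and handed out again on the next
// acquire() instead of going back to the heap.
template <typename T>
class PoolAllocator {
public:
    PoolAllocator() = default;
    PoolAllocator(const PoolAllocator&) = delete;
    PoolAllocator& operator=(const PoolAllocator&) = delete;

    ~PoolAllocator() {
        // Only storage that was returned is freed here; objects that were
        // never released are the caller's responsibility in this sketch.
        for (void* p : free_) ::operator delete(p);
    }

    // Hand out an object, reusing returned storage when available.
    template <typename... Args>
    T* acquire(Args&&... args) {
        void* mem;
        if (!free_.empty()) {
            mem = free_.back();
            free_.pop_back();
            ++reused_;
        } else {
            mem = ::operator new(sizeof(T));
        }
        return new (mem) T(std::forward<Args>(args)...);
    }

    // Return an object: destroy it but keep its storage for later reuse.
    void release(T* obj) {
        if (!obj) return;
        obj->~T();
        free_.push_back(obj);
    }

    std::size_t reusedCount() const { return reused_; }
    std::size_t freeCount() const { return free_.size(); }

private:
    std::vector<void*> free_;   // storage of released objects
    std::size_t reused_ = 0;    // how many acquires avoided the heap
};

// Hypothetical object type stored in the pool.
struct TextNode {
    int length;
    explicit TextNode(int len) : length(len) {}
};
```

With such a pool, acquiring a TextNode, releasing it and acquiring
another one touches the heap only once; the second acquire is served
from the returned storage.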

I wanted the design to be as platform independent as possible.
As such, from the HTML parser up to the boxes to render, there is no
platform-specific code. There are no dependencies at all (not even on the
STL, as cross-platform STL code doesn't work) except the C library.

I've reduced the renderer to a single 14-method interface, so
implementing a renderer only requires implementing 14 methods (for
measuring text, images and replaced content...).
The HTML parser takes a stream as input, and it currently accepts UTF-8
streams, but adding charsets is planned (it requires implementing a
28-method interface).
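
As an illustration of what charset support boils down to, here is a
minimal sketch that converts latin1 (ISO-8859-1) bytes to UTF-8. The
function name latin1ToUtf8 is hypothetical, it is far simpler than the
28-method interface mentioned above, and it uses std::string for
brevity, which UZI itself does not.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: convert ISO-8859-1 (latin1) bytes to UTF-8.
// Every latin1 byte maps to the Unicode code point of the same value,
// so bytes below 0x80 are copied and the rest become two UTF-8 bytes.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // trail byte
        }
    }
    return out;
}
```

For example, the latin1 byte 0xE9 (e with acute accent) becomes the two
bytes 0xC3 0xA9, while plain ASCII text is left untouched.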

Having done this, I needed to check all that code in real-world scenarios.
I've chosen Juce ( http://www.rawmaterialsoftware.com/juce ) as the
cross-platform engine because it natively supports vector graphics (this
is needed for UZI), and it's really, really clean C++.
I've then implemented the renderer in a Juce component (the total
renderer code is 2k lines, including my own Juce code for rendering).
I've implemented an HTTP client using Juce primitives, a thread pool for
fetching jobs, a DOM viewer, an error console, and finally an almost
fully functional browser.

Phew, here we are.
I needed a browser to test the renderer on real-world code.
I started testing sites around, and got very pleasant and unpleasant
surprises.
The hardest part is probably creating the DOM tree the author
intended, while correcting the errors the author made.
There is a DTD for validating documents, but most (if not all) websites
just don't care about this, and almost no website validates against the DTD.
So I had to handle so many errors that I'm not even sure I'm correct by now.
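
To give an idea of the kind of error correction involved, here is a
deliberately simplified sketch of a stack of open elements. It is not
UZI's recovery code: the class name OpenElements and the rule that only
<p> and <li> close a same-named open element are assumptions made for
the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stack of currently open elements used while building a
// DOM tree from invalid markup.
class OpenElements {
public:
    // Handle a start tag. Returns the tags that were implicitly closed.
    std::vector<std::string> startTag(const std::string& tag) {
        std::vector<std::string> closed;
        // Assumption: a new <p> or <li> closes an open one of the same name.
        if ((tag == "p" || tag == "li") && !stack_.empty() &&
            stack_.back() == tag) {
            closed.push_back(stack_.back());
            stack_.pop_back();
        }
        stack_.push_back(tag);
        return closed;
    }

    // Handle an end tag. Returns every tag that gets closed, innermost
    // first; the result is empty when the end tag matches nothing.
    std::vector<std::string> endTag(const std::string& tag) {
        std::vector<std::string> closed;
        if (std::find(stack_.begin(), stack_.end(), tag) == stack_.end())
            return closed;  // stray end tag: ignore it
        while (!stack_.empty()) {
            std::string top = stack_.back();
            stack_.pop_back();
            closed.push_back(top);
            if (top == tag) break;
        }
        return closed;
    }

    std::size_t depth() const { return stack_.size(); }

private:
    std::vector<std::string> stack_;
};
```

Fed with <div><p><p></span><b></div>, this sketch closes the first <p>
when the second one starts, ignores the stray </span>, and closes <b>,
<p> and <div> together on </div>.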

The current state is:
  - The browser actually renders web pages, like any non-CSS browser, and
allows browsing (well... sort of, as forms aren't POSTed yet)
  - I still need to hook the CSS parsing code up to the CSS properties
  - I still need to correct some rendering errors in the CSS mapper
  - I still need to find a way to map CSS rules to DOM nodes very, very fast.
  - I still have to choose a JavaScript engine, and bind it to the DOM
interface (my current choice would probably be SEE).
  - I still have to write tests for all the above.
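
On the point about mapping CSS rules to DOM nodes quickly, one widely
used idea is to index rules by the rightmost part of their selector
(tag, id or class), so that each node only looks at a few candidate
rules. The sketch below is an assumption for illustration, not something
UZI does today; RuleIndex, Rule and Element are hypothetical names, and
the candidates would still need a full selector match and cascade
ordering afterwards.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, minimal stand-ins for a DOM element and a CSS rule.
struct Element {
    std::string tag;
    std::string id;
    std::vector<std::string> classes;
};

struct Rule {
    std::string selector;  // full selector text, e.g. "div p.note"
    int order;             // position in the style sheet
};

// Index of rules keyed by the rightmost simple selector:
// "tag", "#id" or ".class".
class RuleIndex {
public:
    void add(const std::string& key, const Rule& rule) {
        buckets_[key].push_back(rule);
    }

    // Collect only the rules whose key matches this element's tag, id
    // or one of its classes; all other rules are never visited.
    std::vector<Rule> candidates(const Element& e) const {
        std::vector<Rule> out;
        append(out, e.tag);
        if (!e.id.empty()) append(out, "#" + e.id);
        for (const std::string& c : e.classes) append(out, "." + c);
        return out;
    }

private:
    void append(std::vector<Rule>& out, const std::string& key) const {
        auto it = buckets_.find(key);
        if (it != buckets_.end())
            out.insert(out.end(), it->second.begin(), it->second.end());
    }

    std::unordered_map<std::string, std::vector<Rule>> buckets_;
};
```

The lookup cost then depends on the number of classes on the node and
the size of the matching buckets rather than on the total number of
rules in the style sheet.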

I'm amazed to see Themis starting to work again. I don't follow Haiku that
much, but I do think it's good to have choice.
I met Kevin on the SeeDev mailing list, and he told me about your great news.

I'm not going to tell you "trash themis, use UZI's renderer, you'll have
results in 2 weeks".
I'm proud of my code and I suppose you are proud of yours.

Anyway, I'm telling you about my experience so you can avoid falling
into the same traps.
Here is a copy of a discussion I had with Kevin:

Kevin wrote:
>Unlike most open source projects you will find for other Operating
>Systems, Haiku projects can be quite different.
>The design of the BeOS (and now Haiku) focused on performance and
>providing a great API.
>As I understand it, designing a native BeOS and Haiku app almost
>forces you to develop applications in a multi-threaded manner (though
>you can get around this).
>So, although you can port applications from another OS to Haiku, you
>will not get the benefits of developing a native app unless you take the
>time to make things more multi-threaded as you move to Haiku.
>So, in the case of Themis, and Haiku applications, it is almost a bad
>idea to try for cross-platform.
>The goals are to develop high performance apps, based on a cleaner
>API, and taking advantage of Haiku's features.
>So, often this means developing a native Haiku app first.
>
>As for Themis, the design concepts for the project seems rather
>interesting:
>The core of Themis has a multi-threaded approach using messaging.
>As one person described it, each component broadcasts a message that
>gets picked up by the appropriate component and gets used.
>For example, as he described it, a user types in an address and the
>address bar broadcasts a message about the address typed in...
>the appropriate component picks that up (whether that be ftp or http
>handling or whatever).
>The http handler would get the data and broadcast the message that it
>has the data.
>Then the appropriate component (say the parser to DOM) picks that up
>and so on....
>I would think going back to any other less threaded solution would be
>a step backwards for Themis.
>If anything from UZI was integrated, it would have to incorporate this
>multi-threaded approach using messaging (which I suspect is quite
>possible).
>It would also need to take advantage of Haiku's APIs and widgets, as
>opposed to using cross-platform ones, where possible (not sure how
>difficult this is).

I answered:
>I've used this programming scheme before (well, Win32 message
>programming I mean).
>However I've dropped the idea of message-based task management.
>If you think about it, the application is always doing the same
>tasks in the same order (HTML fetching -> HTML parsing -> DOM CSS mapping
>-> CSS box rendering) every time.
>This means that you pay the message queue cost between every subtask
>while you really don't need to (no one else is using this message anyway).
>Sure it looks like your application is more modular, but in reality
>it's not, because all modules still depend on each other (structure
>declarations, for one, when passing objects in messages), and it adds
>a useless dependency on the OS messaging queue.
>
>In the UZI application part, I'm using a thread pool.
>Each thread can execute jobs (like fetching documents from a server).
>However, from the parsing to the mapping it's a single job.
>
>I'm more concerned about memory handling than about making the parser
>run in a separate thread from the renderer.
>UZI memory allocation all goes through a specific interface
>(BaseAllocator) that can decide to reuse objects and so on.
>There is more performance to gain from correct memory allocation
>than from multithreading something that is inherently linear and
>sequential.
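
The thread pool scheme described in that answer can be sketched as
follows. This is an illustration using C++11 standard threads, not the
Juce-based code UZI actually uses; ThreadPool, submit and parseThenMap
are hypothetical names, and parseThenMap is only a stand-in that counts
characters.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool: worker threads pull jobs from one queue.
class ThreadPool {
public:
    explicit ThreadPool(unsigned count) {
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Waits for all queued jobs to finish, then joins the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock,
                           [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Stand-in for "parse, then map to CSS boxes" as ONE job: both stages
// run back to back on the same worker, with no queue in between.
int parseThenMap(const std::string& html) {
    int nodes = 0;
    for (char c : html)
        if (c == '<') ++nodes;  // pretend each '<' yields one DOM node
    return nodes * 2;           // pretend each node maps to two boxes
}
```

Each fetched document would then be submitted as a single job that
calls parseThenMap, so the only queue in the whole chain is the pool's
own job queue.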

Concerning the license though, Kevin and I have discussed it a lot.
Initially this is what I want to ensure:
1) UZI is open-source (it'll never become closed source, à la SourceForge)
2) Anyone is free to use it in an open-source project, provided they
don't change the license and send any changes they make back to us.
3) There is no restriction on the other licenses in the project you use
UZI in (it's not viral)
4) UZI can be used in a commercial application by an organization,
provided the organization releases the source code of UZI to its
customers, and sends the changes it has made to UZI back to us
5) If UZI is used in a closed-source product, the organization needs to
buy a license. The license money will be shared between UZI contributors
in proportion to their contributions, as fair compensation for their
hard work
6) UZI can't be relicensed differently unless all contributors agree to
do so.

I haven't found exactly these license terms anywhere. Initially, I've put
UZI under dual licensing: GPL (for the restrictive part it includes), and
a commercial license (like Trolltech or Juce). I don't like the current
terms, as they're very restrictive and don't allow for external contributors
(unless they want to work for free and have their name on top of a
file, but I don't believe this happens). If any of you have a better idea
for the license, please reply to me, and I'll talk with you directly.

UZI is currently my own work, 100% personal, so I can decide to license
it the way I want (for now). Later on, that might not be the case.
Please dig around in the UZI source code (
http://tools.assembla.com/uzi/browser/trunk/include and
http://tools.assembla.com/uzi/browser/trunk/src ) to pick out
anything you might be interested in.

OK, that's it.
I look forward to your answers.

Cheers,
Cyril