From: Cyril R. <sta...@la...> - 2007-09-20 14:34:02
Hi all,

As Kevin told you earlier, I'm the main developer of another project called UZI. I won't go into too much detail about UZI itself; I'll try to describe what I've done, my choices, and my future commitments.

In January 2007, I had an idea for a new kind of "office" software. I started documenting it: the functionality, the requirements, and so on. After some time (I can explain what UZI is by private email, as it's off topic for Themis), I came to the conclusion that I needed an HTML renderer. So I started searching the internet for every renderer source code I could find.

Obviously, I came to the very common ones:
- WebKit
- Gecko

I also dug up links to the outsiders:
- Themis
- CHTMLViewer
- Tcl/Tk HTML
- Links
- All those based on GLib

I looked at the sources and came to the following conclusions:
- WebKit: cross-platform, dependent on Quartz or, for the KHTML engine, on Qt. C++, easy to understand, quite clean, maintained. Not vector-based rendering.
- Gecko: cross-platform, dependent on too many libraries, huge. C++, but using extensions like XPCOM.
- Themis: dependent on the BeOS API, quite clean, not supported anymore. C++, easy to understand.
- CHTMLViewer: dependent on MFC, could easily be changed to plain Win32 code, but specific to Win32 anyway. C++ code, but messy.
- Tcl/Tk HTML: cross-platform, but dependent on Tcl/Tk, which I don't know.
- others...

I wanted UZI to be cross-platform, so I only had 3 choices (actually 4, if including Themis with some rewriting). I excluded WebKit and Gecko: the former because of the license of its dependencies, the latter because of its dependencies themselves. I decided not to learn another programming language, so I excluded the Tcl/Tk browser too (which is CSS 2 compliant and passes the Acid2 test, BTW).

So I finally started to look more closely at Themis in March 2007. I checked out the code.
The Boost dependency was very unexpected (it doesn't compile everywhere, and it slows down any compilation by a factor of at least 2). The other dependencies were cryptlib and libz. The JavaScript engine looked like Mozilla's.

I then dug into the modules folder and started examining the code there. The HTML parser used a lot of heap allocation (new ...) without an obvious object life cycle (where does the object go, is it shared, and so on). Sure, the developers relied on Boost's shared_ptr for deallocation. The parser also defined its internal constants as char *, so it will break on Unicode content. I checked the fetching code, and it didn't convert to UTF-8, so I assumed it was a work in progress.

The parser was cleanly structured, with every kind of element in a class hierarchy. However, the class hierarchy didn't look that obvious to me. I would have expected an HTMLParser to be an SGMLParser (as HTML is an application of SGML). Instead, HTMLParser depends on BeOS code (why?), deals with OS-specific tasks and message handling (why?), and in fact doesn't parse anything. The DOM tree is correct (in fact, it's a very clean implementation of the DOM interfaces). The renderer was nonexistent (no rendering at all; it even lacked the CSS box model algorithm to map DOM nodes to boxes and reflow them).

As I hadn't taken the time to understand the structure of the code, creating a renderer from the Themis project would have required studying it in detail and rewriting a lot of BeOS-specific code (probably even rewriting the hierarchy).

So finally, I decided to roll my own. I favored compliant, memory-clean code over speed. I started by implementing the DOM v2.1 interfaces in UZI.
Then came a charset-aware SGML parser (even though only UTF-8 and Latin-1 are supported for now). Then I added an HTML parser to produce a DOM tree. I then added a CSS box model mapper (including CSS type parsing and structures) and table rendering. I finally implemented a renderer for all the CSS boxes produced by the mapper. I've added a CacheEngine with fast lookup too.

My design included allocators from the beginning (meaning there is no "new" in the code except in the allocators, and every object is returned to the allocator when required). This means that I'll be able to reuse returned objects instead of calling the heap again (a page like news.google.fr can take up to 12 MB of memory just for storing the pictures and text).

I also added test vectors from the beginning (I'm amazed to see that there are still so many projects around that don't even test their code's correctness), so I'm sure the DOM implementation works 100% as expected. The HTML parser works 100% as expected on VALID documents with a supported charset. Testing of the CSS mapping isn't conclusive yet (and testing it automatically is very difficult), so there is no test vector for it and, obvious as it may seem, most of the issues I had (and still have) come from exactly that part.

I wanted the design to be as platform independent as possible. As such, from the HTML parser up to the boxes to render, there is no platform-specific code. There is no dependency at all except the C library (not even on the STL, as cross-platform STL code doesn't work). I've reduced the renderer to a single 14-method interface, so implementing a renderer only requires implementing 14 methods (for measuring text, images, replaced content...). The HTML parser takes a stream as input, and it currently accepts UTF-8 streams, but adding charsets is foreseen (it requires implementing a 28-method interface).

Having done this, I needed to check all that code in a real-world scenario.
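
To illustrate the allocator idea described above (objects are handed back to an allocator and reused instead of going to the heap again), here is a minimal C++ sketch. It is not taken from UZI's source: PoolAllocator, acquire, release and Box are hypothetical names chosen for the example, and it uses std::vector for brevity, whereas UZI is described as avoiding the STL.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical sketch, NOT UZI's BaseAllocator: a pool that keeps the
// storage of released objects and hands it out again on the next request.
// Assumes T has no extended alignment requirement.
template <typename T>
class PoolAllocator {
public:
    PoolAllocator() = default;
    PoolAllocator(const PoolAllocator&) = delete;
    PoolAllocator& operator=(const PoolAllocator&) = delete;

    ~PoolAllocator() {
        // Every object must have been released before the pool goes away.
        assert(live_ == 0);
        for (void* block : free_) ::operator delete(block);
    }

    // Construct a T, reusing a previously released block when one exists.
    template <typename... Args>
    T* acquire(Args&&... args) {
        void* block = nullptr;
        const bool recycled = !free_.empty();
        if (recycled) {
            block = free_.back();
            free_.pop_back();
        } else {
            block = ::operator new(sizeof(T));  // only place the heap is hit
        }
        try {
            T* object = new (block) T(std::forward<Args>(args)...);
            if (recycled) ++reused_;
            ++live_;
            return object;
        } catch (...) {
            ::operator delete(block);  // constructor failed: drop the block
            throw;
        }
    }

    // Destroy the object but keep its storage for later reuse.
    void release(T* object) {
        assert(object != nullptr && live_ > 0);
        object->~T();
        free_.push_back(object);
        --live_;
    }

    std::size_t reusedCount() const { return reused_; }
    std::size_t liveCount() const { return live_; }

private:
    std::vector<void*> free_;   // blocks waiting to be reused
    std::size_t reused_ = 0;    // how many acquisitions avoided the heap
    std::size_t live_ = 0;      // objects currently handed out
};

// Hypothetical stand-in for a layout box produced by a CSS mapper.
struct Box {
    int width;
    int height;
    Box(int w, int h) : width(w), height(h) {}
};
```

With such a pool, a caller would obtain a box with pool.acquire(10, 20) and hand it back with pool.release(box); the next acquire then reuses the same storage rather than allocating again.
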
I've chosen Juce ( http://www.rawmaterialsoftware.com/juce ) as the cross-platform engine because it natively supports vector graphics (this is needed for UZI) and it's really, really clean C++. I've then implemented the renderer in a Juce component (the total renderer code is 2k lines, including my own Juce code for rendering). I've implemented an HTTP client using Juce primitives, a thread pool for fetching jobs, a DOM viewer, an error console, and finally an almost fully functional browser.

Phew, here we are. I needed a browser to test the renderer on real-world code. I started testing sites around, and got both very pleasant and unpleasant surprises. The hardest part is probably creating a DOM tree as the author intended while correcting the errors the author made. There is a DTD for validating documents, but most (if not all) websites just don't care about it, and almost no website validates against the DTD. So I had to handle so many errors that I'm not even sure I'm correct by now.

The current state is:
- The browser actually renders web pages like any non-CSS browser and lets you browse (well... sort of, as forms aren't POSTed yet)
- I still need to integrate the CSS parsing code with the CSS properties
- I still need to correct some rendering errors in the CSS mapper
- I still need to find a way to map CSS rules to DOM nodes very, very fast
- I still have to choose a JavaScript engine and bind it to the DOM interface (my current choice would probably be See)
- I still have to write tests for all of the above

I'm amazed to see Themis starting to work again. I don't follow Haiku that much, but I do think it's good to have choice. I met Kevin on the SeeDev mailing list, and he told me about your great news.

I'm not going to tell you "trash Themis, use UZI's renderer, you'll have results in 2 weeks". I'm proud of my code and I suppose you are proud of yours. Anyway, I'm telling you about my experiences so you can avoid falling into the same traps.
I'm copying a discussion we had with Kevin:

Kevin wrote:
>Unlike most open source projects you will find for other Operating Systems, Haiku projects can be quite different.
>The design of the BeOS (and now Haiku) focused on performance and providing a great API.
>As I understand it, designing a native BeOS and Haiku app almost forces you to develop applications in a multi-threaded manner (though you can get around this).
>So, although you can port applications from another OS to Haiku, you will not get the benefits of developing a native app unless you take the time to make things more multi-threaded as you move to Haiku.
>So, in the case of Themis, and Haiku applications, it is almost a bad idea to try for cross-platform.
>The goals are to develop high performance apps, based on a cleaner API, and taking advantage of Haiku's features.
>So, often this means developing a native Haiku app first.
>
>As for Themis, the design concepts for the project seems rather interesting:
>The core of Themis has a multi-threaded approach using messaging.
>As one person described it, each component broadcasts a message that gets picked up by the appropriate component and gets used.
>For example, as he described it, a user types in an address and the address bar broadcasts a message about the address typed in...
>the appropriate component picks that up (whether that be ftp or http handling or whatever).
>The http handler would get the data and broadcast the message that it has the data.
>Then the appropriate component (say the parser to DOM) picks that up and so on....
>I would think going back to any other less threaded solution would be a step backwards for Themis.
>If anything from UZI was integrated, it would have to incorporate this multi-threaded approach using messaging (which I suspect is quite possible).
>It would also need to take advantage of Haiku's APIs and widgets, as opposed to using cross-platform ones, where possible (not sure how difficult this is).

I answered:
>I've used this programming scheme before (well, Win32 message programming I mean).
>However, I've dropped the idea of message-based task management.
>If you think about it, the application is always doing the same tasks in the same order (HTML fetching -> HTML parsing -> DOM CSS mapping -> CSS box rendering) every time.
>This means that you pay the message queue cost between every subtask while you really don't need to (no one else is using these messages anyway).
>Sure, it looks like your application is more modular, but in reality it's not, because all modules still depend on each other (the structure declarations, for one, when passing objects in messages), and it adds a useless dependency on the OS message queue.
>
>In the UZI application part, I'm using a thread pool.
>Each thread can execute jobs (like fetching documents from a server).
>However, from the parsing to the mapping it's a single job.
>
>I'm more concerned about memory handling than about making the parser run in a separate thread from the renderer.
>All UZI memory allocation goes through a specific interface (BaseAllocator) that can decide to reuse objects and so on.
>There is more performance to gain from correct memory allocation than from multithreading something that is inherently linear and sequential.

Concerning the license though, I have discussed it a lot with Kevin. Initially, this is what I want to ensure:
1) UZI is open source (it'll never become closed source, à la Sourceforge)
2) Anyone is free to use it in an open-source project, provided they don't change the license and send back any changes they make to us.
3) There is no restriction on the other licenses in the project you use UZI in (it's not viral)
4) UZI can be used in a commercial application by an organization, provided the organization releases the source code of UZI to its customers and sends back the changes it has made to UZI to us
5) If used in a closed-source product, the organization needs to buy a license.
   The license money will be shared between UZI contributors in proportion to their contributions, as fair compensation for their hard work
6) UZI can't be relicensed differently unless all contributors jointly agree to do so.

I haven't found exactly these license terms anywhere. Initially, I've put UZI under dual licensing: GPL (for the restrictive part it includes) and a commercial license (like Trolltech or Juce). I don't like the current terms, as they're very restrictive and don't allow external contributors (unless they want to work for free and have their name on top of a file, but I don't believe that happens).

If any of you has a better idea for the license, please answer me; I'll talk with you directly. UZI is currently my own work, 100% personal, so I can decide to license it the way I want (for now). Later on, that might not be the case.

Please dig around in the UZI source code ( http://tools.assembla.com/uzi/browser/trunk/include and http://tools.assembla.com/uzi/browser/trunk/src ) if you want to pick out anything you might be interested in.

OK, that's it. I look forward to your answers.

Cheers,
Cyril