From: Mark H. <ma...@fi...> - 2007-09-23 21:47:41
>Hi all,

Hi

[snip]

>So I finally started to look more closely into Themis by march 07. I
>checked out the code. The boost dependency was very unexpected (as it's
>not compiling everywhere, and slow any compilation by a factor of at
>least 2) Other dependencies were cryptlib, and libz. The javascript
>engine looked like those of Mozilla. I then dug into the modules folder;
>and started examining the code down there. The HTML parser used a lot of
>heap allocation (new ...), without obvious object life cycle (where does
>the object goes, is it shared, and so on...). Sure the dev relied on
>boost shared_ptr for their deallocation. The parser also defined it's
>internal constant as char *, so it will be broken with unicode content
>down there. I've checked the fetching code, and it didn't convert to
>UTF-8, so I thought it was WIP. The parser was cleanly structured between
>every kind of elements as a class hierarchy. However, the class hierarchy
>didn't looked that obvious to me.
>
>I would have expected an HTMLParser to be a SGMLParser (as HTML is a part
>of SGML). In place of this, HTMLParser is dependent on BeOS code (why ?),
>deals with OS specific task and message handling (why ?) And in fact it
>doesn't parse anything.

Let me explain a bit about the HTML parser. It is indeed an SGML parser
and it does work. You can feed it any document for which you have a DTD,
and it should create a DOM tree. I know we will get trouble with Unicode,
but we will cross that bridge when we get there. The BeOS-specific code
is only there to hook it up to the Themis messaging system. I have the
SGML parser working on my Irix box as well.

>The DOM tree is correct (in fact it's very clean implementation) of DOM
>interfaces. The renderer was inexistant (no rendering at all, in fact, it
>even lacked the CSS box model's algorithm to map DOM nodes to boxes, and
>reflow them).
>
>As I didn't took time to understand the structure of the code, creating a
>renderer from Themis project would have required studying this in
>details, and rewriting a lot of BeOS specific code (even probably
>rewriting the hierarchy).
>
>So finally, I decided to roll my own. I priviledged compliant and memory
>clean code over speed. I started to implement DOM v2.1 interfaces in UZI.
>Then an charset aware SGML parser (even through only UTF8 and latin1 is
>supported for now) Then I added an HTML parser to produce a DOM tree I
>then added an CSS box model mapper (including CSS types parsing, and
>structures), and table rendering I finally implemented a renderer for all
>the CSS boxes spit from the mapper. I've added a CacheEngine with fast
>lookup too.

That is a very impressive amount of functionality.

>My design included allocators from the beginning (meaning there is no
>"new" in the code except in the allocators, and every object is returned
>to the allocator when required). This means that I'll be able to reuse
>returned objects instead of calling the heap again (a page like
>news.google.fr can take up to 12MB of memory just for storing the picture
>and text). I also added test vectors from the beginning (I'm amazed to
>see that they are still some many projects around that doesn't even test
>their code correctness), so I'm sure the DOM implementation works 100% as
>expected. The HTML parser works 100% as expected on VALID documents with
>supported charset. The testing the CSS mapping isn't concluant yet (and
>testing them automatically is very difficult), so there is no test vector
>for it, and, as obvious as it may seems, most of the issue I had (and
>still have) come from that part exactly.

Ah, testing. Yes, that is always a bit of a forgotten part of most
projects. I try to do the basic tests, but I haven't gotten around to
writing more extensive tests.

[snip Uzi design]

>Ouf, here we are. I needed a browser to test the renderer on real world
>code.
>I started testing sites around, and got very pleasant and
>unpleasant surprises. The hardest part is probably creating a DOM tree
>like the author intended, but correcting the errors the author made.
>There is a DTD for validating documents, but most (if not all) website
>just don't care about this, and almost no website validates the DTD. So I
>had to handle so many errors, I not even sure I'm correct by now.

Yeah, it's probably the fault of the early browsers that just accepted
anything. I'm hoping to build good error reporting into the Themis HTML
parser, so website designers using Themis to test their websites can
easily spot mistakes in their design.

>The current state is:
>- The browser actually renders web pages, like any non CSS browser and
>  allow to browse (well... sortof as form aren't POSTed yet)
>- I still need to include the CSS parsing code to CSS properties
>- I still need to correct some rendering error on the CSS mapper
>- I still need to find a way to map CSS rules to DOM node very very fast.
>- I still have to choose a Javascript engine, and bind it to the DOM
>  interface (my current choice would probably be See).
>- I still have to write tests for all the above.

That is very impressive.

>I'm amazed to see Themis starting to work again. I don't follow that much
>Haiku, but I do think it's good to have choice. I've met Kevin on SeeDev
>mailing list, and he told me about your great news.

Well, I'm not sure yet whether to pick it up again, but we'll see.

>I'm not going to tell you "trash themis, use UZI's renderer, you'll have
>results in 2 weeks". I'm proud of my code and I suppose you are proud of
>yours.

Depends on which part of the code we are talking about. :)

[snip Kevin's explanation of Themis]

>I answered:
>>I've used this programming scheme before (well, Win32 message
>>programming I mean).
>>However I've dropped the idea of message based task management.
>>If you think about this, the application is only always doing the same
>>task in the same order (HTML fetching -> HTML parsing -> DOM CSS mapping
>>-> CSS box rendering) everytime.
>>This means that you pay the message queue cost between every subtask
>>while you really don't need to (no one else is using this message
>>anyway).
>>Sure it looks like your application is more modular, but in reality it's
>>not, because all modules still depends on each other (structure
>>declaration for one when passing objects in messages), and it adds
>>useless the OS messaging queue dependency.

I disagree with that line of thought. I think the messaging overhead is
minimal, even on my old computers. You also seem to think that there is
only one path from start to wherever you want to go, but that is not
true either. Several parts of the system can be interested in the "next
step". It is also very easy to drop in a different module or leave out a
module you don't need. Messaging also makes it easier to queue up tasks.

>
>>In UZI application part, I'm using a thread pool.
>>Each thread can execute jobs (like fetching documents from server).
>>However, from the parsing to the mapping it's a single job.
>>
>>I'm more concerned about memory handling than making the parser run in a
>>separate thread than the renderer.
>>UZI memory allocation all goes to a specific interface (BaseAllocator)
>>that can decide to reuse objects and so on.
>>There is more performance to gain from a correct memory allocation than
>>from multithreading something that is inherently linear and sequential.

That is an interesting approach. Do you have some numbers or something
like that to back that up? I'm not saying I don't believe you. I'd just
like to know more about it.

Mark

--
Spangalese for beginners:
`Wiggilo wagel hoggle?'
`Where can I scrub my eyeballs?'
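
To make the allocator idea quoted above a bit more concrete (objects are
handed back to an allocator and reused instead of going to the heap
again), here is a minimal C++ sketch. Node, NodePool, acquire() and
release() are hypothetical names chosen for illustration. The thread
only mentions an interface called BaseAllocator and does not show its
API, so this should not be read as UZI's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a DOM node or CSS box.
struct Node {
    int tag = 0;
};

// Minimal free-list pool (an assumption, not UZI's BaseAllocator).
// It is the only place that heap-allocates a Node; callers never
// use new or delete on Node themselves.
class NodePool {
public:
    // Hand out a recycled object if one is available, else allocate.
    Node* acquire() {
        if (!free_.empty()) {
            Node* n = free_.back();
            free_.pop_back();
            *n = Node{};  // reset to a clean state before reuse
            return n;
        }
        owned_.push_back(std::make_unique<Node>());
        ++heap_allocations_;
        return owned_.back().get();
    }

    // Return an object to the pool instead of deleting it.
    void release(Node* n) {
        assert(n != nullptr);
        free_.push_back(n);
    }

    // Number of Node objects that were actually taken from the heap.
    std::size_t heap_allocations() const { return heap_allocations_; }

private:
    std::vector<std::unique_ptr<Node>> owned_;  // keeps every node alive
    std::vector<Node*> free_;                   // nodes ready for reuse
    std::size_t heap_allocations_ = 0;
};
```

Releasing a node and then acquiring again hands back the same object, so
heap_allocations() stays unchanged for that second request. Whether this
beats a general-purpose allocator in practice would still need the kind
of measurements asked for above.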