[Ebiness-crawler] Storage & parsers
From: Mike D. <md...@3d...> - 2001-06-01 23:14:18
Hi guys,

Well, time to finally get down to some dirty work!

Sellaro and Allan, could you guys start working on a storage system that we can plug into the existing crawler? We've spoken about a few options, namely B-trees, ReiserFS and hashing algos... I think we should go with Sellaro's plan and use some form of B-tree, if you guys are happy with that? The basic interface should be something like the 'StorageAdapter' class in the 'include/storageadapter.h' file.

Mari, I like that idea about using lexx, although I've never played with it - could you have a play? Basically all we need is something that can parse an XML document into a tree (this includes XHTML), and that can also handle HTML's silliness, like the <p> tag that doesn't get closed... It's another thing that we can plug into the existing crawler and run comparative tests on.

I'll have a go at removing the database calls and replacing them with calls to a StorageAdapter. In the meantime, I'll make it use a GDBM file as the back end.

Mike
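
To make the pluggable storage idea concrete, here is a rough sketch of what such an interface might look like. The contents of 'include/storageadapter.h' are not shown in this thread, so the method names and signatures (put, get, remove), the MemoryStorage class and the checkRoundTrip helper are all assumptions made for illustration. The stand-in back end uses std::map, which is an in-memory ordered tree rather than a B-tree or a GDBM file; a real back end would implement the same three calls.

```cpp
// Hypothetical sketch only: the real include/storageadapter.h is not shown
// in this thread, so every name and signature here is an assumption.
#include <cassert>
#include <map>
#include <string>

// Abstract key/value store the crawler would call instead of a database.
class StorageAdapter {
public:
    virtual ~StorageAdapter() {}
    // Store value under key, replacing any previous value.
    virtual void put(const std::string& key, const std::string& value) = 0;
    // Copy the stored value into out; return false if the key is absent.
    virtual bool get(const std::string& key, std::string& out) const = 0;
    // Delete the key; return true if something was actually removed.
    virtual bool remove(const std::string& key) = 0;
};

// Stand-in back end built on std::map (an in-memory ordered tree, not a
// B-tree). A B-tree or GDBM back end would subclass StorageAdapter the same way.
class MemoryStorage : public StorageAdapter {
public:
    void put(const std::string& key, const std::string& value) override {
        data_[key] = value;
    }
    bool get(const std::string& key, std::string& out) const override {
        auto it = data_.find(key);
        if (it == data_.end()) return false;
        out = it->second;
        return true;
    }
    bool remove(const std::string& key) override {
        return data_.erase(key) > 0;
    }
private:
    std::map<std::string, std::string> data_;
};

// Back-end-agnostic check: it only sees the abstract interface, so the same
// function can be run against each candidate back end when comparing them.
void checkRoundTrip(StorageAdapter& store) {
    const std::string url = "http://example.org/";
    store.put(url, "<p>hello");
    std::string page;
    bool found = store.get(url, page);
    assert(found && page == "<p>hello");
    bool removed = store.remove(url);
    assert(removed);
    bool stillThere = store.get(url, page);
    assert(!stillThere);
}
```

Because the crawler code would depend only on StorageAdapter, swapping the temporary back end for a B-tree one later should not require touching the crawler itself, at least under the assumptions above.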