From: Derek H. <lor...@ms...> - 2002-10-09 02:33:50
First, hello everyone. I've joined the group to help develop the
Microsoft Office integration components, or at least get them
kick-started. :-)

"Ashwini Kumar" wrote on 2002-10-02,
> I think we need to decide how complicated we would like the search
> for eDocs to get. Maybe for the time being we should simply stick to
> the RDF framework and make sure that the document creator
> associates a metadata with the document.

I agree with your proposal of RDF for the time being, Ashwini.

WHY RDF?

RDF is a mature standard and better understood than ontologies. It's
easier (from a software development standpoint) to employ a simpler
metadata approach.

I'd like to see a first working release of eDocs as soon as possible,
versus a full-featured release that bogs down in complexity and which
might not "get out the proverbial door." That's why I'd support a
minimal feature set for Version 1, so that Versions 2, 3, etc., can
evolve, with help from a supportive user community, into the eDocs
Document Management System Sergio has envisioned.

THE MICROSOFT OFFICE INTEGRATION VIEW: A METADATA SEARCH IS BETTER

A metadata search approach is easier to implement, less demanding of
network bandwidth in the size of serialized SOAP requests, and would
be much more reliable across the breadth of "document" types business
users will use eDocs with from the Microsoft Office suite of products.

Speaking from my investigation of a Microsoft Word Add-in, to store
and retrieve a Word document, it would be easiest to provide metadata
(the fields a user optionally fills in as the Document's Properties in
Word: title, keywords, author, manager, version, etc.) and then a
BASE64 octet-stream that could be stored on the server as an opaque
package ("black box"). This wouldn't facilitate full-text search
though, and serializing the Word Document's Object Model would be
extraordinarily heavyweight (also brittle across different releases of
Word, and labor-intensive to code and maintain pre-.NET).
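
To make the "metadata plus opaque package" idea concrete, here is a
rough sketch in Python (purely for illustration; the real Add-in would
be written against Word's COM/VBA interfaces and would send SOAP). The
function names and the dictionary layout are hypothetical, not an
agreed eDocs format.

```python
import base64


def build_store_request(doc_bytes, title, author, keywords, version=""):
    """Bundle document properties with an opaque Base64 payload.

    Hypothetical layout: eDocs has not defined a wire format yet.
    """
    return {
        "metadata": {
            "title": title,
            "author": author,
            "keywords": list(keywords),
            "version": version,
        },
        # The server never looks inside this; it is stored as a black box.
        "content": base64.b64encode(doc_bytes).decode("ascii"),
    }


def extract_document(request):
    """Recover the original bytes from the opaque payload."""
    return base64.b64decode(request["content"])


def matches_keyword(request, keyword):
    """Metadata-only search: compare against keywords, never the payload."""
    wanted = keyword.lower()
    return any(k.lower() == wanted for k in request["metadata"]["keywords"])
```

Note that the search function only ever touches the metadata, which is
exactly why text inside the document itself stays unsearchable under
this approach.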

I can foresee sending metadata, the bare text of the Word document as
a (very long) xsd:string element so eDocs can search the bare,
unadorned text, and an octet-stream (which would still be stored as a
black box).

This may not be as applicable to other Microsoft Office applications,
like Visio diagrams [e.g., I want to search for all UML Class Diagrams
in eDocs concerning class name "*LexicalHandler".] Actually, Visio
2000 doesn't expose that in its Object Model (I spent a weekend trying
to get at it to write a code-gen Add-in), though it may be possible to
get at it after the internal COM object Visio uses to handle UML
serializes itself into the .VSD file format. In that case, I wouldn't
want the octet-stream to be opaque to the repository. :-)

Or, I might use the VBA FileSystemObject in the Add-in to save the
diagram to disk in a TMP folder, scan it for text, and forward that
text separately to be searchable (then the octet-stream could be
opaque; I suppose we want to shoot for consistency there...)

In any event, I definitely see issues with searching Visio diagrams
for text. Support would be incomplete, at best. (My recommended
practice: if a user has a UML Class Diagram in Visio 2000, he or she
must list all applicable Class Names as metadata keywords to search
for them reliably. Visio XP, I think, can export UML to XMI; it might
need a Microsoft patch to do it, but that could work better.)

PROBLEM WITH ADDING FULL-TEXT SEARCH LATER

On the other hand (isn't it awful, having two hands? ;-) ), the one
problem with adding the sophisticated, full-text search capability
later is Migration for upgrading users. We would need to either:

1. Provide a Migration Plan for going from a keyword-oriented search
   facility to a full text-oriented search facility in the future.
   This would probably involve something tantamount to checking-out
   and checking-in all revisions (or having smart differencing) of
   all (the latest) documents. :-(

2. Not support migration formally.
   Perhaps allow existing documents in a company's repository
   catalogued with keyword search to remain available for keyword
   searches, but be excluded from newer comprehensive text searches.
   But for a pre-existing document revision to be subject to
   comprehensive text search in an upgrading organization, the user
   would have to check-out and check-in under the new version.

So early adopters could have a problem going from simpler metadata to
more comprehensive metadata in the future. Administrators will be
unhappy if it's not easy to migrate.

CONCLUSION

I'd still choose the first hand: RDF and searching on metadata about
repository documents to begin with. It increases the likelihood eDocs
1.0 happens (if eDocs 1.0 doesn't happen, there'll be no first release
to migrate from and so nobody has that problem).

Derek Harmon
sto...@us...